DPDK patches and discussions
* [dpdk-dev] [RFC] ethdev: introduce shared Rx queue
@ 2021-07-27  3:42 Xueming Li
  2021-07-28  7:56 ` Andrew Rybchenko
                   ` (16 more replies)
  0 siblings, 17 replies; 266+ messages in thread
From: Xueming Li @ 2021-07-27  3:42 UTC (permalink / raw)
  Cc: dev, Viacheslav Ovsiienko, xuemingl, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko

In the eth PMD driver model, each RX queue is pre-loaded with mbufs for
incoming packets. When the number of SFs or VFs scales out in a switch
domain, the memory consumption becomes significant. More importantly,
polling all ports leads to high cache miss rates, high latency and low
throughput.

To save memory and speed up polling, this patch introduces a shared RX
queue. Ports with the same configuration in a switch domain can share an
RX queue set by specifying the offload flag RTE_ETH_RX_OFFLOAD_SHARED_RXQ.
Polling any member port of a shared RX queue receives packets for all
member ports. The source port is identified by mbuf->port.

The number of queues must be identical for all ports in a shared group.
Queue indexes are mapped 1:1 within the group.

A shared RX queue is supposed to be polled on the same thread.

Multiple groups are supported through the group ID.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 lib/ethdev/rte_ethdev.c | 1 +
 lib/ethdev/rte_ethdev.h | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index a1106f5896..632a0e890b 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d2b27c351f..5c63751be0 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1373,6 +1374,12 @@ struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * RXQ is shared within ports in switch domain to save memory and avoid
+ * polling every port. Any port in group could be used to receive packets.
+ * Real source port number saved in mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
-- 
2.25.1



* Re: [dpdk-dev] [RFC] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
@ 2021-07-28  7:56 ` Andrew Rybchenko
  2021-07-28  8:20   ` Xueming(Steven) Li
  2021-08-09 11:47 ` [dpdk-dev] [PATCH v1] " Xueming Li
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 266+ messages in thread
From: Andrew Rybchenko @ 2021-07-28  7:56 UTC (permalink / raw)
  To: Xueming Li; +Cc: dev, Viacheslav Ovsiienko, Thomas Monjalon, Ferruh Yigit

On 7/27/21 6:42 AM, Xueming Li wrote:
> In eth PMD driver model, each RX queue was pre-loaded with mbufs for
> saving incoming packets. When number of SF or VF scale out in a switch
> domain, the memory consumption became significant. Most important,
> polling all ports leads to high cache miss, high latency and low
> throughput.
> 
> To save memory and speed up, this patch introduces shared RX queue.
> Ports with same configuration in a switch domain could share RX queue
> set by specifying offloading flag RTE_ETH_RX_OFFLOAD_SHARED_RXQ. Polling
> a member port in shared RX queue receives packets for all member ports.
> Source port is identified by mbuf->port.
> 
> Queue number of ports in shared group should be identical. Queue index
> is 1:1 mapped in shared group.
> 
> Shared RX queue is supposed to be polled on same thread.
> 
> Multiple groups is supported by group ID.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>

It looks like it could be useful for artificial benchmarks, but
absolutely useless in real life. SFs and VFs are used by VMs
(or containers?) to have their own part of the HW. If so, SF or VF
Rx and Tx queues live in a VM and cannot be shared.

Sharing makes sense for representors, but they are not mentioned in
the description.


* Re: [dpdk-dev] [RFC] ethdev: introduce shared Rx queue
  2021-07-28  7:56 ` Andrew Rybchenko
@ 2021-07-28  8:20   ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-07-28  8:20 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: dev, Slava Ovsiienko, NBU-Contact-Thomas Monjalon, Ferruh Yigit

Hi Andrew,

> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Wednesday, July 28, 2021 3:57 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dev@dpdk.org; Slava Ovsiienko <viacheslavo@nvidia.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Ferruh
> Yigit <ferruh.yigit@intel.com>
> Subject: Re: [RFC] ethdev: introduce shared Rx queue
> 
> On 7/27/21 6:42 AM, Xueming Li wrote:
> > In eth PMD driver model, each RX queue was pre-loaded with mbufs for
> > saving incoming packets. When number of SF or VF scale out in a switch
> > domain, the memory consumption became significant. Most important,
> > polling all ports leads to high cache miss, high latency and low
> > throughput.
> >
> > To save memory and speed up, this patch introduces shared RX queue.
> > Ports with same configuration in a switch domain could share RX queue
> > set by specifying offloading flag RTE_ETH_RX_OFFLOAD_SHARED_RXQ.
> > Polling a member port in shared RX queue receives packets for all member ports.
> > Source port is identified by mbuf->port.
> >
> > Queue number of ports in shared group should be identical. Queue index
> > is 1:1 mapped in shared group.
> >
> > Shared RX queue is supposed to be polled on same thread.
> >
> > Multiple groups is supported by group ID.
> >
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> 
> It looks like it could be useful to artificial benchmarks, but absolutely useless for real life. SFs and VFs are used by VMs (or containers?)
> to have its own part of HW. If so, SF or VF Rx and Tx queues live in a VM and cannot be shared.

Thanks for looking at this! Agreed, SFs and VFs can't be shared.

> 
> Sharing makes sense for representors, but it is not mentioned in the description.

Yes, the main target is representors, i.e. ports in the same switch domain. I'll emphasize this in the next version.
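
To illustrate the receive side described in the RFC (polling one member port returns packets for every member port, with the source in mbuf->port), here is a minimal sketch of how an application might sort a burst back to its source ports. It is not DPDK code: `struct sketch_mbuf`, `struct sketch_port_stats`, `SKETCH_MAX_PORTS` and `sketch_demux_burst()` are hypothetical stand-ins invented for this example, and only the `port` field mirrors what the RFC states.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PORTS 8 /* hypothetical upper bound for this example */

/* Hypothetical stand-in for struct rte_mbuf; only the fields used here. */
struct sketch_mbuf {
	uint16_t port;    /* source port, which the RFC says mbuf->port carries */
	uint32_t pkt_len; /* packet length in bytes */
};

/* Hypothetical per-port counters kept by the application. */
struct sketch_port_stats {
	uint64_t packets;
	uint64_t bytes;
};

/*
 * Account one burst, polled from any member port of a shared group,
 * to the port each packet actually arrived on.
 * Returns the number of packets whose port is out of range.
 */
static size_t
sketch_demux_burst(struct sketch_mbuf **pkts, size_t nb_pkts,
		   struct sketch_port_stats stats[SKETCH_MAX_PORTS])
{
	size_t unknown = 0;

	for (size_t i = 0; i < nb_pkts; i++) {
		uint16_t src = pkts[i]->port;

		if (src >= SKETCH_MAX_PORTS) {
			unknown++;
			continue;
		}
		stats[src].packets++;
		stats[src].bytes += pkts[i]->pkt_len;
	}
	return unknown;
}
```

The point of the sketch is only that the polled port no longer identifies the source, so any per-port logic in the application would have to key on the mbuf field instead.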


* [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
  2021-07-28  7:56 ` Andrew Rybchenko
@ 2021-08-09 11:47 ` Xueming Li
  2021-08-09 13:50   ` Jerin Jacob
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-08-09 11:47 UTC (permalink / raw)
  Cc: dev, xuemingl, Ferruh Yigit, Thomas Monjalon, Andrew Rybchenko

In the current DPDK framework, each RX queue is pre-loaded with mbufs for
incoming packets. When the number of representors scales out in a switch
domain, the memory consumption becomes significant. More importantly,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces a shared RX queue. Ports with the same configuration
in a switch domain can share an RX queue set by specifying a sharing group.
Polling any queue that uses the same shared RX queue receives packets from
all member ports. The source port is identified by mbuf->port.

The number of queues must be identical for all ports in a shared group.
Queue indexes are mapped 1:1 within the group.

A shared RX queue is supposed to be polled on the same thread.

Multiple groups are supported through the group ID.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 doc/guides/nics/features.rst                    | 11 +++++++++++
 doc/guides/nics/features/default.ini            |  1 +
 doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
 lib/ethdev/rte_ethdev.c                         |  1 +
 lib/ethdev/rte_ethdev.h                         |  7 +++++++
 5 files changed, 30 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index a96e12d155..2e2a9b1554 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same switch domain.
+
+* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 754184ddd4..ebeb4c1851 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c80..45bf5a3a10 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- Memory usage of representors is huge when the number of representors
+  grows, because the PMD always allocates an mbuf for each Rx descriptor.
+  Polling a large number of ports brings more CPU load, cache misses and
+  latency. An Rx queue can be shared between the PF and representors in
+  the same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ`` is reported
+  in the Rx offload capabilities of device info. Set the offload flag in
+  the device Rx mode or Rx queue configuration to enable a shared Rx queue.
+  Polling any member port of a shared Rx queue can return packets of all
+  ports in the group; the source port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 9d95cd11e1..1361ff759a 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d2b27c351f..a578c9db9d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1373,6 +1374,12 @@ struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * Rx queue is shared among ports in same switch domain to save memory,
+ * avoid polling each port. Any port in group can be used to receive packets.
+ * Real source port number saved in mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
-- 
2.25.1



* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-09 11:47 ` [dpdk-dev] [PATCH v1] " Xueming Li
@ 2021-08-09 13:50   ` Jerin Jacob
  2021-08-09 14:16     ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-09 13:50 UTC (permalink / raw)
  To: Xueming Li; +Cc: dpdk-dev, Ferruh Yigit, Thomas Monjalon, Andrew Rybchenko

On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In current DPDK framework, each RX queue is pre-loaded with mbufs for
> incoming packets. When number of representors scale out in a switch
> domain, the memory consumption became significant. Most important,
> polling all ports leads to high cache miss, high latency and low
> throughput.
>
> This patch introduces shared RX queue. Ports with same configuration in
> a switch domain could share RX queue set by specifying sharing group.
> Polling any queue using same shared RX queue receives packets from all
> member ports. Source port is identified by mbuf->port.
>
> Port queue number in a shared group should be identical. Queue index is
> 1:1 mapped in shared group.
>
> Share RX queue is supposed to be polled on same thread.
>
> Multiple groups is supported by group ID.

Is this offload specific to representors? If so, can the name be
changed to be representor-specific?
If it is for the generic case, how will flow ordering be maintained?

>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  doc/guides/nics/features.rst                    | 11 +++++++++++
>  doc/guides/nics/features/default.ini            |  1 +
>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
>  lib/ethdev/rte_ethdev.c                         |  1 +
>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
>  5 files changed, 30 insertions(+)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index a96e12d155..2e2a9b1554 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>
>
> +.. _nic_features_shared_rx_queue:
> +
> +Shared Rx queue
> +---------------
> +
> +Supports shared Rx queue for ports in same switch domain.
> +
> +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> +* **[provides] mbuf**: ``mbuf.port``.
> +
> +
>  .. _nic_features_packet_type_parsing:
>
>  Packet type parsing
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index 754184ddd4..ebeb4c1851 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>  Queue start/stop     =
>  Runtime Rx queue setup =
>  Runtime Tx queue setup =
> +Shared Rx queue      =
>  Burst mode info      =
>  Power mgmt address monitor =
>  MTU update           =
> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c80..45bf5a3a10 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>
> +- Memory usage of representors is huge when number of representor grows,
> +  because PMD always allocate mbuf for each descriptor of Rx queue.
> +  Polling the large number of ports brings more CPU load, cache miss and
> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> +  is present in Rx offloading capability of device info. Setting the
> +  offloading flag in device Rx mode or Rx queue configuration to enable
> +  shared Rx queue. Polling any member port of shared Rx queue can return
> +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> +
>  Basic SR-IOV
>  ------------
>
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 9d95cd11e1..1361ff759a 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -127,6 +127,7 @@ static const struct {
>         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
>         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
>  };
>
>  #undef RTE_RX_OFFLOAD_BIT2STR
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index d2b27c351f..a578c9db9d 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +       uint32_t shared_group; /**< Shared port group index in switch domain. */
>         /**
>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> +/**
> + * Rx queue is shared among ports in same switch domain to save memory,
> + * avoid polling each port. Any port in group can be used to receive packets.
> + * Real source port number saved in mbuf->port field.
> + */
> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
>
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> --
> 2.25.1
>


* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-09 13:50   ` Jerin Jacob
@ 2021-08-09 14:16     ` Xueming(Steven) Li
  2021-08-11  8:02       ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-09 14:16 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

Hi,

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Monday, August 9, 2021 9:51 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >
> > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > incoming packets. When number of representors scale out in a switch
> > domain, the memory consumption became significant. Most important,
> > polling all ports leads to high cache miss, high latency and low
> > throughput.
> >
> > This patch introduces shared RX queue. Ports with same configuration
> > in a switch domain could share RX queue set by specifying sharing group.
> > Polling any queue using same shared RX queue receives packets from all
> > member ports. Source port is identified by mbuf->port.
> >
> > Port queue number in a shared group should be identical. Queue index
> > is 1:1 mapped in shared group.
> >
> > Share RX queue is supposed to be polled on same thread.
> >
> > Multiple groups is supported by group ID.
> 
> Is this offload specific to the representor? If so can this name be changed specifically to representor?

Yes, the PF and representors in a switch domain can take advantage of it.

> If it is for a generic case, how the flow ordering will be maintained?

I'm not quite sure that I understood your question. The control path is almost the same as before:
PF and representor ports are still needed, and rte flows are not impacted.
Queues are still needed for each member port; descriptors (mbufs) will be supplied from the shared Rx queue
in my PMD implementation.
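
As a rough illustration of the per-port queue setup under this patch, the sketch below shows how an application might fill the Rx queue configuration for each member port: check the advertised capability, set the offload flag, and pick the group. It does not use real DPDK headers. `struct sketch_dev_info`, `struct sketch_rxconf` and `sketch_request_shared_rxq()` are hypothetical stand-ins for this example; only the flag value and the `shared_group` field are taken from the patch, and the 64-bit width of the offload masks is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag value as proposed in this patch; defined locally for the sketch. */
#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000ULL

/* Hypothetical stand-in for the capability field of rte_eth_dev_info. */
struct sketch_dev_info {
	uint64_t rx_offload_capa;
};

/* Hypothetical stand-in for the fields of rte_eth_rxconf used here. */
struct sketch_rxconf {
	uint32_t shared_group; /* field added by this patch */
	uint64_t offloads;
};

/*
 * Request a shared Rx queue in the given group for one member port.
 * Returns false and leaves rxconf untouched if the port does not
 * advertise the capability.
 */
static bool
sketch_request_shared_rxq(const struct sketch_dev_info *info,
			  uint32_t group, struct sketch_rxconf *rxconf)
{
	if ((info->rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) == 0)
		return false;

	rxconf->offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rxconf->shared_group = group;
	return true;
}
```

In this model the same call would be repeated, with the same group, for the queue at the same index on every member port, since queue indexes map 1:1 within a group.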

> 
> >
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > ---
> >  doc/guides/nics/features.rst                    | 11 +++++++++++
> >  doc/guides/nics/features/default.ini            |  1 +
> >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> >  lib/ethdev/rte_ethdev.c                         |  1 +
> >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> >  5 files changed, 30 insertions(+)
> >
> > diff --git a/doc/guides/nics/features.rst
> > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > --- a/doc/guides/nics/features.rst
> > +++ b/doc/guides/nics/features.rst
> > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> >
> >
> > +.. _nic_features_shared_rx_queue:
> > +
> > +Shared Rx queue
> > +---------------
> > +
> > +Supports shared Rx queue for ports in same switch domain.
> > +
> > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > +* **[provides] mbuf**: ``mbuf.port``.
> > +
> > +
> >  .. _nic_features_packet_type_parsing:
> >
> >  Packet type parsing
> > diff --git a/doc/guides/nics/features/default.ini
> > b/doc/guides/nics/features/default.ini
> > index 754184ddd4..ebeb4c1851 100644
> > --- a/doc/guides/nics/features/default.ini
> > +++ b/doc/guides/nics/features/default.ini
> > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> >  Queue start/stop     =
> >  Runtime Rx queue setup =
> >  Runtime Tx queue setup =
> > +Shared Rx queue      =
> >  Burst mode info      =
> >  Power mgmt address monitor =
> >  MTU update           =
> > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > b/doc/guides/prog_guide/switch_representation.rst
> > index ff6aa91c80..45bf5a3a10 100644
> > --- a/doc/guides/prog_guide/switch_representation.rst
> > +++ b/doc/guides/prog_guide/switch_representation.rst
> > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> >  .. [1] `Ethernet switch device driver model (switchdev)
> >
> > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> >
> > +- Memory usage of representors is huge when number of representor
> > +grows,
> > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > +  Polling the large number of ports brings more CPU load, cache miss
> > +and
> > +  latency. Shared Rx queue can be used to share Rx queue between PF
> > +and
> > +  representors in same switch domain.
> > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > +  is present in Rx offloading capability of device info. Setting the
> > +  offloading flag in device Rx mode or Rx queue configuration to
> > +enable
> > +  shared Rx queue. Polling any member port of shared Rx queue can
> > +return
> > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > +
> >  Basic SR-IOV
> >  ------------
> >
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 9d95cd11e1..1361ff759a 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -127,6 +127,7 @@ static const struct {
> >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> >  };
> >
> >  #undef RTE_RX_OFFLOAD_BIT2STR
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > d2b27c351f..a578c9db9d 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > */
> > +       uint32_t shared_group; /**< Shared port group index in switch
> > + domain. */
> >         /**
> >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >          * Only offloads set on rx_queue_offload_capa or
> > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {  #define
> > DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > +/**
> > + * Rx queue is shared among ports in same switch domain to save
> > +memory,
> > + * avoid polling each port. Any port in group can be used to receive packets.
> > + * Real source port number saved in mbuf->port field.
> > + */
> > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> >
> >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > --
> > 2.25.1
> >


* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-09 14:16     ` Xueming(Steven) Li
@ 2021-08-11  8:02       ` Jerin Jacob
  2021-08-11  8:28         ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-11  8:02 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Monday, August 9, 2021 9:51 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > >
> > > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > > incoming packets. When number of representors scale out in a switch
> > > domain, the memory consumption became significant. Most important,
> > > polling all ports leads to high cache miss, high latency and low
> > > throughput.
> > >
> > > This patch introduces shared RX queue. Ports with same configuration
> > > in a switch domain could share RX queue set by specifying sharing group.
> > > Polling any queue using same shared RX queue receives packets from all
> > > member ports. Source port is identified by mbuf->port.
> > >
> > > Port queue number in a shared group should be identical. Queue index
> > > is 1:1 mapped in shared group.
> > >
> > > Share RX queue is supposed to be polled on same thread.
> > >
> > > Multiple groups is supported by group ID.
> >
> > Is this offload specific to the representor? If so can this name be changed specifically to representor?
>
> Yes, PF and representor in switch domain could take advantage.
>
> > If it is for a generic case, how the flow ordering will be maintained?
>
> Not quite sure that I understood your question. The control path of is almost same as before,
> PF and representor port still needed, rte flows not impacted.
> Queues still needed for each member port, descriptors(mbuf) will be supplied from shared Rx queue
> in my PMD implementation.

My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload,
multiple ethdev receive queues land in the same receive queue. In that case,
how is the flow order maintained for the respective receive queues?
If this offload is only useful for the representor case, can we make this
offload specific to the representor case by changing its name and scope?



>
> >
> > >
> > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > ---
> > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > >  doc/guides/nics/features/default.ini            |  1 +
> > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > >  5 files changed, 30 insertions(+)
> > >
> > > diff --git a/doc/guides/nics/features.rst
> > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > --- a/doc/guides/nics/features.rst
> > > +++ b/doc/guides/nics/features.rst
> > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > >
> > >
> > > +.. _nic_features_shared_rx_queue:
> > > +
> > > +Shared Rx queue
> > > +---------------
> > > +
> > > +Supports shared Rx queue for ports in same switch domain.
> > > +
> > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > +* **[provides] mbuf**: ``mbuf.port``.
> > > +
> > > +
> > >  .. _nic_features_packet_type_parsing:
> > >
> > >  Packet type parsing
> > > diff --git a/doc/guides/nics/features/default.ini
> > > b/doc/guides/nics/features/default.ini
> > > index 754184ddd4..ebeb4c1851 100644
> > > --- a/doc/guides/nics/features/default.ini
> > > +++ b/doc/guides/nics/features/default.ini
> > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > >  Queue start/stop     =
> > >  Runtime Rx queue setup =
> > >  Runtime Tx queue setup =
> > > +Shared Rx queue      =
> > >  Burst mode info      =
> > >  Power mgmt address monitor =
> > >  MTU update           =
> > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > b/doc/guides/prog_guide/switch_representation.rst
> > > index ff6aa91c80..45bf5a3a10 100644
> > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > >  .. [1] `Ethernet switch device driver model (switchdev)
> > >
> > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > >
> > > +- Memory usage of representors is huge when number of representor
> > > +grows,
> > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > +  Polling the large number of ports brings more CPU load, cache miss
> > > +and
> > > +  latency. Shared Rx queue can be used to share Rx queue between PF
> > > +and
> > > +  representors in same switch domain.
> > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > +  is present in Rx offloading capability of device info. Setting the
> > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > +enable
> > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > +return
> > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > +
> > >  Basic SR-IOV
> > >  ------------
> > >
> > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > > 9d95cd11e1..1361ff759a 100644
> > > --- a/lib/ethdev/rte_ethdev.c
> > > +++ b/lib/ethdev/rte_ethdev.c
> > > @@ -127,6 +127,7 @@ static const struct {
> > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > >  };
> > >
> > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > d2b27c351f..a578c9db9d 100644
> > > --- a/lib/ethdev/rte_ethdev.h
> > > +++ b/lib/ethdev/rte_ethdev.h
> > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > */
> > > +       uint32_t shared_group; /**< Shared port group index in switch
> > > + domain. */
> > >         /**
> > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > >          * Only offloads set on rx_queue_offload_capa or
> > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {  #define
> > > DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > +/**
> > > + * Rx queue is shared among ports in same switch domain to save
> > > +memory,
> > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > + * Real source port number saved in mbuf->port field.
> > > + */
> > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > >
> > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > --
> > > 2.25.1
> > >


* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11  8:02       ` Jerin Jacob
@ 2021-08-11  8:28         ` Xueming(Steven) Li
  2021-08-11 12:04           ` Ferruh Yigit
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-11  8:28 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, August 11, 2021 4:03 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Monday, August 9, 2021 9:51 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > >
> > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > for incoming packets. When number of representors scale out in a
> > > > switch domain, the memory consumption became significant. Most
> > > > important, polling all ports leads to high cache miss, high
> > > > latency and low throughput.
> > > >
> > > > This patch introduces shared RX queue. Ports with same
> > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > Polling any queue using same shared RX queue receives packets from
> > > > all member ports. Source port is identified by mbuf->port.
> > > >
> > > > Port queue number in a shared group should be identical. Queue
> > > > index is
> > > > 1:1 mapped in shared group.
> > > >
> > > > Share RX queue is supposed to be polled on same thread.
> > > >
> > > > Multiple groups is supported by group ID.
> > >
> > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> >
> > Yes, PF and representor in switch domain could take advantage.
> >
> > > If it is for a generic case, how the flow ordering will be maintained?
> >
> > Not quite sure that I understood your question. The control path of is
> > almost same as before, PF and representor port still needed, rte flows not impacted.
> > Queues still needed for each member port, descriptors(mbuf) will be
> > supplied from shared Rx queue in my PMD implementation.
> 
> My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> receive queue, In that case, how the flow order is maintained for respective receive queues.

I guess the question is about the testpmd forward stream? The forwarding logic has to change slightly in the shared rxq case:
basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
Packets from the same source port could be grouped into a small burst to process, which will improve performance if traffic comes from
a limited number of ports. I'll introduce a common API to do shared rxq forwarding, called with a packet handling callback, so it suits
all forwarding engines. Will send patches soon.
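
As a rough sketch of the per-packet lookup and grouping described above (this is not the testpmd change itself): `struct pkt` is a
hypothetical stand-in for `struct rte_mbuf` so the example builds without DPDK, and `sub_burst_cb`, `demux_by_port` and `count_cb`
are illustrative names, not existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct rte_mbuf; only the source port matters here. */
struct pkt {
	uint16_t port; /* plays the role of mbuf->port */
};

/* Hypothetical handler: every packet in pkts[0..n-1] comes from 'port'. */
typedef void (*sub_burst_cb)(uint16_t port, struct pkt **pkts, uint16_t n,
			     void *arg);

/*
 * Walk one rx_burst result and hand each run of consecutive packets
 * from the same source port to cb. Returns the number of sub-bursts.
 */
static uint16_t
demux_by_port(struct pkt **pkts, uint16_t nb_rx, sub_burst_cb cb, void *arg)
{
	uint16_t start = 0, i, runs = 0;

	assert(cb != NULL);
	for (i = 1; i <= nb_rx; i++) {
		/* Close the current run at the end or when the port changes. */
		if (i == nb_rx || pkts[i]->port != pkts[start]->port) {
			cb(pkts[start]->port, &pkts[start],
			   (uint16_t)(i - start), arg);
			runs++;
			start = i;
		}
	}
	return runs;
}

/* Example handler: count packets per source port (arg is a uint32_t array). */
static void
count_cb(uint16_t port, struct pkt **pkts, uint16_t n, void *arg)
{
	(void)pkts;
	((uint32_t *)arg)[port] += n;
}
```

Only consecutive packets are grouped, so packet order within a source port is preserved; traffic from few ports yields few, large
sub-bursts, while heavily interleaved traffic degrades to one callback per packet.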

> If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> scope.

It works for both PF and representors in the same switch domain; for applications like OVS, only a few changes are needed.

> 
> 
> >
> > >
> > > >
> > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > ---
> > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > >  doc/guides/nics/features/default.ini            |  1 +
> > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > >  5 files changed, 30 insertions(+)
> > > >
> > > > diff --git a/doc/guides/nics/features.rst
> > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > --- a/doc/guides/nics/features.rst
> > > > +++ b/doc/guides/nics/features.rst
> > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > >
> > > >
> > > > +.. _nic_features_shared_rx_queue:
> > > > +
> > > > +Shared Rx queue
> > > > +---------------
> > > > +
> > > > +Supports shared Rx queue for ports in same switch domain.
> > > > +
> > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > +
> > > > +
> > > >  .. _nic_features_packet_type_parsing:
> > > >
> > > >  Packet type parsing
> > > > diff --git a/doc/guides/nics/features/default.ini
> > > > b/doc/guides/nics/features/default.ini
> > > > index 754184ddd4..ebeb4c1851 100644
> > > > --- a/doc/guides/nics/features/default.ini
> > > > +++ b/doc/guides/nics/features/default.ini
> > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > >  Queue start/stop     =
> > > >  Runtime Rx queue setup =
> > > >  Runtime Tx queue setup =
> > > > +Shared Rx queue      =
> > > >  Burst mode info      =
> > > >  Power mgmt address monitor =
> > > >  MTU update           =
> > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > index ff6aa91c80..45bf5a3a10 100644
> > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > >
> > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > >`_
> > > >
> > > > +- Memory usage of representors is huge when number of representor
> > > > +grows,
> > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > +  Polling the large number of ports brings more CPU load, cache
> > > > +miss and
> > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > +PF and
> > > > +  representors in same switch domain.
> > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > +  is present in Rx offloading capability of device info. Setting
> > > > +the
> > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > +enable
> > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > +return
> > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > +
> > > >  Basic SR-IOV
> > > >  ------------
> > > >
> > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > index 9d95cd11e1..1361ff759a 100644
> > > > --- a/lib/ethdev/rte_ethdev.c
> > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > @@ -127,6 +127,7 @@ static const struct {
> > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > >  };
> > > >
> > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > index d2b27c351f..a578c9db9d 100644
> > > > --- a/lib/ethdev/rte_ethdev.h
> > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > */
> > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > + switch domain. */
> > > >         /**
> > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > >          * Only offloads set on rx_queue_offload_capa or
> > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > +/**
> > > > + * Rx queue is shared among ports in same switch domain to save
> > > > +memory,
> > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > + * Real source port number saved in mbuf->port field.
> > > > + */
> > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > >
> > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > --
> > > > 2.25.1
> > > >


* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11  8:28         ` Xueming(Steven) Li
@ 2021-08-11 12:04           ` Ferruh Yigit
  2021-08-11 12:59             ` Xueming(Steven) Li
  2021-09-26  5:35             ` Xueming(Steven) Li
  0 siblings, 2 replies; 266+ messages in thread
From: Ferruh Yigit @ 2021-08-11 12:04 UTC (permalink / raw)
  To: Xueming(Steven) Li, Jerin Jacob
  Cc: dpdk-dev, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> 
> 
>> -----Original Message-----
>> From: Jerin Jacob <jerinjacobk@gmail.com>
>> Sent: Wednesday, August 11, 2021 4:03 PM
>> To: Xueming(Steven) Li <xuemingl@nvidia.com>
>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
>> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>>
>> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>>>
>>> Hi,
>>>
>>>> -----Original Message-----
>>>> From: Jerin Jacob <jerinjacobk@gmail.com>
>>>> Sent: Monday, August 9, 2021 9:51 PM
>>>> To: Xueming(Steven) Li <xuemingl@nvidia.com>
>>>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
>>>> NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
>>>> <andrew.rybchenko@oktetlabs.ru>
>>>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>>>>
>>>> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
>>>>>
>>>>> In current DPDK framework, each RX queue is pre-loaded with mbufs
>>>>> for incoming packets. When number of representors scale out in a
>>>>> switch domain, the memory consumption became significant. Most
>>>>> important, polling all ports leads to high cache miss, high
>>>>> latency and low throughput.
>>>>>
>>>>> This patch introduces shared RX queue. Ports with same
>>>>> configuration in a switch domain could share RX queue set by specifying sharing group.
>>>>> Polling any queue using same shared RX queue receives packets from
>>>>> all member ports. Source port is identified by mbuf->port.
>>>>>
>>>>> Port queue number in a shared group should be identical. Queue
>>>>> index is
>>>>> 1:1 mapped in shared group.
>>>>>
>>>>> Share RX queue is supposed to be polled on same thread.
>>>>>
>>>>> Multiple groups is supported by group ID.
>>>>
>>>> Is this offload specific to the representor? If so can this name be changed specifically to representor?
>>>
>>> Yes, PF and representor in switch domain could take advantage.
>>>
>>>> If it is for a generic case, how the flow ordering will be maintained?
>>>
>>> Not quite sure that I understood your question. The control path of is
>>> almost same as before, PF and representor port still needed, rte flows not impacted.
>>> Queues still needed for each member port, descriptors(mbuf) will be
>>> supplied from shared Rx queue in my PMD implementation.
>>
>> My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
>> receive queue, In that case, how the flow order is maintained for respective receive queues.
> 
> I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> all forwarding engine. Will sent patches soon.
> 

All ports will put the packets into the same queue (shared queue), right? Does
this mean only a single core will poll it? What will happen if there are
multiple cores polling, won't it cause problems?

And if this requires specific changes in the application, I am not sure about
the solution, can't this work in a transparent way to the application?

Overall, is this for optimizing memory for the port representors? If so, can't we
have a port representor specific solution? Reducing the scope can reduce the
complexity it brings.

>> If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
>> scope.
> 
> It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> 



* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11 12:04           ` Ferruh Yigit
@ 2021-08-11 12:59             ` Xueming(Steven) Li
  2021-08-12 14:35               ` Xueming(Steven) Li
  2021-09-15 15:34               ` Xueming(Steven) Li
  2021-09-26  5:35             ` Xueming(Steven) Li
  1 sibling, 2 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-11 12:59 UTC (permalink / raw)
  To: Ferruh Yigit, Jerin Jacob
  Cc: dpdk-dev, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Wednesday, August 11, 2021 8:04 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>; Jerin Jacob <jerinjacobk@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jerin Jacob <jerinjacobk@gmail.com>
> >> Sent: Wednesday, August 11, 2021 4:03 PM
> >> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> >> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> >> NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> >> <andrew.rybchenko@oktetlabs.ru>
> >> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >>
> >> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>>> -----Original Message-----
> >>>> From: Jerin Jacob <jerinjacobk@gmail.com>
> >>>> Sent: Monday, August 9, 2021 9:51 PM
> >>>> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> >>>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> >>>> NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> >>>> <andrew.rybchenko@oktetlabs.ru>
> >>>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx
> >>>> queue
> >>>>
> >>>> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >>>>>
> >>>>> In current DPDK framework, each RX queue is pre-loaded with mbufs
> >>>>> for incoming packets. When number of representors scale out in a
> >>>>> switch domain, the memory consumption became significant. Most
> >>>>> important, polling all ports leads to high cache miss, high
> >>>>> latency and low throughput.
> >>>>>
> >>>>> This patch introduces shared RX queue. Ports with same
> >>>>> configuration in a switch domain could share RX queue set by specifying sharing group.
> >>>>> Polling any queue using same shared RX queue receives packets from
> >>>>> all member ports. Source port is identified by mbuf->port.
> >>>>>
> >>>>> Port queue number in a shared group should be identical. Queue
> >>>>> index is
> >>>>> 1:1 mapped in shared group.
> >>>>>
> >>>>> Share RX queue is supposed to be polled on same thread.
> >>>>>
> >>>>> Multiple groups is supported by group ID.
> >>>>
> >>>> Is this offload specific to the representor? If so can this name be changed specifically to representor?
> >>>
> >>> Yes, PF and representor in switch domain could take advantage.
> >>>
> >>>> If it is for a generic case, how the flow ordering will be maintained?
> >>>
> >>> Not quite sure that I understood your question. The control path of
> >>> is almost same as before, PF and representor port still needed, rte flows not impacted.
> >>> Queues still needed for each member port, descriptors(mbuf) will be
> >>> supplied from shared Rx queue in my PMD implementation.
> >>
> >> My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> >> offload, multiple ethdev receive queues land into the same receive queue, In that case, how the flow order is maintained for
> respective receive queues.
> >
> > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > Packets from same source port could be grouped as a small burst to
> > process, this will accelerates the performance if traffic come from
> > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for all
> forwarding engine. Will sent patches soon.
> >
> 
> All ports will put the packets in to the same queue (share queue), right? Does this means only single core will poll only, what will
> happen if there are multiple cores polling, won't it cause problem?

This is mentioned in the commit log: the shared rxq is supposed to be polled on a single thread (core) - I think it should be a "MUST".
The result is unpredictable if multiple cores poll it, which is why I added a polling schedule check in testpmd.
The same applies to the rx/tx burst functions: a queue can't be polled from multiple threads (cores), and for performance reasons there is no such check in the API.

If users want to use multiple cores to distribute the workload, it's possible to define more groups; queues in different groups
can be polled on multiple cores.
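
To illustrate the single-poller rule and the multi-group split, here is a minimal sketch of a schedule check. It is not the testpmd
check itself: `struct poll_entry` and `schedule_is_valid` are hypothetical, and it assumes a shared queue is identified by the pair
(group, queue index), following the 1:1 queue index mapping in the commit log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one polling assignment. */
struct poll_entry {
	unsigned int lcore; /* core that polls */
	uint32_t group;     /* shared Rx queue group */
	uint16_t queue;     /* queue index inside the group */
};

/* Return true if no (group, queue) pair is polled by two different cores. */
static bool
schedule_is_valid(const struct poll_entry *e, size_t n)
{
	size_t i, j;

	assert(e != NULL || n == 0);
	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (e[i].group == e[j].group &&
			    e[i].queue == e[j].queue &&
			    e[i].lcore != e[j].lcore)
				return false;
	return true;
}
```

A schedule with group 0 on core 1 and group 1 on core 2 passes, while polling queue 0 of group 0 from both cores is rejected. The
quadratic scan is fine for a one-time configuration check and would not belong in the datapath.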

It's possible to poll every member port in a group, but not necessary: any port in the group can be polled to get packets for all ports in the group.

If the member ports are subject to hot plug/remove, it's possible to create a vdev with the same queue number, copy the rxq object and poll the vdev
as a dedicated proxy for the group.

> 
> And if this requires specific changes in the application, I am not sure about the solution, can't this work in a transparent way to the
> application?

Yes, we considered different options in the design stage. One possible solution is to cache received packets in rings. This can be done at the
eth layer, but I'm afraid the benefits are smaller, and the user still has to be aware of multi-core polling.
It could be done as a wrapper PMD later, with more effort.
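
For illustration only, the ring-caching option could look roughly like the sketch below. It is single-threaded, uses a hypothetical
`struct pkt` in place of `struct rte_mbuf`, and a toy array ring instead of rte_ring; `stash_burst`, `port_rx`, `RING_SZ` and
`MAX_PORTS` are made-up names and sizes.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SZ   8u /* hypothetical, must be a power of two */
#define MAX_PORTS 4u /* hypothetical number of member ports */

/* Stand-in for struct rte_mbuf. */
struct pkt {
	uint16_t port; /* plays the role of mbuf->port */
};

/* Toy single-threaded ring caching packets for one port. */
struct port_ring {
	struct pkt *slot[RING_SZ];
	uint32_t head; /* next write position */
	uint32_t tail; /* next read position */
};

static struct port_ring rings[MAX_PORTS];

/*
 * Distribute one shared-queue burst into per-port rings.
 * Returns the number of packets dropped (unknown port or full ring).
 */
static uint16_t
stash_burst(struct pkt **pkts, uint16_t nb_rx)
{
	uint16_t i, dropped = 0;

	for (i = 0; i < nb_rx; i++) {
		struct port_ring *r;

		if (pkts[i]->port >= MAX_PORTS) {
			dropped++;
			continue;
		}
		r = &rings[pkts[i]->port];
		if (r->head - r->tail == RING_SZ) {
			dropped++;
			continue;
		}
		r->slot[r->head++ % RING_SZ] = pkts[i];
	}
	return dropped;
}

/* Per-port receive: the application keeps polling each port as before. */
static uint16_t
port_rx(uint16_t port, struct pkt **out, uint16_t max)
{
	struct port_ring *r;
	uint16_t n = 0;

	assert(port < MAX_PORTS);
	r = &rings[port];
	while (n < max && r->tail != r->head)
		out[n++] = r->slot[r->tail++ % RING_SZ];
	return n;
}
```

The application-facing call stays per-port, but every packet takes an extra enqueue/dequeue and the per-port polling loop remains,
which is the reduced benefit mentioned above.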

> 
> Overall, is this for optimizing memory for the port represontors? If so can't we have a port representor specific solution, reducing
> scope can reduce the complexity it brings?

This feature supports both PF and representors, and yes, the major issue is the memory of representors. Polling all representors also
introduces more core cache miss latency. This feature essentially aggregates all ports in a group as one port.
On the other hand, it's useful for rte flow to create offloading flows using a representor as a regular port ID.

Any new solution/suggestion would be great, my head is buried in PMD code :)

> 
> >> If this offload is only useful for representor case, Can we make this
> >> offload specific to representor the case by changing its name and scope.
> >
> > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> >



* [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
  2021-07-28  7:56 ` Andrew Rybchenko
  2021-08-09 11:47 ` [dpdk-dev] [PATCH v1] " Xueming Li
@ 2021-08-11 14:04 ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 02/15] app/testpmd: dump port and queue info for each packet Xueming Li
                     ` (14 more replies)
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                   ` (13 subsequent siblings)
  16 siblings, 15 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Jerin Jacob, Ferruh Yigit, Thomas Monjalon,
	Andrew Rybchenko

In the current DPDK framework, each Rx queue is pre-loaded with mbufs for
incoming packets. When the number of representors scales out in a switch
domain, the memory consumption becomes significant. More importantly,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces shared Rx queue. Ports with the same configuration
in a switch domain can share an Rx queue set by specifying a sharing
group. Polling any queue that uses the same shared Rx queue receives
packets from all member ports. The source port is identified by
mbuf->port.

The number of queues must be identical for all ports in a shared group.
Queue indexes are 1:1 mapped within the shared group.

A shared Rx queue must be polled on a single thread or core.

Multiple groups are supported by group ID.
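
The sharing rule above can be read as a key comparison. The following is
a minimal sketch, not ethdev code: `struct rxq_key` and
`rxq_is_same_shared()` are hypothetical names invented for illustration,
and the sketch assumes that sharing is decided by switch domain, share
group and queue index only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration type; not part of the ethdev API. */
struct rxq_key {
	uint16_t domain_id;    /* switch domain of the port */
	uint32_t shared_group; /* group index, as in rte_eth_rxconf.shared_group */
	uint16_t queue_id;     /* Rx queue index on the port */
	bool     shared;       /* shared Rx queue offload enabled on the queue */
};

/*
 * Two port queues map to the same shared Rx queue only if both enabled
 * sharing and agree on switch domain, share group and queue index.
 */
static bool
rxq_is_same_shared(const struct rxq_key *a, const struct rxq_key *b)
{
	assert(a != NULL && b != NULL);
	if (!a->shared || !b->shared)
		return false;
	return a->domain_id == b->domain_id &&
	       a->shared_group == b->shared_group &&
	       a->queue_id == b->queue_id;
}
```

Under these assumptions, queue 0 of a PF and queue 0 of a representor in
the same domain and group compare equal, while queue 1, or queue 0 in
another group, does not.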

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Cc: Jerin Jacob <jerinjacobk@gmail.com>
---
The Rx queue object could be used as the shared Rx queue object; it is
important to clear up all queue control callback APIs that use the queue
object:
  https://mails.dpdk.org/archives/dev/2021-July/215574.html
---
 doc/guides/nics/features.rst                    | 11 +++++++++++
 doc/guides/nics/features/default.ini            |  1 +
 doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
 lib/ethdev/rte_ethdev.c                         |  1 +
 lib/ethdev/rte_ethdev.h                         |  7 +++++++
 5 files changed, 30 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index a96e12d155..2e2a9b1554 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same switch domain.
+
+* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 754184ddd4..ebeb4c1851 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c80..45bf5a3a10 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- Memory usage of representors is huge when number of representor grows,
+  because PMD always allocate mbuf for each descriptor of Rx queue.
+  Polling the large number of ports brings more CPU load, cache miss and
+  latency. Shared Rx queue can be used to share Rx queue between PF and
+  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
+  is present in Rx offloading capability of device info. Setting the
+  offloading flag in device Rx mode or Rx queue configuration to enable
+  shared Rx queue. Polling any member port of shared Rx queue can return
+  packets of all ports in group, port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 193f0d8295..058f5c88d9 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d2b27c351f..a578c9db9d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1373,6 +1374,12 @@ struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * Rx queue is shared among ports in same switch domain to save memory,
+ * avoid polling each port. Any port in group can be used to receive packets.
+ * Real source port number saved in mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
-- 
2.25.1



* [dpdk-dev] [PATCH v2 02/15] app/testpmd: dump port and queue info for each packet
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 03/15] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

With a shared Rx queue, the mbufs returned from one Rx burst can come
from different ports.

To support shared Rx queue, this patch dumps mbuf->port and queue for
each packet.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 5dd7157947..1733d5e663 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,7 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ", mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.25.1



* [dpdk-dev] [PATCH v2 03/15] app/testpmd: new parameter to enable shared Rx queue
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 02/15] app/testpmd: dump port and queue info for each packet Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 04/15] app/testpmd: make sure shared Rx queue polled on same core Xueming Li
                     ` (12 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

Add the "--rxq-share" parameter to enable shared rxq for each rxq.

The default shared rxq group 0 is used; Rx queues in the same switch
domain share the same rxq according to queue index.

Shared Rx queue is enabled only if the device supports the offloading
flag RTE_ETH_RX_OFFLOAD_SHARED_RXQ.
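
The capability gate described above can be sketched in isolation. This is
a standalone illustration, not the testpmd code: the helper name
`rxq_share_offloads()` and the local macro `SKETCH_RX_OFFLOAD_SHARED_RXQ`
are hypothetical; the macro only mirrors the flag value defined in patch
01/15 of this series.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the flag value from patch 01/15; the name is hypothetical. */
#define SKETCH_RX_OFFLOAD_SHARED_RXQ 0x00200000ULL

/*
 * Return the Rx offloads to request: the shared Rx queue flag is added
 * only when the option is set and the device reports the capability.
 */
static uint64_t
rxq_share_offloads(uint64_t offloads, uint64_t rx_offload_capa,
		   uint8_t rxq_share)
{
	/* Sanity check: the sketch flag must be a single bit. */
	assert((SKETCH_RX_OFFLOAD_SHARED_RXQ &
		(SKETCH_RX_OFFLOAD_SHARED_RXQ - 1)) == 0);
	if (rxq_share &&
	    (rx_offload_capa & SKETCH_RX_OFFLOAD_SHARED_RXQ))
		offloads |= SKETCH_RX_OFFLOAD_SHARED_RXQ;
	return offloads;
}
```

A device that does not advertise the flag keeps its offloads unchanged
even when "--rxq-share" is given, which matches the silent fallback
described in this commit message.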

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  6 +++++-
 app/test-pmd/parameters.c             |  4 ++++
 app/test-pmd/testpmd.c                | 14 ++++++++++++++
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  5 +++++
 5 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 31d8ba1b91..bb882a56a4 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2709,7 +2709,11 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+				printf(" share group=%u",
+				       rx_conf->shared_group);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 7c13210f04..a466a20bfb 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -166,6 +166,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: share rxq between PF and representors\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -602,6 +603,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			0, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1256,6 +1258,8 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share"))
+				rxq_share = 1;
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 6cbe9ba3c8..67fd128862 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -223,6 +223,9 @@ uint8_t  rx_pkt_nb_segs; /**< Number of segments to split */
 uint16_t rx_pkt_seg_offsets[MAX_SEGS_BUFFER_SPLIT];
 uint8_t  rx_pkt_nb_offs; /**< Number of specified offsets */
 
+uint8_t rxq_share;
+/**< Create shared rxq for PF and representors. */
+
 /*
  * Configuration of packet segments used by the "txonly" processing engine.
  */
@@ -1441,6 +1444,11 @@ init_config_port_offloads(portid_t pid, uint32_t socket_id)
 		port->dev_conf.txmode.offloads &=
 			~DEV_TX_OFFLOAD_MBUF_FAST_FREE;
 
+	if (rxq_share &&
+	    (port->dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
+		port->dev_conf.rxmode.offloads |=
+				RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+
 	/* Apply Rx offloads configuration */
 	for (i = 0; i < port->dev_info.max_rx_queues; i++)
 		port->rx_conf[i].offloads = port->dev_conf.rxmode.offloads;
@@ -3334,6 +3342,12 @@ rxtx_port_config(struct rte_port *port)
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.rx_offload_capa &
+		     RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
+			offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 16a3598e48..f3b1d34e28 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint8_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern uint16_t mb_mempool_cache;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 6061674239..8a9aeeb11f 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -384,6 +384,11 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share``
+
+    Create all queues in shared RX queue mode if device supports, queues in
+    same switch domain are shared according queue ID.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.25.1



* [dpdk-dev] [PATCH v2 04/15] app/testpmd: make sure shared Rx queue polled on same core
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 02/15] app/testpmd: dump port and queue info for each packet Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 03/15] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 05/15] app/testpmd: adds common forwarding for shared Rx queue Xueming Li
                     ` (11 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

Shared rxqs use one set of Rx queues internally, so the queues must be
polled from one core.

Stop forwarding if a shared rxq is scheduled on multiple cores.
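
The check can be pictured on a flat list of streams. The sketch below is
a simplified, standalone illustration: `struct sketch_stream` and
`shared_rxq_single_lcore()` are hypothetical and carry only the fields
needed here, whereas the patch walks testpmd's per-lcore stream tables.
Like the patch, it compares switch domain and queue index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stream record used only for this sketch. */
struct sketch_stream {
	unsigned int lcore;     /* lcore the stream is scheduled on */
	uint16_t     domain_id; /* switch domain of the Rx port */
	uint16_t     rx_queue;  /* Rx queue index */
	bool         shared;    /* queue uses the shared Rx queue offload */
};

/*
 * Return true if no shared Rx queue (same switch domain and queue
 * index) is scheduled on more than one lcore.
 */
static bool
shared_rxq_single_lcore(const struct sketch_stream *s, size_t n)
{
	size_t i, j;

	assert(s != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!s[i].shared)
			continue; /* Not a shared rxq. */
		for (j = i + 1; j < n; j++) {
			if (s[j].shared &&
			    s[j].domain_id == s[i].domain_id &&
			    s[j].rx_queue == s[i].rx_queue &&
			    s[j].lcore != s[i].lcore)
				return false;
		}
	}
	return true;
}
```

A caller would refuse to start forwarding when the function returns
false, as the patch does in start_packet_forwarding().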

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 91 ++++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |  4 +-
 app/test-pmd/testpmd.h |  2 +
 3 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index bb882a56a4..51f7d26045 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2885,6 +2885,97 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared rxq scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, portid_t src_port,
+			   queueid_t src_rxq, lcoreid_t src_lc)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (fs->rx_queue != src_rxq)
+				continue;
+			printf("Shared RX queue can't be scheduled on different cores:\n");
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("  please use --nb-cores=%hu to limit forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * Shared group must not being scheduled on different core.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, fs->rx_port,
+						       fs->rx_queue, lc_id))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 67fd128862..d941bd982e 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2169,10 +2169,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f3b1d34e28..6497c56359 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -785,6 +786,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.25.1



* [dpdk-dev] [PATCH v2 05/15] app/testpmd: adds common forwarding for shared Rx queue
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (2 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 04/15] app/testpmd: make sure shared Rx queue polled on same core Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function Xueming Li
                     ` (10 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

With shared Rx queue enabled, received packets come from all member
ports of the same shared Rx queue.

This patch adds a common forwarding function for shared Rx queue. It
groups packets by source forwarding stream, looking up the local streams
on the current lcore by packet source port (mbuf->port) and queue, then
invokes a callback to handle the received packets for each source
stream.
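
The grouping step can be sketched without DPDK types. The code below is
an illustration under stated assumptions: `struct sketch_mbuf`,
`sketch_split_by_port()` and `sketch_count_cb()` are hypothetical names,
and only runs of consecutive packets with the same port are merged, as
in the patch. The sketch performs the bounds check before looking at the
next packet, so it never reads past the end of the burst.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only the port matters. */
struct sketch_mbuf {
	uint16_t port;
};

/* Callback invoked once per run of packets from the same source port. */
typedef void (*sketch_run_cb)(uint16_t port, struct sketch_mbuf **pkts,
			      uint16_t nb, void *arg);

/*
 * Split a burst into runs of consecutive packets that share the same
 * port and hand each run to the callback. Returns the number of runs.
 */
static uint16_t
sketch_split_by_port(struct sketch_mbuf **pkts, uint16_t nb_rx,
		     sketch_run_cb cb, void *arg)
{
	uint16_t i, start = 0, runs = 0;

	assert(cb != NULL && (pkts != NULL || nb_rx == 0));
	for (i = 0; i < nb_rx; i++) {
		/* Close the run at the last packet or on a port change. */
		if (i + 1 == nb_rx ||
		    pkts[i + 1]->port != pkts[start]->port) {
			cb(pkts[start]->port, &pkts[start],
			   (uint16_t)(i + 1 - start), arg);
			start = (uint16_t)(i + 1);
			runs++;
		}
	}
	return runs;
}

/* Example callback: count packets per port in an array indexed by port. */
static void
sketch_count_cb(uint16_t port, struct sketch_mbuf **pkts, uint16_t nb,
		void *arg)
{
	uint32_t *counts = arg;

	(void)pkts;
	counts[port] += nb;
}
```

Packets from the same port that are not adjacent in the burst end up in
separate runs; that keeps the loop to a single pass at the cost of more
callback invocations when ports interleave.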

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/testpmd.c | 69 ++++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.h |  4 +++
 2 files changed, 73 insertions(+)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index d941bd982e..f46bd97948 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2034,6 +2034,75 @@ flush_fwd_rx_queues(void)
 	}
 }
 
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of same shared Rx queue locates on same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_by_port(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		struct rte_mbuf **pkts, packet_fwd_cb fwd)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		fwd(fs, nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * Source port of packets are identified by mbuf->port.
+ */
+void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)
+{
+	uint16_t i, nb_fs_rx = 1, port;
+
+	/* Locate real source fs according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i + 1 == nb_rx || pkts_burst[i + 1]->port != port) {
+			/* Forward packets with same source port. */
+			forward_by_port(fs, port, nb_fs_rx,
+					&pkts_burst[i + 1 - nb_fs_rx], fwd);
+			nb_fs_rx = 1;
+		} else {
+			nb_fs_rx++;
+		}
+	}
+}
+
 static void
 run_pkt_fwd_on_lcore(struct fwd_lcore *fc, packet_fwd_t pkt_fwd)
 {
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 6497c56359..13141dfed9 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -272,6 +272,8 @@ struct fwd_lcore {
 typedef void (*port_fwd_begin_t)(portid_t pi);
 typedef void (*port_fwd_end_t)(portid_t pi);
 typedef void (*packet_fwd_t)(struct fwd_stream *fs);
+typedef void (*packet_fwd_cb)(struct fwd_stream *fs, uint16_t nb_rx,
+			      struct rte_mbuf **pkts);
 
 struct fwd_engine {
 	const char       *fwd_mode_name; /**< Forwarding mode name. */
@@ -897,6 +899,8 @@ char *list_pkt_forwarding_modes(void);
 char *list_pkt_forwarding_retry_modes(void);
 void set_pkt_forwarding_mode(const char *fwd_mode);
 void start_packet_forwarding(int with_tx_first);
+void forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+			struct rte_mbuf **pkts_burst, packet_fwd_cb fwd);
 void fwd_stats_display(void);
 void fwd_stats_reset(void);
 void stop_packet_forwarding(void);
-- 
2.25.1



* [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (3 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 05/15] app/testpmd: adds common forwarding for shared Rx queue Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-17  9:37     ` Jerin Jacob
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 07/15] app/testpmd: support shared Rx queues for IO forwarding Xueming Li
                     ` (9 subsequent siblings)
  14 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add an inline common wrapper function for all fwd engines, which does
the following in common:

1. get_start_cycles
2. rte_eth_rx_burst(...,nb_pkt_per_burst)
3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly
4. get_end_cycles
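
Steps 2 and 3 above can be sketched with plain function pointers. This is
a standalone illustration, not the testpmd wrapper: every `sketch_*`
name is hypothetical, the receive hook stands in for rte_eth_rx_burst(),
and the cycle accounting of steps 1 and 4 is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_BURST 32

struct sketch_pkt {
	uint16_t port;
};

struct sketch_stream;
typedef uint16_t (*sketch_rx_fn)(struct sketch_pkt **pkts, uint16_t max);
typedef void (*sketch_fwd_cb)(struct sketch_stream *fs, uint16_t nb_rx,
			      struct sketch_pkt **pkts);

struct sketch_stream {
	sketch_rx_fn rx;      /* stand-in for rte_eth_rx_burst() */
	uint64_t direct_pkts; /* packets handed straight to the engine */
	uint64_t shared_pkts; /* packets sent through the shared path */
};

/*
 * Receive one burst and dispatch it: through the shared path when
 * sharing is on, otherwise straight to the engine callback.
 * Returns the number of packets received.
 */
static uint16_t
sketch_do_burst_fwd(struct sketch_stream *fs, int rxq_share,
		    sketch_fwd_cb shared_fwd, sketch_fwd_cb fwd)
{
	struct sketch_pkt *pkts[SKETCH_MAX_BURST];
	uint16_t nb_rx;

	assert(fs != NULL && fs->rx != NULL);
	nb_rx = fs->rx(pkts, SKETCH_MAX_BURST);
	if (nb_rx == 0)
		return 0;
	if (rxq_share)
		shared_fwd(fs, nb_rx, pkts);
	else
		fwd(fs, nb_rx, pkts);
	return nb_rx;
}

/* Example hooks used to exercise the wrapper. */
static struct sketch_pkt sketch_pool[3] = { {0}, {1}, {1} };

static uint16_t
sketch_rx3(struct sketch_pkt **pkts, uint16_t max)
{
	uint16_t i, n = max < 3 ? max : 3;

	for (i = 0; i < n; i++)
		pkts[i] = &sketch_pool[i];
	return n;
}

static void
sketch_direct(struct sketch_stream *fs, uint16_t nb_rx,
	      struct sketch_pkt **pkts)
{
	(void)pkts;
	fs->direct_pkts += nb_rx;
}

static void
sketch_shared(struct sketch_stream *fs, uint16_t nb_rx,
	      struct sketch_pkt **pkts)
{
	(void)pkts;
	fs->shared_pkts += nb_rx;
}
```

Passing the engine body as a callback lets each forwarding engine keep
only its per-stream logic, while receive and dispatch live in one place.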

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 13141dfed9..b685ac48d6 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
 void remove_tx_dynf_callback(portid_t portid);
 int update_jumbo_frame_offload(portid_t portid);
 
+static inline void
+do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+
+	/*
+	 * Receive a burst of packets and forward them.
+	 */
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
+			pkts_burst, nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	if (unlikely(rxq_share > 0))
+		forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
+	else
+		(*fwd)(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
 /*
  * Work-around of a compilation error with ICC on invocations of the
  * rte_be_to_cpu_16() function.
-- 
2.25.1



* [dpdk-dev] [PATCH v2 07/15] app/testpmd: support shared Rx queues for IO forwarding
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (4 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 08/15] app/testpmd: support shared Rx queue for rxonly forwarding Xueming Li
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

Support shared Rx queue.

If shared Rx queue is enabled, group received packets by stream
according to the mbuf->port value and then forward on a per-stream basis
as before.

If shared Rx queue is not enabled, just forward on a per-stream basis.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/iofwd.c | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/app/test-pmd/iofwd.c b/app/test-pmd/iofwd.c
index 83d098adcb..316a80d65c 100644
--- a/app/test-pmd/iofwd.c
+++ b/app/test-pmd/iofwd.c
@@ -44,25 +44,11 @@
  * to packets data.
  */
 static void
-pkt_burst_io_forward(struct fwd_stream *fs)
+io_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		  struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
-			pkts_burst, nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-	fs->rx_packets += nb_rx;
 
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue,
 			pkts_burst, nb_rx);
@@ -85,8 +71,15 @@ pkt_burst_io_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
+}
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_io_forward(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, io_forward_stream);
 }
 
 struct fwd_engine io_fwd_engine = {
-- 
2.25.1



* [dpdk-dev] [PATCH v2 08/15] app/testpmd: support shared Rx queue for rxonly forwarding
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (5 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 07/15] app/testpmd: support shared Rx queues for IO forwarding Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 09/15] app/testpmd: support shared Rx queue for icmpecho fwd Xueming Li
                     ` (7 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

Support shared Rx queue.

If shared Rx queue is enabled, group received packets by stream
according to the mbuf->port value and then forward on a per-stream basis
as before.

If shared Rx queue is not enabled, just forward on a per-stream basis.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/rxonly.c | 34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index c78fc4609a..80ae0ecf93 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -41,32 +41,24 @@
 #include "testpmd.h"
 
 /*
- * Received a burst of packets.
+ * Process a burst of received packets from same stream.
  */
 static void
-pkt_burst_receive(struct fwd_stream *fs)
+rxonly_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		      struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
-	uint16_t nb_rx;
-	uint16_t i;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
+	RTE_SET_USED(fs);
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
 
-	fs->rx_packets += nb_rx;
-	for (i = 0; i < nb_rx; i++)
-		rte_pktmbuf_free(pkts_burst[i]);
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_receive(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, rxonly_forward_stream);
 }
 
 struct fwd_engine rx_only_engine = {
-- 
2.25.1



* [dpdk-dev] [PATCH v2 09/15] app/testpmd: support shared Rx queue for icmpecho fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (6 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 08/15] app/testpmd: support shared Rx queue for rxonly forwarding Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 10/15] app/testpmd: support shared Rx queue for csum fwd Xueming Li
                     ` (6 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a per-stream basis
(as before).

If shared rxq is not enabled, just forward them on a per-stream basis.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/icmpecho.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/app/test-pmd/icmpecho.c b/app/test-pmd/icmpecho.c
index 8948f28eb5..d6d11a2efb 100644
--- a/app/test-pmd/icmpecho.c
+++ b/app/test-pmd/icmpecho.c
@@ -267,13 +267,13 @@ ipv4_hdr_cksum(struct rte_ipv4_hdr *ip_h)
 	(((rte_be_to_cpu_32((ipv4_addr)) >> 24) & 0x000000FF) == 0xE0)
 
 /*
- * Receive a burst of packets, lookup for ICMP echo requests, and, if any,
- * send back ICMP echo replies.
+ * Look up ICMP echo requests in the received mbufs and, if any,
+ * send back ICMP echo replies to the corresponding Tx port.
  */
 static void
-reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
+reply_to_icmp_echo_rqsts_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
 	struct rte_mbuf *pkt;
 	struct rte_ether_hdr *eth_h;
 	struct rte_vlan_hdr *vlan_h;
@@ -283,7 +283,6 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 	struct rte_ether_addr eth_addr;
 	uint32_t retry;
 	uint32_t ip_addr;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_replies;
 	uint16_t eth_type;
@@ -291,22 +290,9 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 	uint16_t arp_op;
 	uint16_t arp_pro;
 	uint32_t cksum;
-	uint8_t  i;
+	uint16_t  i;
 	int l2_len;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * First, receive a burst of packets.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
 
-	fs->rx_packets += nb_rx;
 	nb_replies = 0;
 	for (i = 0; i < nb_rx; i++) {
 		if (likely(i < nb_rx - 1))
@@ -509,8 +495,15 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 			} while (++nb_tx < nb_replies);
 		}
 	}
+}
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, reply_to_icmp_echo_rqsts_stream);
 }
 
 struct fwd_engine icmp_echo_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 10/15] app/testpmd: support shared Rx queue for csum fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (7 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 09/15] app/testpmd: support shared Rx queue for icmpecho fwd Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 11/15] app/testpmd: support shared Rx queue for flowgen Xueming Li
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a per-stream basis (as before).

If shared rxq is not enabled, just forward on a per-stream basis.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/csumonly.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 607c889359..3b7fb35843 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -763,7 +763,7 @@ pkt_copy_split(const struct rte_mbuf *pkt)
 }
 
 /*
- * Receive a burst of packets, and for each packet:
+ * For each packet in received mbuf:
  *  - parse packet, and try to recognize a supported packet type (1)
  *  - if it's not a supported packet type, don't touch the packet, else:
  *  - reprocess the checksum of all supported layers. This is done in SW
@@ -792,9 +792,9 @@ pkt_copy_split(const struct rte_mbuf *pkt)
  * OUTER_IP is only useful for tunnel packets.
  */
 static void
-pkt_burst_checksum_forward(struct fwd_stream *fs)
+checksum_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
 	struct rte_mbuf *gso_segments[GSO_MAX_PKT_BURST];
 	struct rte_gso_ctx *gso_ctx;
 	struct rte_mbuf **tx_pkts_burst;
@@ -805,7 +805,6 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	void **gro_ctx;
 	uint16_t gro_pkts_num;
 	uint8_t gro_enable;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_prep;
 	uint16_t i;
@@ -820,18 +819,6 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	uint16_t nb_segments = 0;
 	int ret;
 
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/* receive a burst of packet */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	rx_bad_ip_csum = 0;
 	rx_bad_l4_csum = 0;
 	rx_bad_outer_l4_csum = 0;
@@ -1139,8 +1126,15 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(tx_pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
+}
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_checksum_forward(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, checksum_forward_stream);
 }
 
 struct fwd_engine csum_fwd_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 11/15] app/testpmd: support shared Rx queue for flowgen
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (8 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 10/15] app/testpmd: support shared Rx queue for csum fwd Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 12/15] app/testpmd: support shared Rx queue for MAC fwd Xueming Li
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then update stats on a per-stream basis (as before).

If shared rxq is not enabled, do so as usual on a per-stream basis.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/flowgen.c | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/app/test-pmd/flowgen.c b/app/test-pmd/flowgen.c
index 3bf6e1ce97..d74c302b1c 100644
--- a/app/test-pmd/flowgen.c
+++ b/app/test-pmd/flowgen.c
@@ -83,10 +83,10 @@ ip_sum(const alias_int16_t *hdr, int hdr_len)
  * still do so in order to maintain traffic statistics.
  */
 static void
-pkt_burst_flow_gen(struct fwd_stream *fs)
+flow_gen_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
 	unsigned pkt_size = tx_pkt_length - 4;	/* Adjust FCS */
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_mempool *mbp;
 	struct rte_mbuf  *pkt = NULL;
 	struct rte_ether_hdr *eth_hdr;
@@ -94,23 +94,14 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 	struct rte_udp_hdr *udp_hdr;
 	uint16_t vlan_tci, vlan_tci_outer;
 	uint64_t ol_flags = 0;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_pkt;
 	uint16_t nb_clones = nb_pkt_flowgen_clones;
 	uint16_t i;
 	uint32_t retry;
 	uint64_t tx_offloads;
-	uint64_t start_tsc = 0;
 	static int next_flow = 0;
 
-	get_start_cycles(&start_tsc);
-
-	/* Receive a burst of packets and discard them. */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	fs->rx_packets += nb_rx;
-
 	for (i = 0; i < nb_rx; i++)
 		rte_pktmbuf_free(pkts_burst[i]);
 
@@ -213,8 +204,15 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_pkt);
 	}
+}
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine
+ */
+static void
+pkt_burst_flow_gen(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, flow_gen_stream);
 }
 
 struct fwd_engine flow_gen_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 12/15] app/testpmd: support shared Rx queue for MAC fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (9 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 11/15] app/testpmd: support shared Rx queue for flowgen Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 13/15] app/testpmd: support shared Rx queue for macswap fwd Xueming Li
                     ` (3 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a per-stream basis (as before).

If shared rxq is not enabled, just forward as usual on a per-stream basis.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/macfwd.c | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 0568ea794d..75fbea16d4 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -44,32 +44,18 @@
  * before forwarding them.
  */
 static void
-pkt_burst_mac_forward(struct fwd_stream *fs)
+mac_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
 	struct rte_mbuf  *mb;
 	struct rte_ether_hdr *eth_hdr;
 	uint32_t retry;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t i;
 	uint64_t ol_flags = 0;
 	uint64_t tx_offloads;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
 
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 	tx_offloads = txp->dev_conf.txmode.offloads;
 	if (tx_offloads	& DEV_TX_OFFLOAD_VLAN_INSERT)
@@ -116,8 +102,15 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
+}
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_mac_forward(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, mac_forward_stream);
 }
 
 struct fwd_engine mac_fwd_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 13/15] app/testpmd: support shared Rx queue for macswap fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (10 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 12/15] app/testpmd: support shared Rx queue for MAC fwd Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 14/15] app/testpmd: support shared Rx queue for 5tuple fwd Xueming Li
                     ` (2 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a per-stream basis (as before).

If shared rxq is not enabled, just forward as usual on a per-stream basis.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/macswap.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index 310bca06af..daf7170092 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -50,27 +50,13 @@
  * addresses of packets before forwarding them.
  */
 static void
-pkt_burst_mac_swap(struct fwd_stream *fs)
+mac_swap_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
 
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 
 	do_macswap(pkts_burst, nb_rx, txp);
@@ -95,7 +81,15 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-	get_end_cycles(fs, start_tsc);
+}
+
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_mac_swap(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, mac_swap_stream);
 }
 
 struct fwd_engine mac_swap_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 14/15] app/testpmd: support shared Rx queue for 5tuple fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (11 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 13/15] app/testpmd: support shared Rx queue for macswap fwd Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 15/15] app/testpmd: support shared Rx queue for ieee1588 fwd Xueming Li
  2021-08-17  9:33   ` [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue Jerin Jacob
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a per-stream basis (as before).

If shared rxq is not enabled, just forward as usual on a per-stream basis.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/5tswap.c | 30 +++++++++++-------------------
 1 file changed, 11 insertions(+), 19 deletions(-)

diff --git a/app/test-pmd/5tswap.c b/app/test-pmd/5tswap.c
index e8cef9623b..236a117ee3 100644
--- a/app/test-pmd/5tswap.c
+++ b/app/test-pmd/5tswap.c
@@ -82,18 +82,16 @@ swap_udp(struct rte_udp_hdr *udp_hdr)
  * Parses each layer and swaps it. When the next layer doesn't match it stops.
  */
 static void
-pkt_burst_5tuple_swap(struct fwd_stream *fs)
+_5tuple_swap_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
 	struct rte_mbuf *mb;
 	uint16_t next_proto;
 	uint64_t ol_flags;
 	uint16_t proto;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-
 	int i;
 	union {
 		struct rte_ether_hdr *eth;
@@ -105,20 +103,6 @@ pkt_burst_5tuple_swap(struct fwd_stream *fs)
 		uint8_t *byte;
 	} h;
 
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 	ol_flags = ol_flags_init(txp->dev_conf.txmode.offloads);
 	vlan_qinq_set(pkts_burst, nb_rx, ol_flags,
@@ -182,7 +166,15 @@ pkt_burst_5tuple_swap(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-	get_end_cycles(fs, start_tsc);
+}
+
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_5tuple_swap(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, _5tuple_swap_stream);
 }
 
 struct fwd_engine five_tuple_swap_fwd_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 15/15] app/testpmd: support shared Rx queue for ieee1588 fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (12 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 14/15] app/testpmd: support shared Rx queue for 5tuple fwd Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-17  9:33   ` [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue Jerin Jacob
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a per-stream basis (as before).

If shared rxq is not enabled, just forward as usual on a per-stream basis.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/ieee1588fwd.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/app/test-pmd/ieee1588fwd.c b/app/test-pmd/ieee1588fwd.c
index 034f238c34..dc6bf0e39d 100644
--- a/app/test-pmd/ieee1588fwd.c
+++ b/app/test-pmd/ieee1588fwd.c
@@ -90,23 +90,17 @@ port_ieee1588_tx_timestamp_check(portid_t pi)
 }
 
 static void
-ieee1588_packet_fwd(struct fwd_stream *fs)
+ieee1588_fwd_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkt)
 {
-	struct rte_mbuf  *mb;
+	struct rte_mbuf *mb = (*pkt);
 	struct rte_ether_hdr *eth_hdr;
 	struct rte_ether_addr addr;
 	struct ptpv2_msg *ptp_hdr;
 	uint16_t eth_type;
 	uint32_t timesync_index;
 
-	/*
-	 * Receive 1 packet at a time.
-	 */
-	if (rte_eth_rx_burst(fs->rx_port, fs->rx_queue, &mb, 1) == 0)
-		return;
-
-	fs->rx_packets += 1;
-
+	RTE_SET_USED(nb_rx);
 	/*
 	 * Check that the received packet is a PTP packet that was detected
 	 * by the hardware.
@@ -198,6 +192,22 @@ ieee1588_packet_fwd(struct fwd_stream *fs)
 	port_ieee1588_tx_timestamp_check(fs->rx_port);
 }
 
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+ieee1588_packet_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *mb;
+
+	if (rte_eth_rx_burst(fs->rx_port, fs->rx_queue, &mb, 1) == 0)
+		return;
+	if (unlikely(rxq_share > 0))
+		forward_shared_rxq(fs, 1, &mb, ieee1588_fwd_stream);
+	else
+		ieee1588_fwd_stream(fs, 1, &mb);
+}
+
 static void
 port_ieee1588_fwd_begin(portid_t pi)
 {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11 12:59             ` Xueming(Steven) Li
@ 2021-08-12 14:35               ` Xueming(Steven) Li
  2021-09-15 15:34               ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-12 14:35 UTC (permalink / raw)
  To: Xueming(Steven) Li, Ferruh Yigit, Jerin Jacob
  Cc: dpdk-dev, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Xueming(Steven) Li
> Sent: Wednesday, August 11, 2021 8:59 PM
> To: Ferruh Yigit <ferruh.yigit@intel.com>; Jerin Jacob <jerinjacobk@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> 
> 
> > -----Original Message-----
> > From: Ferruh Yigit <ferruh.yigit@intel.com>
> > Sent: Wednesday, August 11, 2021 8:04 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>; Jerin Jacob
> > <jerinjacobk@gmail.com>
> > Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
> > <thomas@monjalon.net>; Andrew Rybchenko
> > <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Jerin Jacob <jerinjacobk@gmail.com>
> > >> Sent: Wednesday, August 11, 2021 4:03 PM
> > >> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > >> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > >> NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > >> <andrew.rybchenko@oktetlabs.ru>
> > >> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx
> > >> queue
> > >>
> > >> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Jerin Jacob <jerinjacobk@gmail.com>
> > >>>> Sent: Monday, August 9, 2021 9:51 PM
> > >>>> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > >>>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > >>>> <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > >>>> <thomas@monjalon.net>; Andrew Rybchenko
> > >>>> <andrew.rybchenko@oktetlabs.ru>
> > >>>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx
> > >>>> queue
> > >>>>
> > >>>> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > >>>>>
> > >>>>> In current DPDK framework, each RX queue is pre-loaded with
> > >>>>> mbufs for incoming packets. When number of representors scale
> > >>>>> out in a switch domain, the memory consumption became
> > >>>>> significant. Most important, polling all ports leads to high
> > >>>>> cache miss, high latency and low throughput.
> > >>>>>
> > >>>>> This patch introduces shared RX queue. Ports with same
> > >>>>> configuration in a switch domain could share RX queue set by specifying sharing group.
> > >>>>> Polling any queue using same shared RX queue receives packets
> > >>>>> from all member ports. Source port is identified by mbuf->port.
> > >>>>>
> > >>>>> Port queue number in a shared group should be identical. Queue
> > >>>>> index is
> > >>>>> 1:1 mapped in shared group.
> > >>>>>
> > >>>>> Share RX queue is supposed to be polled on same thread.
> > >>>>>
> > >>>>> Multiple groups is supported by group ID.
> > >>>>
> > >>>> Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > >>>
> > >>> Yes, PF and representor in switch domain could take advantage.
> > >>>
> > >>>> If it is for a generic case, how the flow ordering will be maintained?
> > >>>
> > >>> Not quite sure that I understood your question. The control path
> > >>> is almost the same as before: PF and representor ports are still needed, rte flows are not impacted.
> > >>> Queues are still needed for each member port; descriptors (mbufs) will
> > >>> be supplied from the shared Rx queue in my PMD implementation.
> > >>
> > >> My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > >> offload and multiple ethdev receive queues land in the same receive
> > >> queue, how is the flow order maintained for the respective
> > >> receive queues?
> > >
> > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > > Packets from the same source port could be grouped as a small burst to
> > > process, which will improve performance if traffic comes from a
> > > limited number of ports. I'll introduce a common API to do shared rxq
> > > forwarding, called with a packet-handling callback, so it suits
> > > all forwarding engines. Will send patches soon.
> > >
> >
> > All ports will put the packets into the same queue (shared queue),
> > right? Does this mean only a single core will poll? What will happen if there are multiple cores polling, won't it cause problems?
> 
> This has been mentioned in the commit log: the shared rxq is supposed to be polled on a single thread (core) - I think it should be "MUST".
> The result is unexpected if multiple cores are polling; that's why I added a polling schedule check in testpmd.

V2 with testpmd code uploaded, please check.
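
A minimal sketch of the idea behind the shared rxq forwarding, not the
code from the series: a burst polled from a shared Rx queue is cut into
runs of consecutive packets with the same source port, and each run is
handed to a per-stream callback. Everything here is a hypothetical
stand-in so the example compiles without DPDK: `struct pkt` replaces
`struct rte_mbuf`, and `split_by_port`, `burst_cb` and `count_cb` are
illustrative names only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf: only the source port matters. */
struct pkt {
	uint16_t port;	/* source port, playing the role of mbuf->port */
};

/* Hypothetical per-stream callback, shaped like the *_stream() handlers. */
typedef void (*burst_cb)(uint16_t src_port, uint16_t nb_rx,
			 struct pkt **pkts, void *ctx);

/*
 * Walk a burst received from a shared Rx queue and pass each run of
 * consecutive packets with the same source port to the callback.
 * Returns the number of sub-bursts dispatched.
 */
static uint16_t
split_by_port(struct pkt **pkts, uint16_t nb_rx, burst_cb cb, void *ctx)
{
	uint16_t start = 0;
	uint16_t nb_sub = 0;
	uint16_t i;

	assert(pkts != NULL || nb_rx == 0);
	for (i = 1; i <= nb_rx; i++) {
		/* Close the run at the end of the burst or on a port change. */
		if (i == nb_rx || pkts[i]->port != pkts[start]->port) {
			cb(pkts[start]->port, (uint16_t)(i - start),
			   &pkts[start], ctx);
			start = i;
			nb_sub++;
		}
	}
	return nb_sub;
}

/*
 * Example callback: ctx is an array of per-port packet counters that the
 * caller guarantees is large enough for every port ID in the burst.
 */
static void
count_cb(uint16_t src_port, uint16_t nb_rx, struct pkt **pkts, void *ctx)
{
	uint32_t *counters = ctx;

	(void)pkts;
	counters[src_port] += nb_rx;
}
```

Grouping by runs rather than sorting keeps the packet order within each
source port, and the cost stays low when traffic comes from a limited
number of ports.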

> Similarly for the rx/tx burst functions, a queue can't be polled on multiple threads (cores), and for performance reasons there is no such check in the EAL
> API.
> 
> If users want to utilize multiple cores to distribute workloads, it's possible to define more groups; queues in different groups
> could be polled on multiple cores.
> 
> It's possible to poll every member port in a group, but not necessary: any port in the group can be polled to get packets for all ports in
> the group.
> 
> If the member ports are subject to hot plug/remove, it's possible to create a vdev with the same queue number, copy the rxq object and poll the vdev
> as a dedicated proxy for the group.
> 
> >
> > And if this requires specific changes in the application, I am not
> > sure about the solution, can't this work in a transparent way to the application?
> 
> Yes, we considered different options in the design stage. One possible solution is to cache received packets in rings; this can be done at the
> eth layer, but I'm afraid it brings fewer benefits, and the user still has to be aware of multiple-core polling.
> This can be done as a wrapper PMD later, with more effort.
> 
> >
> > Overall, is this for optimizing memory for the port representors? If
> > so can't we have a port representor specific solution, reducing scope can reduce the complexity it brings?
> 
> This feature supports both PF and representors, and yes, the major issue is the memory of representors. Polling all representors also introduces
> more core cache miss latency. This feature essentially aggregates all ports in a group as one port.
> On the other hand, it's useful for rte flow to create offloading flows using a representor as a regular port ID.
> 
> It would be great to hear any new solution/suggestion, my head is buried in PMD code :)
> 
> >
> > >> If this offload is only useful for the representor case, can we make
> > >> this offload specific to the representor case by changing its name and scope?
> > >
> > > It works for both PF and representors in the same switch domain; for applications like OVS, only a few changes are needed.
> > >
> > >>
> > >>
> > >>>
> > >>>>
> > >>>>>
> > >>>>> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > >>>>> ---
> > >>>>>  doc/guides/nics/features.rst                    | 11 +++++++++++
> > >>>>>  doc/guides/nics/features/default.ini            |  1 +
> > >>>>>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > >>>>>  lib/ethdev/rte_ethdev.c                         |  1 +
> > >>>>>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > >>>>>  5 files changed, 30 insertions(+)
> > >>>>>
> > >>>>> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > >>>>> index a96e12d155..2e2a9b1554 100644
> > >>>>> --- a/doc/guides/nics/features.rst
> > >>>>> +++ b/doc/guides/nics/features.rst
> > >>>>> @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > >>>>>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > >>>>>
> > >>>>>
> > >>>>> +.. _nic_features_shared_rx_queue:
> > >>>>> +
> > >>>>> +Shared Rx queue
> > >>>>> +---------------
> > >>>>> +
> > >>>>> +Supports shared Rx queue for ports in same switch domain.
> > >>>>> +
> > >>>>> +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > >>>>> +* **[provides] mbuf**: ``mbuf.port``.
> > >>>>> +
> > >>>>> +
> > >>>>>  .. _nic_features_packet_type_parsing:
> > >>>>>
> > >>>>>  Packet type parsing
> > >>>>> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > >>>>> index 754184ddd4..ebeb4c1851 100644
> > >>>>> --- a/doc/guides/nics/features/default.ini
> > >>>>> +++ b/doc/guides/nics/features/default.ini
> > >>>>> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > >>>>>  Queue start/stop     =
> > >>>>>  Runtime Rx queue setup =
> > >>>>>  Runtime Tx queue setup =
> > >>>>> +Shared Rx queue      =
> > >>>>>  Burst mode info      =
> > >>>>>  Power mgmt address monitor =
> > >>>>>  MTU update           =
> > >>>>> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > >>>>> index ff6aa91c80..45bf5a3a10 100644
> > >>>>> --- a/doc/guides/prog_guide/switch_representation.rst
> > >>>>> +++ b/doc/guides/prog_guide/switch_representation.rst
> > >>>>> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > >>>>>  .. [1] `Ethernet switch device driver model (switchdev)
> > >>>>>
> > >>>>> <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > >>>>>
> > >>>>> +- Memory usage of representors is huge when number of representor grows,
> > >>>>> +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > >>>>> +  Polling the large number of ports brings more CPU load, cache miss and
> > >>>>> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > >>>>> +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > >>>>> +  is present in Rx offloading capability of device info. Setting the
> > >>>>> +  offloading flag in device Rx mode or Rx queue configuration to enable
> > >>>>> +  shared Rx queue. Polling any member port of shared Rx queue can return
> > >>>>> +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > >>>>> +
> > >>>>>  Basic SR-IOV
> > >>>>>  ------------
> > >>>>>
> > >>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > >>>>> index 9d95cd11e1..1361ff759a 100644
> > >>>>> --- a/lib/ethdev/rte_ethdev.c
> > >>>>> +++ b/lib/ethdev/rte_ethdev.c
> > >>>>> @@ -127,6 +127,7 @@ static const struct {
> > >>>>>         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > >>>>>         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > >>>>>         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > >>>>> +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > >>>>>  };
> > >>>>>
> > >>>>>  #undef RTE_RX_OFFLOAD_BIT2STR
> > >>>>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > >>>>> index d2b27c351f..a578c9db9d 100644
> > >>>>> --- a/lib/ethdev/rte_ethdev.h
> > >>>>> +++ b/lib/ethdev/rte_ethdev.h
> > >>>>> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > >>>>>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > >>>>>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > >>>>>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > >>>>> +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> > >>>>>         /**
> > >>>>>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > >>>>>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > >>>>> @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > >>>>>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > >>>>>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > >>>>>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > >>>>> +/**
> > >>>>> + * Rx queue is shared among ports in same switch domain to save memory,
> > >>>>> + * avoid polling each port. Any port in group can be used to receive packets.
> > >>>>> + * Real source port number saved in mbuf->port field.
> > >>>>> + */
> > >>>>> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > >>>>>
> > >>>>>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > >>>>>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > >>>>> --
> > >>>>> 2.25.1
> > >>>>>


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (13 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 15/15] app/testpmd: support shared Rx queue for ieee1588 fwd Xueming Li
@ 2021-08-17  9:33   ` Jerin Jacob
  2021-08-17 11:31     ` Xueming(Steven) Li
  14 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-17  9:33 UTC (permalink / raw)
  To: Xueming Li; +Cc: dpdk-dev, Ferruh Yigit, Thomas Monjalon, Andrew Rybchenko

On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In the current DPDK framework, each RX queue is pre-loaded with mbufs for
> incoming packets. When the number of representors scales out in a switch
> domain, the memory consumption becomes significant. More importantly,
> polling all ports leads to high cache miss rates, high latency and low
> throughput.
>
> This patch introduces shared RX queues. Ports with the same configuration in
> a switch domain can share an RX queue set by specifying a sharing group.
> Polling any queue that uses the same shared RX queue receives packets from all
> member ports. The source port is identified by mbuf->port.
>
> The number of queues of each port in a shared group should be identical.
> Queue indexes are 1:1 mapped within a shared group.
>
> A shared RX queue must be polled on a single thread or core.
>
> Multiple groups are supported by group ID.
>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Cc: Jerin Jacob <jerinjacobk@gmail.com>
> ---
> Rx queue object could be used as shared Rx queue object, it's important
> to clear all queue control callback api that using queue object:
>   https://mails.dpdk.org/archives/dev/2021-July/215574.html

>  #undef RTE_RX_OFFLOAD_BIT2STR
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index d2b27c351f..a578c9db9d 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +       uint32_t shared_group; /**< Shared port group index in switch domain. */

I am not able to see anyone setting or creating this group ID in the test application.
How is this group created?


>         /**
>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> +/**
> + * Rx queue is shared among ports in same switch domain to save memory,
> + * avoid polling each port. Any port in group can be used to receive packets.
> + * Real source port number saved in mbuf->port field.
> + */
> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
>
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> --
> 2.25.1
>

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function Xueming Li
@ 2021-08-17  9:37     ` Jerin Jacob
  2021-08-18 11:27       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-17  9:37 UTC (permalink / raw)
  To: Xueming Li; +Cc: Xiaoyu Min, dpdk-dev, Xiaoyun Li

On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
>
> From: Xiaoyu Min <jackmin@nvidia.com>
>
> Added an inline common wrapper function for all fwd engines
> which do the following in common:
>
> 1. get_start_cycles
> 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly
> 4. get_end_cycle
>
> Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> ---
>  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> index 13141dfed9..b685ac48d6 100644
> --- a/app/test-pmd/testpmd.h
> +++ b/app/test-pmd/testpmd.h
> @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
>  void remove_tx_dynf_callback(portid_t portid);
>  int update_jumbo_frame_offload(portid_t portid);
>
> +static inline void
> +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd)
> +{
> +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> +       uint16_t nb_rx;
> +       uint64_t start_tsc = 0;
> +
> +       get_start_cycles(&start_tsc);
> +
> +       /*
> +        * Receive a burst of packets and forward them.
> +        */
> +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> +                       pkts_burst, nb_pkt_per_burst);
> +       inc_rx_burst_stats(fs, nb_rx);
> +       if (unlikely(nb_rx == 0))
> +               return;
> +       if (unlikely(rxq_share > 0))

See below. This reads a global variable on every burst.

> +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> +       else
> +               (*fwd)(fs, nb_rx, pkts_burst);

This adds a new function pointer call in the fast path.

IMO, we should not create a performance regression for the existing
forward engines.
Can we have a new forward engine just for shared memory testing?
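
To illustrate the separate-engine idea, here is a minimal sketch. The types
and names (struct stream, struct engine, engine_lookup, the "shared_rxq"
engine name) are hypothetical stand-ins, not testpmd's actual definitions.
The only point is that the engine is resolved once at configuration time, so
the existing engine's burst function carries no shared-rxq branch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for testpmd's stream and engine types. */
struct stream {
	unsigned int legacy_bursts;  /* bursts handled by the unchanged engine */
	unsigned int shared_bursts;  /* bursts handled by the shared-rxq engine */
};

typedef void (*burst_fn)(struct stream *s);

struct engine {
	const char *name;
	burst_fn burst;
};

/* Existing engine: no shared-rxq check, so its fast path is unchanged. */
static void
rxonly_burst(struct stream *s)
{
	s->legacy_bursts++;
}

/* Separate engine that would carry all shared-rxq handling. */
static void
shared_rxq_burst(struct stream *s)
{
	s->shared_bursts++;
}

static const struct engine engines[] = {
	{ "rxonly", rxonly_burst },
	{ "shared_rxq", shared_rxq_burst },
};

/* Resolve the engine once at configuration time, not once per burst. */
static const struct engine *
engine_lookup(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(engines) / sizeof(engines[0]); i++)
		if (strcmp(engines[i].name, name) == 0)
			return &engines[i];
	return NULL;
}
```

The polling loop would then call the burst function of whichever engine was
looked up at startup, which is one indirect call per burst for every engine
and no extra global read for the legacy ones.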

> +       get_end_cycles(fs, start_tsc);
> +}
> +
>  /*
>   * Work-around of a compilation error with ICC on invocations of the
>   * rte_be_to_cpu_16() function.
> --
> 2.25.1
>

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-17  9:33   ` [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue Jerin Jacob
@ 2021-08-17 11:31     ` Xueming(Steven) Li
  2021-08-17 15:11       ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-17 11:31 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Tuesday, August 17, 2021 5:33 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >
> > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > incoming packets. When number of representors scale out in a switch
> > domain, the memory consumption became significant. Most important,
> > polling all ports leads to high cache miss, high latency and low
> > throughput.
> >
> > This patch introduces shared RX queue. Ports with same configuration
> > in a switch domain could share RX queue set by specifying sharing group.
> > Polling any queue using same shared RX queue receives packets from all
> > member ports. Source port is identified by mbuf->port.
> >
> > Port queue number in a shared group should be identical. Queue index
> > is
> > 1:1 mapped in shared group.
> >
> > Share RX queue must be polled on single thread or core.
> >
> > Multiple groups is supported by group ID.
> >
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > ---
> > Rx queue object could be used as shared Rx queue object, it's
> > important to clear all queue control callback api that using queue object:
> >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> 
> >  #undef RTE_RX_OFFLOAD_BIT2STR
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > d2b27c351f..a578c9db9d 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > */
> > +       uint32_t shared_group; /**< Shared port group index in switch
> > + domain. */
> 
> Not to able to see anyone setting/creating this group ID test application.
> How this group is created?

Nice catch. The initial testpmd version only supports one default group (0).
All ports that support shared-rxq are assigned to the same group.

We should be able to change "--rxq-shared" to "--rxq-shared-group" to support
groups other than the default.

To support more groups simultaneously, we need to consider testpmd forwarding stream
core assignment: all streams in the same group need to stay on the same core.
It's possible to specify after how many ports the group number is increased, but the user must
schedule stream affinity carefully, which is error prone.

On the other hand, one group should be sufficient for most customers; the doubt is
whether it is valuable to support testing multiple groups.
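
The affinity constraint can be sketched as follows. This is only an
illustration: struct stream_desc and stream_to_lcore() are hypothetical names,
not testpmd code, and the sketch assumes that each queue index inside a group
is an independent shared queue (since queue indexes are 1:1 mapped in a
group), so it keys on (group, queue) and ignores the port.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one forwarding stream (port, queue, group). */
struct stream_desc {
	uint16_t port;
	uint16_t queue;
	uint32_t share_group;
};

/*
 * Pick an lcore index for a stream. The port is deliberately ignored:
 * streams that share (group, queue) poll the same shared Rx queue and
 * therefore must land on the same core.
 */
static unsigned int
stream_to_lcore(const struct stream_desc *s, uint16_t nb_queues,
		unsigned int nb_lcores)
{
	assert(nb_lcores > 0 && s->queue < nb_queues);
	return (unsigned int)((s->share_group * nb_queues + s->queue) %
			      nb_lcores);
}
```

Keying on the group alone (share_group % nb_lcores) would be the stricter
reading of "all streams in the same group stay on the same core".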

> 
> 
> >         /**
> >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >          * Only offloads set on rx_queue_offload_capa or
> > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {  #define
> > DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > +/**
> > + * Rx queue is shared among ports in same switch domain to save
> > +memory,
> > + * avoid polling each port. Any port in group can be used to receive packets.
> > + * Real source port number saved in mbuf->port field.
> > + */
> > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> >
> >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > --
> > 2.25.1
> >

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-17 11:31     ` Xueming(Steven) Li
@ 2021-08-17 15:11       ` Jerin Jacob
  2021-08-18 11:14         ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-17 15:11 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Tuesday, August 17, 2021 5:33 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> >
> > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > >
> > > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > > incoming packets. When number of representors scale out in a switch
> > > domain, the memory consumption became significant. Most important,
> > > polling all ports leads to high cache miss, high latency and low
> > > throughput.
> > >
> > > This patch introduces shared RX queue. Ports with same configuration
> > > in a switch domain could share RX queue set by specifying sharing group.
> > > Polling any queue using same shared RX queue receives packets from all
> > > member ports. Source port is identified by mbuf->port.
> > >
> > > Port queue number in a shared group should be identical. Queue index
> > > is
> > > 1:1 mapped in shared group.
> > >
> > > Share RX queue must be polled on single thread or core.
> > >
> > > Multiple groups is supported by group ID.
> > >
> > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > ---
> > > Rx queue object could be used as shared Rx queue object, it's
> > > important to clear all queue control callback api that using queue object:
> > >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> >
> > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > d2b27c351f..a578c9db9d 100644
> > > --- a/lib/ethdev/rte_ethdev.h
> > > +++ b/lib/ethdev/rte_ethdev.h
> > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > */
> > > +       uint32_t shared_group; /**< Shared port group index in switch
> > > + domain. */
> >
> > Not to able to see anyone setting/creating this group ID test application.
> > How this group is created?
>
> Nice catch, the initial testpmd version only support one default group(0).
> All ports that supports shared-rxq assigned in same group.
>
> We should be able to change "--rxq-shared" to "--rxq-shared-group" to support
> group other than default.
>
> To support more groups simultaneously, need to consider testpmd forwarding stream
> core assignment, all streams in same group need to stay on same core.
> It's possible to specify how many ports to increase group number, but user must
> schedule stream affinity carefully - error prone.
>
> On the other hand, one group should be sufficient for most customer, the doubt is
> whether it valuable to support multiple groups test.

Ack. One group is enough in testpmd.

My question was more about who creates this group and how. Shouldn't we have
an API to create the shared_group? If we do the following, I can at least
see how it could be implemented in SW or on other HW.

- Create an aggregation queue group
- Attach multiple Rx queues to the aggregation queue group
- Pull packets from the queue group (which internally fetches from
  the _attached_ Rx queues)

Does this kind of sequence break your representor use case?
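
To make the three steps concrete, here is a toy software model. Nothing in it
is an existing DPDK API: struct agg_group, struct toy_rxq and the agg_group_*()
functions are hypothetical names, and plain integers stand in for mbufs. It
only shows one way a pure-SW implementation could pull from the attached
queues in round-robin order.

```c
#include <assert.h>
#include <stddef.h>

#define AGG_MAX_QUEUES 8
#define TOY_RING_SIZE 16

/* Toy Rx queue: a fixed array of packet ids standing in for mbufs. */
struct toy_rxq {
	int pkts[TOY_RING_SIZE];
	size_t head;   /* next packet to hand out */
	size_t count;  /* packets currently stored */
};

/* Hypothetical aggregation queue group; not an existing DPDK object. */
struct agg_group {
	struct toy_rxq *members[AGG_MAX_QUEUES];
	size_t nb_members;
	size_t next;   /* round-robin cursor */
};

/* Step 1: create (initialise) an empty group. */
static void
agg_group_create(struct agg_group *g)
{
	g->nb_members = 0;
	g->next = 0;
}

/* Step 2: attach an Rx queue; returns 0 on success, -1 if the group is full. */
static int
agg_group_attach(struct agg_group *g, struct toy_rxq *q)
{
	if (g->nb_members == AGG_MAX_QUEUES)
		return -1;
	g->members[g->nb_members++] = q;
	return 0;
}

/* Step 3: pull up to n packets, visiting attached queues round-robin. */
static size_t
agg_group_pull(struct agg_group *g, int *out, size_t n)
{
	size_t got = 0, idle = 0;

	/* Stop when the burst is full or every member was found empty. */
	while (got < n && idle < g->nb_members) {
		struct toy_rxq *q = g->members[g->next];

		g->next = (g->next + 1) % g->nb_members;
		if (q->count == 0) {
			idle++;
			continue;
		}
		out[got++] = q->pkts[q->head];
		q->head = (q->head + 1) % TOY_RING_SIZE;
		q->count--;
		idle = 0;
	}
	return got;
}
```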


>
> >
> >
> > >         /**
> > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > >          * Only offloads set on rx_queue_offload_capa or
> > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {  #define
> > > DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > +/**
> > > + * Rx queue is shared among ports in same switch domain to save
> > > +memory,
> > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > + * Real source port number saved in mbuf->port field.
> > > + */
> > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > >
> > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > --
> > > 2.25.1
> > >

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-17 15:11       ` Jerin Jacob
@ 2021-08-18 11:14         ` Xueming(Steven) Li
  2021-08-19  5:26           ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-18 11:14 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Tuesday, August 17, 2021 11:12 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > >
> > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > for incoming packets. When number of representors scale out in a
> > > > switch domain, the memory consumption became significant. Most
> > > > important, polling all ports leads to high cache miss, high
> > > > latency and low throughput.
> > > >
> > > > This patch introduces shared RX queue. Ports with same
> > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > Polling any queue using same shared RX queue receives packets from
> > > > all member ports. Source port is identified by mbuf->port.
> > > >
> > > > Port queue number in a shared group should be identical. Queue
> > > > index is
> > > > 1:1 mapped in shared group.
> > > >
> > > > Share RX queue must be polled on single thread or core.
> > > >
> > > > Multiple groups is supported by group ID.
> > > >
> > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > ---
> > > > Rx queue object could be used as shared Rx queue object, it's
> > > > important to clear all queue control callback api that using queue object:
> > > >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > >
> > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > index d2b27c351f..a578c9db9d 100644
> > > > --- a/lib/ethdev/rte_ethdev.h
> > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > */
> > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > + switch domain. */
> > >
> > > Not to able to see anyone setting/creating this group ID test application.
> > > How this group is created?
> >
> > Nice catch, the initial testpmd version only support one default group(0).
> > All ports that supports shared-rxq assigned in same group.
> >
> > We should be able to change "--rxq-shared" to "--rxq-shared-group" to
> > support group other than default.
> >
> > To support more groups simultaneously, need to consider testpmd
> > forwarding stream core assignment, all streams in same group need to stay on same core.
> > It's possible to specify how many ports to increase group number, but
> > user must schedule stream affinity carefully - error prone.
> >
> > On the other hand, one group should be sufficient for most customer,
> > the doubt is whether it valuable to support multiple groups test.
> 
> Ack. One group is enough in testpmd.
> 
> My question was more about who and how this group is created, Should n't we need API to create shared_group? If we do the
> following, at least, I can think, how it can be implemented in SW or other HW.
> 
> - Create aggregation queue group
> - Attach multiple  Rx queues to the aggregation queue group
> - Pull the packets from the queue group(which internally fetch from the Rx queues _attached_)
> 
> Does the above kind of sequence, break your representor use case?

This seems more like a set of EAL wrappers. The current API tries to minimize the application effort needed to adopt shared-rxq.
- Step 1: I'm not sure how important it is to create the group with an API; in rte_flow, a group is created on demand.
- Step 2: currently, attaching is done in rte_eth_rx_queue_setup() by specifying the offload and group in the rx_conf struct.
- Step 3: define a dedicated API to receive packets from the shared rxq? That would make it clear that packets come from a shared rxq.
  Currently, the rxq objects in a share group are all the same object, the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
  be used to receive packets through any port in the group, normally the first port (PF) in the group.
  An alternative is to define a vdev with the same number of queues and copy the rxq objects into it, which makes the vdev a proxy of
  the shared rxq group. This could be a helper API.

Anyway, the wrapper doesn't break the use case. The step 3 API is clearer; we need to understand how to implement it efficiently.

> 
> 
> >
> > >
> > >
> > > >         /**
> > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > >          * Only offloads set on rx_queue_offload_capa or
> > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > +/**
> > > > + * Rx queue is shared among ports in same switch domain to save
> > > > +memory,
> > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > + * Real source port number saved in mbuf->port field.
> > > > + */
> > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > >
> > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > --
> > > > 2.25.1
> > > >

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-17  9:37     ` Jerin Jacob
@ 2021-08-18 11:27       ` Xueming(Steven) Li
  2021-08-18 11:47         ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-18 11:27 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Jack Min, dpdk-dev, Xiaoyun Li



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Tuesday, August 17, 2021 5:37 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >
> > From: Xiaoyu Min <jackmin@nvidia.com>
> >
> > Added an inline common wrapper function for all fwd engines which do
> > the following in common:
> >
> > 1. get_start_cycles
> > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > get_end_cycle
> >
> > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > ---
> >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> >  1 file changed, 24 insertions(+)
> >
> > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > 13141dfed9..b685ac48d6 100644
> > --- a/app/test-pmd/testpmd.h
> > +++ b/app/test-pmd/testpmd.h
> > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > void remove_tx_dynf_callback(portid_t portid);  int
> > update_jumbo_frame_offload(portid_t portid);
> >
> > +static inline void
> > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > +       uint16_t nb_rx;
> > +       uint64_t start_tsc = 0;
> > +
> > +       get_start_cycles(&start_tsc);
> > +
> > +       /*
> > +        * Receive a burst of packets and forward them.
> > +        */
> > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > +                       pkts_burst, nb_pkt_per_burst);
> > +       inc_rx_burst_stats(fs, nb_rx);
> > +       if (unlikely(nb_rx == 0))
> > +               return;
> > +       if (unlikely(rxq_share > 0))
> 
> See below. It reads a global memory.
> 
> > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > +       else
> > +               (*fwd)(fs, nb_rx, pkts_burst);
> 
> New function pointer in fastpath.
> 
> IMO, We should not create performance regression for the existing forward engine.
> Can we have a new forward engine just for shared memory testing?

Yes, I'm fully aware of the performance concern. The global could be defined next to record_core_cycles to minimize the impact.
Based on test data, the impact is almost invisible in legacy mode.

From a test perspective, it is better to have all forward engines verify shared rxq; the test team wants to run the
regression with less impact. I hope to have a solution that uses all forwarding engines seamlessly.

> 
> > +       get_end_cycles(fs, start_tsc); }
> > +
> >  /*
> >   * Work-around of a compilation error with ICC on invocations of the
> >   * rte_be_to_cpu_16() function.
> > --
> > 2.25.1
> >

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-18 11:27       ` Xueming(Steven) Li
@ 2021-08-18 11:47         ` Jerin Jacob
  2021-08-18 14:08           ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-18 11:47 UTC (permalink / raw)
  To: Xueming(Steven) Li; +Cc: Jack Min, dpdk-dev, Xiaoyun Li

On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Tuesday, August 17, 2021 5:37 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> >
> > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > >
> > > From: Xiaoyu Min <jackmin@nvidia.com>
> > >
> > > Added an inline common wrapper function for all fwd engines which do
> > > the following in common:
> > >
> > > 1. get_start_cycles
> > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > get_end_cycle
> > >
> > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > ---
> > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > >  1 file changed, 24 insertions(+)
> > >
> > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > > 13141dfed9..b685ac48d6 100644
> > > --- a/app/test-pmd/testpmd.h
> > > +++ b/app/test-pmd/testpmd.h
> > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > > void remove_tx_dynf_callback(portid_t portid);  int
> > > update_jumbo_frame_offload(portid_t portid);
> > >
> > > +static inline void
> > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > +       uint16_t nb_rx;
> > > +       uint64_t start_tsc = 0;
> > > +
> > > +       get_start_cycles(&start_tsc);
> > > +
> > > +       /*
> > > +        * Receive a burst of packets and forward them.
> > > +        */
> > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > +                       pkts_burst, nb_pkt_per_burst);
> > > +       inc_rx_burst_stats(fs, nb_rx);
> > > +       if (unlikely(nb_rx == 0))
> > > +               return;
> > > +       if (unlikely(rxq_share > 0))
> >
> > See below. It reads a global memory.
> >
> > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > +       else
> > > +               (*fwd)(fs, nb_rx, pkts_burst);
> >
> > New function pointer in fastpath.
> >
> > IMO, We should not create performance regression for the existing forward engine.
> > Can we have a new forward engine just for shared memory testing?
>
> Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> Based on test data, the impacts almost invisible in legacy mode.

Are you saying there is zero % regression? If not, could you share the data?

>
> From test perspective, better to have all forward engine to verify shared rxq, test team want to run the
> regression with less impacts. Hope to have a solution to utilize all forwarding engines seamlessly.

Yes, it is a good goal. Everyone uses testpmd forward performance as a
synthetic benchmark.
I think we are aligned on not having any regression for the generic
forward engine.

>
> >
> > > +       get_end_cycles(fs, start_tsc); }
> > > +
> > >  /*
> > >   * Work-around of a compilation error with ICC on invocations of the
> > >   * rte_be_to_cpu_16() function.
> > > --
> > > 2.25.1
> > >

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-18 11:47         ` Jerin Jacob
@ 2021-08-18 14:08           ` Xueming(Steven) Li
  2021-08-26 11:28             ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-18 14:08 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Jack Min, dpdk-dev, Xiaoyun Li



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, August 18, 2021 7:48 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun
> > > Li <xiaoyun.li@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > wrapper function
> > >
> > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > >
> > > > Added an inline common wrapper function for all fwd engines which
> > > > do the following in common:
> > > >
> > > > 1. get_start_cycles
> > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > get_end_cycle
> > > >
> > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > ---
> > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > >  1 file changed, 24 insertions(+)
> > > >
> > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > > > 13141dfed9..b685ac48d6 100644
> > > > --- a/app/test-pmd/testpmd.h
> > > > +++ b/app/test-pmd/testpmd.h
> > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > > > void remove_tx_dynf_callback(portid_t portid);  int
> > > > update_jumbo_frame_offload(portid_t portid);
> > > >
> > > > +static inline void
> > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > +       uint16_t nb_rx;
> > > > +       uint64_t start_tsc = 0;
> > > > +
> > > > +       get_start_cycles(&start_tsc);
> > > > +
> > > > +       /*
> > > > +        * Receive a burst of packets and forward them.
> > > > +        */
> > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > +       if (unlikely(nb_rx == 0))
> > > > +               return;
> > > > +       if (unlikely(rxq_share > 0))
> > >
> > > See below. It reads a global memory.
> > >
> > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > +       else
> > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > >
> > > New function pointer in fastpath.
> > >
> > > IMO, We should not create performance regression for the existing forward engine.
> > > Can we have a new forward engine just for shared memory testing?
> >
> > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > Based on test data, the impacts almost invisible in legacy mode.
> 
> Are you saying there is zero % regression? If not, could you share the data?

Almost zero. Here is a quick single-core result for rxonly:
	32.2 Mpps, 58.9 cycles/packet
With the patch to rxonly.c reverted:
	32.1 Mpps, 59.9 cycles/packet
That result didn't make sense, and I realized that I had used batch mbuf free. With it applied now:
	32.2 Mpps, 58.9 cycles/packet
The numbers fluctuated slightly between testpmd restarts; I picked the best one.
The results are almost the same, so the per-packet cost seems small enough.
BTW, I'm testing with the default burst size and queue depth.
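
For reference, the gap between the two differing runs above works out to
roughly 1.7% in cycles/packet (58.9 vs 59.9) and roughly 0.3% in Mpps
(32.2 vs 32.1). A small helper to reproduce that arithmetic (rel_diff_pct()
is just an illustrative name):

```c
#include <assert.h>

/* Relative difference of 'measured' against 'baseline', in percent. */
static double
rel_diff_pct(double baseline, double measured)
{
	assert(baseline > 0.0);
	return (measured - baseline) / baseline * 100.0;
}
```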

> 
> >
> > From test perspective, better to have all forward engine to verify
> > shared rxq, test team want to run the regression with less impacts. Hope to have a solution to utilize all forwarding engines
> seamlessly.
> 
> Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> I think, we are aligned to not have any regression for the generic forward engine.
> 
> >
> > >
> > > > +       get_end_cycles(fs, start_tsc); }
> > > > +
> > > >  /*
> > > >   * Work-around of a compilation error with ICC on invocations of the
> > > >   * rte_be_to_cpu_16() function.
> > > > --
> > > > 2.25.1
> > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-18 11:14         ` Xueming(Steven) Li
@ 2021-08-19  5:26           ` Jerin Jacob
  2021-08-19 12:09             ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-19  5:26 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Tuesday, August 17, 2021 11:12 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> >
> > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > >
> > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > for incoming packets. When number of representors scale out in a
> > > > > switch domain, the memory consumption became significant. Most
> > > > > important, polling all ports leads to high cache miss, high
> > > > > latency and low throughput.
> > > > >
> > > > > This patch introduces shared RX queue. Ports with same
> > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > Polling any queue using same shared RX queue receives packets from
> > > > > all member ports. Source port is identified by mbuf->port.
> > > > >
> > > > > Port queue number in a shared group should be identical. Queue
> > > > > index is
> > > > > 1:1 mapped in shared group.
> > > > >
> > > > > Share RX queue must be polled on single thread or core.
> > > > >
> > > > > Multiple groups is supported by group ID.
> > > > >
> > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > ---
> > > > > Rx queue object could be used as shared Rx queue object, it's
> > > > > important to clear all queue control callback api that using queue object:
> > > > >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > >
> > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > index d2b27c351f..a578c9db9d 100644
> > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > */
> > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > + switch domain. */
> > > >
> > > > Not to able to see anyone setting/creating this group ID test application.
> > > > How this group is created?
> > >
> > > Nice catch, the initial testpmd version only support one default group(0).
> > > All ports that supports shared-rxq assigned in same group.
> > >
> > > We should be able to change "--rxq-shared" to "--rxq-shared-group" to
> > > support group other than default.
> > >
> > > To support more groups simultaneously, need to consider testpmd
> > > forwarding stream core assignment, all streams in same group need to stay on same core.
> > > It's possible to specify how many ports to increase group number, but
> > > user must schedule stream affinity carefully - error prone.
> > >
> > > On the other hand, one group should be sufficient for most customer,
> > > the doubt is whether it valuable to support multiple groups test.
> >
> > Ack. One group is enough in testpmd.
> >
> > My question was more about who and how this group is created, Should n't we need API to create shared_group? If we do the
> > following, at least, I can think, how it can be implemented in SW or other HW.
> >
> > - Create aggregation queue group
> > - Attach multiple  Rx queues to the aggregation queue group
> > - Pull the packets from the queue group(which internally fetch from the Rx queues _attached_)
> >
> > Does the above kind of sequence, break your representor use case?
>
> Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.

Which rte_flow pattern/action for this?

> - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
>   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
>   be used to receive packets from any ports in group, normally the first port(PF) in group.
>   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
>   the shared rxq group - this could be an helper API.
>
> Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.

Are you doing this feature based on HW support, or is it a pure SW
thing? If it is SW, it is better to have
just a new vdev, like drivers/net/bonding/. That way we can
aggregate multiple Rxqs across multiple ports
of the same driver.


>
> >
> >
> > >
> > > >
> > > >
> > > > >         /**
> > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > +/**
> > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > +memory,
> > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > + * Real source port number saved in mbuf->port field.
> > > > > + */
> > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > >
> > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > --
> > > > > 2.25.1
> > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-19  5:26           ` Jerin Jacob
@ 2021-08-19 12:09             ` Xueming(Steven) Li
  2021-08-26 11:58               ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-19 12:09 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, August 19, 2021 1:27 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > >
> > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > In current DPDK framework, each RX queue is pre-loaded with
> > > > > > mbufs for incoming packets. When number of representors scale
> > > > > > out in a switch domain, the memory consumption became
> > > > > > significant. Most important, polling all ports leads to high
> > > > > > cache miss, high latency and low throughput.
> > > > > >
> > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > Polling any queue using same shared RX queue receives packets
> > > > > > from all member ports. Source port is identified by mbuf->port.
> > > > > >
> > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > index is
> > > > > > 1:1 mapped in shared group.
> > > > > >
> > > > > > Share RX queue must be polled on single thread or core.
> > > > > >
> > > > > > Multiple groups is supported by group ID.
> > > > > >
> > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > ---
> > > > > > Rx queue object could be used as shared Rx queue object, it's
> > > > > > important to clear all queue control callback api that using queue object:
> > > > > >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > >
> > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > */
> > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > + switch domain. */
> > > > >
> > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > How this group is created?
> > > >
> > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > All ports that supports shared-rxq assigned in same group.
> > > >
> > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > to support group other than default.
> > > >
> > > > To support more groups simultaneously, need to consider testpmd
> > > > forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > It's possible to specify how many ports to increase group number,
> > > > but user must schedule stream affinity carefully - error prone.
> > > >
> > > > On the other hand, one group should be sufficient for most
> > > > customer, the doubt is whether it valuable to support multiple groups test.
> > >
> > > Ack. One group is enough in testpmd.
> > >
> > > My question was more about who and how this group is created, Should
> > > n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW or other HW.
> > >
> > > - Create aggregation queue group
> > > - Attach multiple  Rx queues to the aggregation queue group
> > > - Pull the packets from the queue group(which internally fetch from
> > > the Rx queues _attached_)
> > >
> > > Does the above kind of sequence, break your representor use case?
> >
> > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> 
> Which rte_flow pattern/action for this?

No rte_flow for this; I just recalled that a group in rte_flow is created along with the flow, not via a dedicated API.
I don’t see anything else to create along with the group, so I just doubt whether it is valuable to introduce a new API set to manage groups.

> 
> > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> >   the shared rxq group - this could be an helper API.
> >
> > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> 
> Are you doing this feature based on any HW support or it just pure SW thing, If it is SW, It is better to have just new vdev for like
> drivers/net/bonding/. This we can help aggregate multiple Rxq across the multiple ports of same the driver.

Based on HW support. 

Most users might use the PF in the group as the anchor port for rx burst, so the current definition should be easy for them to migrate to.
But some users might prefer grouping some hot-plugged/unplugged representors. EAL could provide wrappers, or users could do
that themselves since the strategy is not complex. Anyway, any suggestion is welcome.

> 
> 
> >
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > >         /**
> > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > +/**
> > > > > > + * Rx queue is shared among ports in same switch domain to
> > > > > > +save memory,
> > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > + */
> > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > >
> > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > --
> > > > > > 2.25.1
> > > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-18 14:08           ` Xueming(Steven) Li
@ 2021-08-26 11:28             ` Jerin Jacob
  2021-08-29  7:07               ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-26 11:28 UTC (permalink / raw)
  To: Xueming(Steven) Li; +Cc: Jack Min, dpdk-dev, Xiaoyun Li

On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Wednesday, August 18, 2021 7:48 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> >
> > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun
> > > > Li <xiaoyun.li@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > > wrapper function
> > > >
> > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > >
> > > > > Added an inline common wrapper function for all fwd engines which
> > > > > do the following in common:
> > > > >
> > > > > 1. get_start_cycles
> > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > > get_end_cycle
> > > > >
> > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > ---
> > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > >  1 file changed, 24 insertions(+)
> > > > >
> > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > > > > 13141dfed9..b685ac48d6 100644
> > > > > --- a/app/test-pmd/testpmd.h
> > > > > +++ b/app/test-pmd/testpmd.h
> > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > > > > void remove_tx_dynf_callback(portid_t portid);  int
> > > > > update_jumbo_frame_offload(portid_t portid);
> > > > >
> > > > > +static inline void
> > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > +       uint16_t nb_rx;
> > > > > +       uint64_t start_tsc = 0;
> > > > > +
> > > > > +       get_start_cycles(&start_tsc);
> > > > > +
> > > > > +       /*
> > > > > +        * Receive a burst of packets and forward them.
> > > > > +        */
> > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > +       if (unlikely(nb_rx == 0))
> > > > > +               return;
> > > > > +       if (unlikely(rxq_share > 0))
> > > >
> > > > See below. It reads a global memory.
> > > >
> > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > +       else
> > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > >
> > > > New function pointer in fastpath.
> > > >
> > > > IMO, We should not create performance regression for the existing forward engine.
> > > > Can we have a new forward engine just for shared memory testing?
> > >
> > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > Based on test data, the impacts almost invisible in legacy mode.
> >
> > Are you saying there is zero % regression? If not, could you share the data?
>
> Almost zero, here is a quick single core result of rxonly:
>         32.2Mpps, 58.9cycles/packet
> Revert the patch to rxonly.c:
>         32.1Mpps 59.9cycles/packet
> The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
>         32.2Mpps, 58.9cycles/packet
> There were small digit jumps between testpmd restart, I picked the best one.
> The result is almost same, seems the cost of each packet is small enough.
> BTW, I'm testing with default burst size and queue depth.

I tested this on octeontx2 with iofwd on a single core at 100Gbps:
Without this patch - 73.5 Mpps
With this patch - 72.8 Mpps
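
For reference, the relative drop implied by the two figures above can be worked out as below. This is only arithmetic on the quoted numbers; the helper name regression_pct is made up for illustration and is not part of testpmd.

```c
/* Relative throughput drop, in percent, of "patched" versus "base". */
static double
regression_pct(double base_mpps, double patched_mpps)
{
	return (base_mpps - patched_mpps) / base_mpps * 100.0;
}
```

With the numbers above, regression_pct(73.5, 72.8) comes to roughly 0.95, i.e. a drop of just under 1%.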

If we take the shared queue as a runtime option without a separate fwd engine,
and want zero performance impact and no compile-time flag,
then I think the only way is to have a function template.
Example change to outline the function template principle:

static inline void
__pkt_burst_io_forward(struct fwd_stream *fs, const uint64_t flags)
{
        /* Introduce the new checks under: */
        if (flags & SHARED_QUEUE) {
                /* shared Rx queue handling */
        }
}

Have two versions of io_fwd_engine.packet_fwd per engine.

- First version
static void
pkt_burst_io_forward(struct fwd_stream *fs)
{
        __pkt_burst_io_forward(fs, 0);
}

- Second version
static void
pkt_burst_io_forward_shared_queue(struct fwd_stream *fs)
{
        __pkt_burst_io_forward(fs, SHARED_QUEUE);
}

Update io_fwd_engine.packet_fwd in the slow path to the respective version
based on the offload.

If the shared offload is not selected, pkt_burst_io_forward() will be selected and
__pkt_burst_io_forward() will be compiled as the !SHARED_QUEUE version,
i.e. the same as the existing code.
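
The function template principle outlined above can be sketched as a standalone program fragment. This is a minimal illustration under stated assumptions, not testpmd code: struct fwd_stream here is a two-field stand-in, FWD_F_SHARED_RXQ, pkt_burst_io_forward_tmpl() and select_io_forward() are hypothetical names, and the "forwarding" only bumps counters. In DPDK itself the template body would normally be marked __rte_always_inline so the constant flag is reliably folded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; not part of testpmd. */
#define FWD_F_SHARED_RXQ (UINT64_C(1) << 0)

/* Minimal stand-in for testpmd's struct fwd_stream. */
struct fwd_stream {
	uint64_t rx_packets;     /* packets handled by this stream */
	uint64_t shared_lookups; /* per-packet source-port lookups done */
};

/*
 * Template body. "flags" is a compile-time constant at every call site
 * below, so once this is inlined the compiler can drop the untaken branch.
 */
static inline void
pkt_burst_io_forward_tmpl(struct fwd_stream *fs, uint16_t nb_rx,
			  const uint64_t flags)
{
	assert(fs != NULL);
	if (flags & FWD_F_SHARED_RXQ)
		fs->shared_lookups += nb_rx; /* extra work only in shared mode */
	fs->rx_packets += nb_rx;
}

/* First version: legacy path, no shared-queue checks are compiled in. */
static void
pkt_burst_io_forward(struct fwd_stream *fs, uint16_t nb_rx)
{
	pkt_burst_io_forward_tmpl(fs, nb_rx, 0);
}

/* Second version: shared Rx queue path. */
static void
pkt_burst_io_forward_shared_queue(struct fwd_stream *fs, uint16_t nb_rx)
{
	pkt_burst_io_forward_tmpl(fs, nb_rx, FWD_F_SHARED_RXQ);
}

typedef void (*io_fwd_fn)(struct fwd_stream *fs, uint16_t nb_rx);

/* Slow path: pick the version once, based on the configured offload. */
static io_fwd_fn
select_io_forward(int shared_rxq_enabled)
{
	return shared_rxq_enabled ? pkt_burst_io_forward_shared_queue
				  : pkt_burst_io_forward;
}
```

The point of the design is that the per-burst code never tests a global: the only runtime decision is which function pointer the slow path installs, and that pointer call already exists in the forwarding loop today.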

>
> >
> > >
> > > From test perspective, better to have all forward engine to verify
> > > shared rxq, test team want to run the regression with less impacts. Hope to have a solution to utilize all forwarding engines
> > seamlessly.
> >
> > Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> > I think, we are aligned to not have any regression for the generic forward engine.
> >
> > >
> > > >
> > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > +
> > > > >  /*
> > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > >   * rte_be_to_cpu_16() function.
> > > > > --
> > > > > 2.25.1
> > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-19 12:09             ` Xueming(Steven) Li
@ 2021-08-26 11:58               ` Jerin Jacob
  2021-08-28 14:16                 ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-26 11:58 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Thursday, August 19, 2021 1:27 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> >
> > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > >
> > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > In current DPDK framework, each RX queue is pre-loaded with
> > > > > > > mbufs for incoming packets. When number of representors scale
> > > > > > > out in a switch domain, the memory consumption became
> > > > > > > significant. Most important, polling all ports leads to high
> > > > > > > cache miss, high latency and low throughput.
> > > > > > >
> > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > Polling any queue using same shared RX queue receives packets
> > > > > > > from all member ports. Source port is identified by mbuf->port.
> > > > > > >
> > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > index is
> > > > > > > 1:1 mapped in shared group.
> > > > > > >
> > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > >
> > > > > > > Multiple groups is supported by group ID.
> > > > > > >
> > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > ---
> > > > > > > Rx queue object could be used as shared Rx queue object, it's
> > > > > > > important to clear all queue control callback api that using queue object:
> > > > > > >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > >
> > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > */
> > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > + switch domain. */
> > > > > >
> > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > How this group is created?
> > > > >
> > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > All ports that supports shared-rxq assigned in same group.
> > > > >
> > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > to support group other than default.
> > > > >
> > > > > To support more groups simultaneously, need to consider testpmd
> > > > > forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > It's possible to specify how many ports to increase group number,
> > > > > but user must schedule stream affinity carefully - error prone.
> > > > >
> > > > > On the other hand, one group should be sufficient for most
> > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > >
> > > > Ack. One group is enough in testpmd.
> > > >
> > > > My question was more about who and how this group is created, Should
> > > > n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW or other HW.
> > > >
> > > > - Create aggregation queue group
> > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > - Pull the packets from the queue group(which internally fetch from
> > > > the Rx queues _attached_)
> > > >
> > > > Does the above kind of sequence, break your representor use case?
> > >
> > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> >
> > Which rte_flow pattern/action for this?
>
> No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.

See below.

>
> >
> > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > >   the shared rxq group - this could be an helper API.
> > >
> > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> >
> > Are you doing this feature based on any HW support or it just pure SW thing, If it is SW, It is better to have just new vdev for like
> > drivers/net/bonding/. This we can help aggregate multiple Rxq across the multiple ports of same the driver.
>
> Based on HW support.

In Marvell HW, we do have some support; I will outline it here along with some queries.

# We need to create a new HW structure for aggregation
# Connect each Rxq to the new HW structure for aggregation
# Use rx_burst from the new HW structure.

Could you outline your HW support?

Also, I am not able to understand how this will reduce memory;
at least in our HW, we need to allocate more memory to deal with this,
as we need to handle the new HW structure.

How does it reduce memory in your HW? Also, if memory is the
constraint, why NOT reduce the number of queues?

# Also, I was thinking of one way to avoid the fast path or ABI change:

# The driver initializes one more eth_dev_ops in the driver as an aggregator ethdev
# devargs of the new ethdev, or a specific API like
drivers/net/bonding/rte_eth_bond.h, can take as arguments the (port, queue)
tuples which need to be aggregated by the new ethdev port
# No change in fastpath or ABI is required in this model.
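
To make the (port, queue) tuple model concrete, here is a rough software-only sketch of an aggregator that pulls from its attached Rx queues in round-robin order. Everything in it is an assumption for illustration: struct rx_aggregator, agg_attach(), agg_rx_burst() and stub_rx_burst() are hypothetical names, and the stub stands in for a real per-queue receive call such as rte_eth_rx_burst(). A HW-backed driver would receive from its aggregation structure directly instead of looping over members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AGG_MAX_MEMBERS 8

/* Per-queue receive hook; stands in for a call like rte_eth_rx_burst(). */
typedef uint16_t (*rx_burst_fn)(uint16_t port, uint16_t queue,
				void **pkts, uint16_t n);

struct rxq_ref {
	uint16_t port;
	uint16_t queue;
};

/* Hypothetical aggregator port state: the attached (port, queue) tuples. */
struct rx_aggregator {
	struct rxq_ref members[AGG_MAX_MEMBERS];
	uint16_t nb_members;
	uint16_t next;        /* member to poll first on the next burst */
	rx_burst_fn rx_burst;
};

/* Attach one (port, queue) tuple; returns 0 on success, -1 if full. */
static int
agg_attach(struct rx_aggregator *agg, uint16_t port, uint16_t queue)
{
	if (agg->nb_members == AGG_MAX_MEMBERS)
		return -1;
	agg->members[agg->nb_members++] = (struct rxq_ref){ port, queue };
	return 0;
}

/* Poll each member at most once, round-robin, until "n" packets are held. */
static uint16_t
agg_rx_burst(struct rx_aggregator *agg, void **pkts, uint16_t n)
{
	uint16_t got = 0;

	assert(agg->rx_burst != NULL);
	for (uint16_t i = 0; i < agg->nb_members && got < n; i++) {
		const struct rxq_ref *m = &agg->members[agg->next];

		agg->next = (uint16_t)((agg->next + 1) % agg->nb_members);
		got += agg->rx_burst(m->port, m->queue, pkts + got,
				     (uint16_t)(n - got));
	}
	return got;
}

/* Test stub: pretends every queue always has two packets pending. */
static uint16_t
stub_rx_burst(uint16_t port, uint16_t queue, void **pkts, uint16_t n)
{
	static int dummy_pkt;
	uint16_t cnt = n < 2 ? n : 2;

	(void)port;
	(void)queue;
	for (uint16_t i = 0; i < cnt; i++)
		pkts[i] = &dummy_pkt;
	return cnt;
}
```

In this model the application polls only the aggregator, and the existing per-port rx_burst path is left untouched.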



> Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> but some user might prefer grouping some hot plug/unpluggedrepresentors, EAL could provide wrappers, users could do
> that either due to the strategy not complex enough. Anyway, welcome any suggestion.
>
> >
> >
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >         /**
> > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > +/**
> > > > > > > + * Rx queue is shared among ports in same switch domain to
> > > > > > > +save memory,
> > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > + */
> > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > >
> > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > >


* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-26 11:58               ` Jerin Jacob
@ 2021-08-28 14:16                 ` Xueming(Steven) Li
  2021-08-30  9:31                   ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-28 14:16 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, August 26, 2021 7:58 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 19, 2021 1:27 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > >
> > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > queue
> > > > > > >
> > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > consumption became significant. Most important, polling
> > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > >
> > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > >
> > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > Queue index is
> > > > > > > > 1:1 mapped in shared group.
> > > > > > > >
> > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > >
> > > > > > > > Multiple groups is supported by group ID.
> > > > > > > >
> > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > ---
> > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > >
> > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > >
> > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > */
> > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > + index in switch domain. */
> > > > > > >
> > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > How this group is created?
> > > > > >
> > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > >
> > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > to support group other than default.
> > > > > >
> > > > > > To support more groups simultaneously, need to consider
> > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > It's possible to specify how many ports to increase group
> > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > >
> > > > > > On the other hand, one group should be sufficient for most
> > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > >
> > > > > Ack. One group is enough in testpmd.
> > > > >
> > > > > My question was more about who and how this group is created,
> > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> or other HW.
> > > > >
> > > > > - Create aggregation queue group
> > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > - Pull the packets from the queue group(which internally fetch
> > > > > from the Rx queues _attached_)
> > > > >
> > > > > Does the above kind of sequence, break your representor use case?
> > > >
> > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > >
> > > Which rte_flow pattern/action for this?
> >
> > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > I don't see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> 
> See below.
> 
> >
> > >
> > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > >   the shared rxq group - this could be an helper API.
> > > >
> > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > >
> > > Are you doing this feature based on any HW support or it just pure
> > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> the multiple ports of same the driver.
> >
> > Based on HW support.
> 
> In Marvel HW, we do some support, I will outline here and some queries on this.
> 
> # We need to create some new HW structure for aggregation # Connect each Rxq to the new HW structure for aggregation # Use
> rx_burst from the new HW structure.
> 
> Could you outline your HW support?
> 
> Also, I am not able to understand how this will reduce the memory, atleast in our HW need creating more memory now to deal this as
> we need to deal new HW structure.
> 
> How is in your HW it reduces the memory? Also, if memory is the constraint, why NOT reduce the number of queues.
> 

Glad to know that Marvell is working on this. What's the status of the driver implementation?

My PMD implementation is very similar: a new HW object, a shared memory pool, is created to replace the per-rxq memory pool.
A legacy rxq is fed with as many allocated mbufs as it has descriptors. Shared rxqs all use the same pool, so there is no need
to supply mbufs for each rxq, only to feed the shared rxq.

So the memory saving is in the mbufs per rxq: even with 1000 representors in a shared rxq group, the mbufs consumed are those of one rxq.
In other words, new members of a shared rxq don't allocate new mbufs to feed their rxq; they share the existing shared rxq (HW mempool).
I agree that the memory required to set up each rxq itself doesn't change much.
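
As a rough illustration of the saving described above, here is a minimal model in C. It assumes one pre-loaded mbuf per Rx descriptor and ignores per-queue control structures. The function names and the 512-descriptor figure used in the note below are assumptions for illustration, not measurements from the PMD.

```c
#include <stdint.h>

/* mbufs pre-loaded when every port owns and feeds its own Rx queues. */
static uint64_t
mbufs_per_port_rxq(uint64_t nb_ports, uint64_t nb_queues, uint64_t nb_desc)
{
	return nb_ports * nb_queues * nb_desc;
}

/*
 * mbufs pre-loaded when all ports in one group share the same queue set:
 * the count no longer depends on the number of member ports.
 */
static uint64_t
mbufs_shared_rxq(uint64_t nb_queues, uint64_t nb_desc)
{
	return nb_queues * nb_desc;
}
```

With 1000 representors, one queue each and an assumed 512 descriptors per queue, the first function gives 512000 mbufs and the second gives 512.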

> # Also, I was thinking, one way to avoid the fast path or ABI change would like.
> 
> # Driver Initializes one more eth_dev_ops in driver as aggregator ethdev # devargs of new ethdev or specific API like
> drivers/net/bonding/rte_eth_bond.h can take the argument (port, queue) tuples which needs to aggregate by new ethdev port # No
> change in fastpath or ABI is required in this model.
> 

This could be an option for accessing the shared rxq. What would the new PMD do differently?
What would the PMD driver need to do differently to create the new device?

Is it important in your implementation? Does it work with the existing rx_burst API?

> 
> 
> > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > but some user might prefer grouping some hot
> > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> Anyway, welcome any suggestion.
> >
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >         /**
> > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > +/**
> > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > +to save memory,
> > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > + */
> > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > >
> > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > | \
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > >


* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-26 11:28             ` Jerin Jacob
@ 2021-08-29  7:07               ` Xueming(Steven) Li
  2021-09-01 14:44                 ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-29  7:07 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Jack Min, dpdk-dev, Xiaoyun Li



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, August 26, 2021 7:28 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Wednesday, August 18, 2021 7:48 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun
> > > Li <xiaoyun.li@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > wrapper function
> > >
> > > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common
> > > > > fwd wrapper function
> > > > >
> > > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > > >
> > > > > > Added an inline common wrapper function for all fwd engines
> > > > > > which do the following in common:
> > > > > >
> > > > > > 1. get_start_cycles
> > > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > > > get_end_cycle
> > > > > >
> > > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > ---
> > > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > > >  1 file changed, 24 insertions(+)
> > > > > >
> > > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> > > > > > index
> > > > > > 13141dfed9..b685ac48d6 100644
> > > > > > --- a/app/test-pmd/testpmd.h
> > > > > > +++ b/app/test-pmd/testpmd.h
> > > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t
> > > > > > portid); void remove_tx_dynf_callback(portid_t portid);  int
> > > > > > update_jumbo_frame_offload(portid_t portid);
> > > > > >
> > > > > > +static inline void
> > > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > > +       uint16_t nb_rx;
> > > > > > +       uint64_t start_tsc = 0;
> > > > > > +
> > > > > > +       get_start_cycles(&start_tsc);
> > > > > > +
> > > > > > +       /*
> > > > > > +        * Receive a burst of packets and forward them.
> > > > > > +        */
> > > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > > +       if (unlikely(nb_rx == 0))
> > > > > > +               return;
> > > > > > +       if (unlikely(rxq_share > 0))
> > > > >
> > > > > See below. It reads a global memory.
> > > > >
> > > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > > +       else
> > > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > > >
> > > > > New function pointer in fastpath.
> > > > >
> > > > > IMO, We should not create performance regression for the existing forward engine.
> > > > > Can we have a new forward engine just for shared memory testing?
> > > >
> > > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > > Based on test data, the impacts almost invisible in legacy mode.
> > >
> > > Are you saying there is zero % regression? If not, could you share the data?
> >
> > Almost zero, here is a quick single core result of rxonly:
> >         32.2Mpps, 58.9cycles/packet
> > Revert the patch to rxonly.c:
> >         32.1Mpps 59.9cycles/packet
> > The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
> >         32.2Mpps, 58.9cycles/packet
> > There were small digit jumps between testpmd restart, I picked the best one.
> > The result is almost same, seems the cost of each packet is small enough.
> > BTW, I'm testing with default burst size and queue depth.
> 
> I tested this on octeontx2 with iofwd with single core with 100Gbps Without this patch - 73.5mpps With this patch - 72.8 mpps
> 
> We are taking the shared queue runtime option without a separate fwd engine.
> and to have zero performance impact and no compile time flag Then I think, only way to have a function template .
> Example change to outline function template principle.
> 
> static inline
> __pkt_burst_io_forward(struct fwd_stream *fs, const u64 flag) {
> 
> Introduce new checks under
> if (flags & SHARED_QUEUE)
> 
> 
> }
> 
> Have two versions of io_fwd_engine.packet_fwd per engine.
> 
> - first version
> static pkt_burst_io_forward(struct fwd_stream *fs) {
>         return __pkt_burst_io_forward(fs, 0); }
> 
> - Second version
> static pkt_burst_io_forward_shared_queue(struct fwd_stream *fs) {
>         return __pkt_burst_io_forward(fs, SHARED_QUEUE); }
> 
> 
> Update io_fwd_engine.packet_fwd in slowpath to respective version based on offload.
> 
> If shared offoad is not selected, pkt_burst_io_forward() will be selected and
> __pkt_burst_io_forward() will be a compile time version of !SHARED_QUEUE aka same as existing coe.

Thanks for testing and for the suggestion. So the only difference in the code above is that the access to the
global rxq_share is replaced by a function parameter, right? Have you tested the performance of this? If not, I can verify it.
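
For reference, the function template principle suggested above can be sketched as follows. This is a reduced, self-contained model and not the testpmd code: struct fwd_stream is cut down to two counters, the flag name FWD_F_SHARED_QUEUE and the helper select_packet_fwd() are hypothetical, and the extra nb_rx parameter is a simplification (the real packet_fwd callback takes only the stream).

```c
#include <stdint.h>

#define FWD_F_SHARED_QUEUE (1ULL << 0)	/* hypothetical flag name */

/* Reduced stand-in for testpmd's struct fwd_stream. */
struct fwd_stream {
	uint64_t rx_packets;
	uint64_t shared_rxq_bursts;
};

/*
 * Template body. Every caller passes a compile-time constant for flags,
 * so once this is inlined the compiler can drop the branch not taken.
 */
static inline void
pkt_burst_io_forward_tmpl(struct fwd_stream *fs, uint16_t nb_rx,
			  const uint64_t flags)
{
	fs->rx_packets += nb_rx;
	if (flags & FWD_F_SHARED_QUEUE)
		fs->shared_rxq_bursts++;	/* shared-rxq-only work goes here */
}

/* First version: no shared queue, intended to match the existing code. */
static void
pkt_burst_io_forward(struct fwd_stream *fs, uint16_t nb_rx)
{
	pkt_burst_io_forward_tmpl(fs, nb_rx, 0);
}

/* Second version: shared queue handling compiled in. */
static void
pkt_burst_io_forward_shared_queue(struct fwd_stream *fs, uint16_t nb_rx)
{
	pkt_burst_io_forward_tmpl(fs, nb_rx, FWD_F_SHARED_QUEUE);
}

typedef void (*packet_fwd_t)(struct fwd_stream *fs, uint16_t nb_rx);

/* Slow path: pick the version once, based on the configured offload. */
static packet_fwd_t
select_packet_fwd(int shared_rxq_enabled)
{
	return shared_rxq_enabled ? pkt_burst_io_forward_shared_queue
				  : pkt_burst_io_forward;
}
```

The point of the design is that the fast path reads no global and adds no new indirect call: the check is on a constant, and the choice between the two versions is made once in the slow path.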

> 
> >
> > >
> > > >
> > > > From test perspective, better to have all forward engine to verify
> > > > shared rxq, test team want to run the regression with less
> > > > impacts. Hope to have a solution to utilize all forwarding engines
> > > seamlessly.
> > >
> > > Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> > > I think, we are aligned to not have any regression for the generic forward engine.
> > >
> > > >
> > > > >
> > > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > > +
> > > > > >  /*
> > > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > > >   * rte_be_to_cpu_16() function.
> > > > > > --
> > > > > > 2.25.1
> > > > > >


* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-28 14:16                 ` Xueming(Steven) Li
@ 2021-08-30  9:31                   ` Jerin Jacob
  2021-08-30 10:13                     ` Xueming(Steven) Li
  2021-09-15 14:45                     ` Xueming(Steven) Li
  0 siblings, 2 replies; 266+ messages in thread
From: Jerin Jacob @ 2021-08-30  9:31 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Thursday, August 26, 2021 7:58 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> >
> > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > >
> > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > > queue
> > > > > > > >
> > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > consumption became significant. Most important, polling
> > > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > >
> > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > >
> > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > Queue index is
> > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > >
> > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > >
> > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > ---
> > > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > > >
> > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > >
> > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > */
> > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > + index in switch domain. */
> > > > > > > >
> > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > How this group is created?
> > > > > > >
> > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > >
> > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > to support group other than default.
> > > > > > >
> > > > > > > To support more groups simultaneously, need to consider
> > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > It's possible to specify how many ports to increase group
> > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > >
> > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > >
> > > > > > Ack. One group is enough in testpmd.
> > > > > >
> > > > > > My question was more about who and how this group is created,
> > > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> > or other HW.
> > > > > >
> > > > > > - Create aggregation queue group
> > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > - Pull the packets from the queue group(which internally fetch
> > > > > > from the Rx queues _attached_)
> > > > > >
> > > > > > Does the above kind of sequence, break your representor use case?
> > > > >
> > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > >
> > > > Which rte_flow pattern/action for this?
> > >
> > > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > > > I don't see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> >
> > See below.
> >
> > >
> > > >
> > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > >   the shared rxq group - this could be an helper API.
> > > > >
> > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > >
> > > > Are you doing this feature based on any HW support or it just pure
> > > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> > the multiple ports of same the driver.
> > >
> > > Based on HW support.
> >
> > In Marvel HW, we do some support, I will outline here and some queries on this.
> >
> > # We need to create some new HW structure for aggregation # Connect each Rxq to the new HW structure for aggregation # Use
> > rx_burst from the new HW structure.
> >
> > Could you outline your HW support?
> >
> > Also, I am not able to understand how this will reduce the memory, atleast in our HW need creating more memory now to deal this as
> > we need to deal new HW structure.
> >
> > How is in your HW it reduces the memory? Also, if memory is the constraint, why NOT reduce the number of queues.
> >
>
> Glad to know that Marvel is working on this, what's the status of driver implementation?
>
> In my PMD implementation, it's very similar, a new HW object shared memory pool is created to replace per rxq memory pool.
> Legacy rxq feed queue with allocated mbufs as number of descriptors, now shared rxqs share the same pool, no need to supply
> mbufs for each rxq, just feed the shared rxq.
>
> So the memory saving reflects to mbuf per rxq, even 1000 representors in shared rxq group, the mbufs consumed is one rxq.
> In other words, new members in shared rxq doesn't allocate new mbufs to feed rxq, just share with existing shared rxq(HW mempool).
> The memory required to setup each rxq doesn't change too much, agree.

We can ask the application to configure the same mempool for multiple
RQs too, right? That is, if the saving comes from sharing the mempool
among multiple RQs.

>
> > # Also, I was thinking, one way to avoid the fast path or ABI change would like.
> >
> > # Driver Initializes one more eth_dev_ops in driver as aggregator ethdev # devargs of new ethdev or specific API like
> > drivers/net/bonding/rte_eth_bond.h can take the argument (port, queue) tuples which needs to aggregate by new ethdev port # No
> > change in fastpath or ABI is required in this model.
> >
>
> This could be an option to access shared rxq. What's the difference of the new PMD?

No ABI or fast path change is required.

> What's the difference of PMD driver to create the new device?
>
> Is it important in your implementation? Does it work with existing rx_burst api?

Yes. It will work with the existing rx_burst API.

>
> >
> >
> > > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > > but some user might prefer grouping some hot
> > > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> > Anyway, welcome any suggestion.
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >         /**
> > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > +/**
> > > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > > +to save memory,
> > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > + */
> > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > >
> > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > | \
> > > > > > > > > --
> > > > > > > > > 2.25.1
> > > > > > > > >


* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-30  9:31                   ` Jerin Jacob
@ 2021-08-30 10:13                     ` Xueming(Steven) Li
  2021-09-15 14:45                     ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-30 10:13 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Monday, August 30, 2021 5:31 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 26, 2021 7:58 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > >
> > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > queue
> > > > > > >
> > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared
> > > > > > > > > Rx queue
> > > > > > > > >
> > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > > consumption became significant. Most important,
> > > > > > > > > > polling all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > >
> > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > > >
> > > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > > Queue index is
> > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > >
> > > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > > >
> > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > ---
> > > > > > > > > > Rx queue object could be used as shared Rx queue
> > > > > > > > > > object, it's important to clear all queue control callback api that using queue object:
> > > > > > > > > >
> > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.h
> > > > > > > > > > tml
> > > > > > > > >
> > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > */
> > > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > > + index in switch domain. */
> > > > > > > > >
> > > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > > How this group is created?
> > > > > > > >
> > > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > > >
> > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > to support group other than default.
> > > > > > > >
> > > > > > > > To support more groups simultaneously, need to consider
> > > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > > It's possible to specify how many ports to increase group
> > > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > > >
> > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > > >
> > > > > > > Ack. One group is enough in testpmd.
> > > > > > >
> > > > > > > My question was more about who and how this group is
> > > > > > > created, Should n't we need API to create shared_group? If
> > > > > > > we do the following, at least, I can think, how it can be
> > > > > > > implemented in SW
> > > or other HW.
> > > > > > >
> > > > > > > - Create aggregation queue group
> > > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > > - Pull the packets from the queue group(which internally
> > > > > > > fetch from the Rx queues _attached_)
> > > > > > >
> > > > > > > Does the above kind of sequence, break your representor use case?
> > > > > >
> > > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > > >
> > > > > Which rte_flow pattern/action for this?
> > > >
> > > > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > > > I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> > >
> > > See below.
> > >
> > > >
> > > > >
> > > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > > >   the shared rxq group - this could be an helper API.
> > > > > >
> > > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > > >
> > > > > Are you doing this feature based on any HW support or it just
> > > > > pure SW thing, If it is SW, It is better to have just new vdev
> > > > > for like drivers/net/bonding/. This we can help aggregate
> > > > > multiple Rxq across
> > > the multiple ports of same the driver.
> > > >
> > > > Based on HW support.
> > >
> > > In Marvell HW, we have some support for this; I will outline it here, along with some queries.
> > >
> > > # We need to create a new HW structure for aggregation
> > > # Connect each Rxq to the new HW structure for aggregation
> > > # Use rx_burst from the new HW structure
> > >
> > > Could you outline your HW support?
> > >
> > > Also, I am not able to understand how this will reduce memory; at least in our HW,
> > > we need to create more memory to deal with the new HW structure.
> > >
> > > How does it reduce memory in your HW? Also, if memory is the constraint, why NOT reduce the number of queues?
> > >
> >
> > Glad to know that Marvell is working on this, what's the status of the driver implementation?
> >
> > My PMD implementation is very similar: a new HW object, a shared memory pool, is created to replace the per-rxq memory pool.
> > A legacy rxq is fed with as many allocated mbufs as it has descriptors;
> > now the shared rxqs share the same pool, so there is no need to supply mbufs for each rxq, just feed the shared rxq.
> >
> > So the memory saving comes from the mbufs per rxq: even with 1000 representors in a shared rxq group, the mbufs consumed are those of one rxq.
> > In other words, new members of a shared rxq don't allocate new mbufs to feed their rxq, they just share the existing shared rxq (HW mempool).
> > The memory required to set up each rxq doesn't change much, agreed.
> 
> We can ask the application to configure the same mempool for multiple RQs too, right? That is, if the saving is based on sharing the mempool
> with multiple RQs.

Yes, using the same mempool is fundamental. The difference is how many mbufs are allocated from the pool.
Assuming 512 descriptors per rxq and 4 rxqs per device, it's 2.3K (mbuf) * 512 * 4 = 4.6M per device.
To support 1000 representors, a 4.6G mempool is needed :)
With shared rxq, only 4.6M (one device's worth) of mbufs are allocated from the mempool; they are shared by all rxqs in the group.
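
For illustration, the estimate above can be reproduced with a small helper. This is only a sketch under stated assumptions: the function names are made up for this example, and "2.3K" is approximated as 2355 bytes per mbuf. It is not a measurement of any real PMD.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of mbuf memory needed to pre-fill the Rx descriptors of
 * nb_ports devices, each with nb_rxq queues of nb_desc descriptors.
 * Hypothetical helper, not part of ethdev.
 */
static inline uint64_t
rxq_mbuf_bytes(uint64_t mbuf_size, uint32_t nb_desc, uint32_t nb_rxq,
	       uint32_t nb_ports)
{
	assert(mbuf_size > 0 && nb_desc > 0 && nb_rxq > 0 && nb_ports > 0);
	return mbuf_size * nb_desc * nb_rxq * nb_ports;
}

/*
 * With a shared Rx queue, only one device's worth of descriptors is
 * fed, regardless of how many member ports join the group.
 */
static inline uint64_t
shared_rxq_mbuf_bytes(uint64_t mbuf_size, uint32_t nb_desc, uint32_t nb_rxq)
{
	return rxq_mbuf_bytes(mbuf_size, nb_desc, nb_rxq, 1);
}
```

With 2355-byte mbufs, 512 descriptors and 4 queues, rxq_mbuf_bytes() gives 4823040 bytes (about 4.6 MiB) for one device and 1000 times that for 1000 representors, while shared_rxq_mbuf_bytes() stays at the single-device figure.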

> 
> >
> > > # Also, I was thinking of one way to avoid the fast path or ABI change:
> > >
> > > # Driver initializes one more eth_dev_ops in the driver as an aggregator ethdev
> > > # devargs of the new ethdev, or a specific API like drivers/net/bonding/rte_eth_bond.h,
> > >   can take the (port, queue) tuples which need to be aggregated by the new ethdev port
> > > # No change in fastpath or ABI is required in this model.
> > >
> >
> > This could be an option to access shared rxq. What's the difference of the new PMD?
> 
> No ABI or fast path change is required.
> 
> > What's the difference of PMD driver to create the new device?
> >
> > Is it important in your implementation? Does it work with existing rx_burst api?
> 
> Yes . It will work with the existing rx_burst API.
> 
> >
> > >
> > >
> > > > Most users will probably use the PF in the group as the anchor port for rx_burst, so the current definition should be easy for them to migrate to.
> > > > Some users might prefer grouping hot-plugged/unplugged representors; EAL could provide wrappers, or users could do that themselves since the strategy is not complex.
> > > > Anyway, any suggestion is welcome.
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >         /**
> > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > >          * Only offloads set on rx_queue_offload_capa
> > > > > > > > > > or rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > rte_eth_conf { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > +/**
> > > > > > > > > > + * Rx queue is shared among ports in same switch
> > > > > > > > > > +domain to save memory,
> > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > + */
> > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > >
> > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > >
> > > > > > > > > > DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > > | \
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > >


* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-29  7:07               ` Xueming(Steven) Li
@ 2021-09-01 14:44                 ` Xueming(Steven) Li
  2021-09-28  5:54                   ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-01 14:44 UTC (permalink / raw)
  To: Xueming(Steven) Li, Jerin Jacob; +Cc: Jack Min, dpdk-dev, Xiaoyun Li



> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Xueming(Steven) Li
> Sent: Sunday, August 29, 2021 3:08 PM
> To: Jerin Jacob <jerinjacobk@gmail.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> 
> 
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Thursday, August 26, 2021 7:28 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li
> > <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > wrapper function
> >
> > On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Wednesday, August 18, 2021 7:48 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common
> > > > fwd wrapper function
> > > >
> > > > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add
> > > > > > common fwd wrapper function
> > > > > >
> > > > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > >
> > > > > > > Added an inline common wrapper function for all fwd engines
> > > > > > > which do the following in common:
> > > > > > >
> > > > > > > 1. get_start_cycles
> > > > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly
> > > > > > > 4. get_end_cycle
> > > > > > >
> > > > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > > ---
> > > > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > > > >  1 file changed, 24 insertions(+)
> > > > > > >
> > > > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> > > > > > > index
> > > > > > > 13141dfed9..b685ac48d6 100644
> > > > > > > --- a/app/test-pmd/testpmd.h
> > > > > > > +++ b/app/test-pmd/testpmd.h
> > > > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t
> > > > > > > portid); void remove_tx_dynf_callback(portid_t portid);  int
> > > > > > > update_jumbo_frame_offload(portid_t portid);
> > > > > > >
> > > > > > > +static inline void
> > > > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > > > +       uint16_t nb_rx;
> > > > > > > +       uint64_t start_tsc = 0;
> > > > > > > +
> > > > > > > +       get_start_cycles(&start_tsc);
> > > > > > > +
> > > > > > > +       /*
> > > > > > > +        * Receive a burst of packets and forward them.
> > > > > > > +        */
> > > > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > > > +       if (unlikely(nb_rx == 0))
> > > > > > > +               return;
> > > > > > > +       if (unlikely(rxq_share > 0))
> > > > > >
> > > > > > See below. It reads a global memory.
> > > > > >
> > > > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > > > +       else
> > > > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > > > >
> > > > > > New function pointer in fastpath.
> > > > > >
> > > > > > IMO, We should not create performance regression for the existing forward engine.
> > > > > > Can we have a new forward engine just for shared memory testing?
> > > > >
> > > > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > > > Based on test data, the impacts almost invisible in legacy mode.
> > > >
> > > > Are you saying there is zero % regression? If not, could you share the data?
> > >
> > > Almost zero, here is a quick single core result of rxonly:
> > >         32.2Mpps, 58.9cycles/packet
> > > Revert the patch to rxonly.c:
> > >         32.1Mpps 59.9cycles/packet
> > > The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
> > >         32.2Mpps, 58.9cycles/packet
> > > There were small digit jumps between testpmd restart, I picked the best one.
> > > The result is almost same, seems the cost of each packet is small enough.
> > > BTW, I'm testing with default burst size and queue depth.
> >
> > I tested this on octeontx2 with iofwd on a single core at 100Gbps:
> > Without this patch - 73.5 Mpps
> > With this patch - 72.8 Mpps
> >
> > We are taking the shared queue as a runtime option without a separate fwd engine.
> > To have zero performance impact and no compile-time flag, I think the only way is to have a function template.
> > Example change to outline the function template principle:
> >
> > static inline void
> > __pkt_burst_io_forward(struct fwd_stream *fs, const uint64_t flags) {
> >
> >         /* Introduce new checks under: */
> >         if (flags & SHARED_QUEUE) {
> >         }
> >
> > }
> >
> > Have two versions of io_fwd_engine.packet_fwd per engine.
> >
> > - First version
> > static void pkt_burst_io_forward(struct fwd_stream *fs) {
> >         __pkt_burst_io_forward(fs, 0); }
> >
> > - Second version
> > static void pkt_burst_io_forward_shared_queue(struct fwd_stream *fs) {
> >         __pkt_burst_io_forward(fs, SHARED_QUEUE); }
> >
> >
> > Update io_fwd_engine.packet_fwd in slowpath to the respective version based on the offload.
> >
> > If the shared offload is not selected, pkt_burst_io_forward() will be selected and
> > __pkt_burst_io_forward() will be a compile-time version of !SHARED_QUEUE, i.e. the same as the existing code.
> 
> Thanks for testing and suggestion. So the only difference here in above code is access to rxq_shared changed to function parameter,
> right? Have you tested this performance? If not, I could verify.

Performance looks better after removing this wrapper and hiding the global variable access as you suggested, thanks!
I also tried adding an rxq_share bit field to struct fwd_stream: the result is the same as with the static function selection, and it needs fewer changes.

> 
> >
> > >
> > > >
> > > > >
> > > > > From test perspective, better to have all forward engine to
> > > > > verify shared rxq, test team want to run the regression with
> > > > > less impacts. Hope to have a solution to utilize all forwarding
> > > > > engines
> > > > seamlessly.
> > > >
> > > > Yes, it is a good goal. Everyone uses testpmd forward performance as a synthetic benchmark.
> > > > I think we are aligned on not having any regression for the generic forward engines.
> > > >
> > > > >
> > > > > >
> > > > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > > > +
> > > > > > >  /*
> > > > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > > > >   * rte_be_to_cpu_16() function.
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > >


* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-30  9:31                   ` Jerin Jacob
  2021-08-30 10:13                     ` Xueming(Steven) Li
@ 2021-09-15 14:45                     ` Xueming(Steven) Li
  2021-09-16  4:16                       ` Jerin Jacob
  1 sibling, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-15 14:45 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

Hi Jerin,

On Mon, 2021-08-30 at 15:01 +0530, Jerin Jacob wrote:
> On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 26, 2021 7:58 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > 
> > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > 
> > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > > 
> > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > > > queue
> > > > > > > > > 
> > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > > consumption became significant. Most important, polling
> > > > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > > 
> > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > 
> > > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > > Queue index is
> > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > 
> > > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > > > 
> > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > ---
> > > > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > > > > 
> > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > > > 
> > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > */
> > > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > > + index in switch domain. */
> > > > > > > > > 
> > > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > > How this group is created?
> > > > > > > > 
> > > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > > > 
> > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > to support group other than default.
> > > > > > > > 
> > > > > > > > To support more groups simultaneously, need to consider
> > > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > > It's possible to specify how many ports to increase group
> > > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > > > 
> > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > > > 
> > > > > > > Ack. One group is enough in testpmd.
> > > > > > > 
> > > > > > > My question was more about who and how this group is created,
> > > > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> > > or other HW.
> > > > > > > 
> > > > > > > - Create aggregation queue group
> > > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > > - Pull the packets from the queue group(which internally fetch
> > > > > > > from the Rx queues _attached_)
> > > > > > > 
> > > > > > > Does the above kind of sequence, break your representor use case?
> > > > > > 
> > > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > > > 
> > > > > Which rte_flow pattern/action for this?
> > > > 
> > > > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > > > I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> > > 
> > > See below.
> > > 
> > > > 
> > > > > 
> > > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > > >   the shared rxq group - this could be an helper API.
> > > > > > 
> > > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > > > 
> > > > > Are you doing this feature based on any HW support or it just pure
> > > > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> > > the multiple ports of same the driver.
> > > > 
> > > > Based on HW support.
> > > 
> > > In Marvell HW, we have some support for this; I will outline it here, along with some queries.
> > > 
> > > # We need to create a new HW structure for aggregation
> > > # Connect each Rxq to the new HW structure for aggregation
> > > # Use rx_burst from the new HW structure
> > > 
> > > Could you outline your HW support?
> > > 
> > > Also, I am not able to understand how this will reduce memory; at least in our HW,
> > > we need to create more memory to deal with the new HW structure.
> > > 
> > > How does it reduce memory in your HW? Also, if memory is the constraint, why NOT reduce the number of queues?
> > > 
> > 
> > Glad to know that Marvell is working on this, what's the status of the driver implementation?
> > 
> > My PMD implementation is very similar: a new HW object, a shared memory pool, is created to replace the per-rxq memory pool.
> > A legacy rxq is fed with as many allocated mbufs as it has descriptors; now the shared rxqs share the same pool, so there is
> > no need to supply mbufs for each rxq, just feed the shared rxq.
> > 
> > So the memory saving comes from the mbufs per rxq: even with 1000 representors in a shared rxq group, the mbufs consumed are those of one rxq.
> > In other words, new members of a shared rxq don't allocate new mbufs to feed their rxq, they just share the existing shared rxq (HW mempool).
> > The memory required to set up each rxq doesn't change much, agreed.
> 
> We can ask the application to configure the same mempool for multiple
> RQ too. RIght? If the saving is based on sharing the mempool
> with multiple RQs.
> 
> > 
> > > # Also, I was thinking, one way to avoid the fast path or ABI change would like.
> > > 
> > > # Driver Initializes one more eth_dev_ops in driver as aggregator ethdev # devargs of new ethdev or specific API like
> > > drivers/net/bonding/rte_eth_bond.h can take the argument (port, queue) tuples which needs to aggregate by new ethdev port # No
> > > change in fastpath or ABI is required in this model.
> > > 
> > 
> > This could be an option to access shared rxq. What's the difference of the new PMD?
> 
> No ABI and fast change are required.
> 
> > What's the difference of PMD driver to create the new device?
> > 
> > Is it important in your implementation? Does it work with existing rx_burst api?
> 
> Yes . It will work with the existing rx_burst API.
> 

The aggregator ethdev the user needs is a port, so it may be good to add
a callback that lets the PMD prepare a complete ethdev, just like
creating a representor ethdev: the PMD registers the new port
internally. If the PMD doesn't provide the callback, the ethdev API
falls back to initializing an empty ethdev by copying the (shared) rxq
data and the rx_burst API from the source port and share group. Users
could also do this fallback themselves or with a utility API.

IIUC, an aggregator ethdev is not a must. Do you think we can continue
and leave that design to a later stage?

> > 
> > > 
> > > 
> > > > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > > > but some user might prefer grouping some hot
> > > > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> > > Anyway, welcome any suggestion.
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > >         /**
> > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > +/**
> > > > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > > > +to save memory,
> > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > + */
> > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > 
> > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > > > \
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11 12:59             ` Xueming(Steven) Li
  2021-08-12 14:35               ` Xueming(Steven) Li
@ 2021-09-15 15:34               ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-15 15:34 UTC (permalink / raw)
  To: jerinjacobk, ferruh.yigit
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev

On Wed, 2021-08-11 at 12:59 +0000, Xueming(Steven) Li wrote:
> 
> > -----Original Message-----
> > From: Ferruh Yigit <ferruh.yigit@intel.com>
> > Sent: Wednesday, August 11, 2021 8:04 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>; Jerin Jacob <jerinjacobk@gmail.com>
> > Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > 
> > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > 
> > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx
> > > > > > queue
> > > > > > 
> > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > 
> > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > latency and low throughput.
> > > > > > > 
> > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > 
> > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > index is
> > > > > > > 1:1 mapped in shared group.
> > > > > > > 
> > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > 
> > > > > > > Multiple groups is supported by group ID.
> > > > > > 
> > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > 
> > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > 
> > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > 
> > > > > Not quite sure that I understood your question. The control path of
> > > > > is almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > supplied from shared Rx queue in my PMD implementation.
> > > > 
> > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > offload, multiple ethdev receive queues land into the same receive queue, In that case, how the flow order is maintained for
> > respective receive queues.
> > > 
> > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > Packets from same source port could be grouped as a small burst to
> > > process, this will accelerates the performance if traffic come from
> > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for all
> > forwarding engine. Will sent patches soon.
> > > 
> > 
> > All ports will put the packets in to the same queue (share queue), right? Does this means only single core will poll only, what will
> > happen if there are multiple cores polling, won't it cause problem?
> 
> This has been mentioned in commit log, the shared rxq is supposed to be polling in single thread(core) - I think it should be "MUST".
> Result is unexpected if there are multiple cores pooling, that's why I added a polling schedule check in testpmd.
> Similar for rx/tx burst function, a queue can't be polled on multiple thread(core), and for performance concern, no such check in eal api.
> 
> If users want to utilize multiple cores to distribute workloads, it's possible to define more groups, queues in different group could be
> could be polled on multiple cores.
> 
> It's possible to poll every member port in group, but not necessary, any port in group could be polled to get packets for all ports in group.
> 
> If the member port subject to hot plug/remove,  it's possible to create a vdev with same queue number, copy rxq object and poll vdev
> as a dedicate proxy for the group.
> 
> > 
> > And if this requires specific changes in the application, I am not sure about the solution, can't this work in a transparent way to the
> > application?
> 
> Yes, we considered different options in design stage. One possible solution is to cache received packets in rings, this can be done on
> eth layer, but I'm afraid less benefits, user still has to be a ware of multiple core polling. 
> This can be done as a wrapper PMD later, more efforts.

People who want to use shared rxq to save memory need to be aware of
the core polling rule: dedicate a core to each shared rxq, the same
rule as for a regular rxq and txq.

I'm afraid specific changes in the application are a must, but not
many: polling one port per group is sufficient. Protections in the data
plane would definitely hurt performance :(
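
A minimal sketch of the kind of application-side change this implies,
assuming the application polls one port of the group and then splits
the burst by source port. The types and names here (struct pkt,
struct port_stats, demux_burst, MAX_PORTS) are hypothetical stand-ins
for illustration, not DPDK APIs; only the port field mirrors
mbuf->port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 8	/* hypothetical upper bound for this sketch */

/* Stand-in for struct rte_mbuf: only the source port field matters here. */
struct pkt {
	uint16_t port;	/* mirrors mbuf->port, the real source port */
};

/* Hypothetical per-port counters kept by the application. */
struct port_stats {
	uint64_t rx_pkts;
};

/*
 * Walk one burst received from a shared Rx queue and account each run of
 * consecutive packets from the same source port to that port's counters.
 * Returns the number of runs (sub-bursts) found.
 */
static unsigned int
demux_burst(struct pkt *const *pkts, size_t n, struct port_stats *stats)
{
	unsigned int runs = 0;
	size_t i = 0;

	while (i < n) {
		uint16_t port = pkts[i]->port;
		size_t start = i;

		assert(port < MAX_PORTS);
		/* Extend the run while the source port stays the same. */
		while (i < n && pkts[i]->port == port)
			i++;
		stats[port].rx_pkts += i - start;
		runs++;
	}
	return runs;
}
```

Grouping consecutive packets from one port into a sub-burst keeps the
per-port handling cheap when traffic comes from only a few ports; a
real application would call its forwarding logic where this sketch
only updates a counter.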

> 
> > 
> > Overall, is this for optimizing memory for the port represontors? If so can't we have a port representor specific solution, reducing
> > scope can reduce the complexity it brings?
> 
> This feature supports both PF and representor, and yes, major issue is memory of representors. Poll all representors also 
> introduces more core cache miss latency. This feature essentially aggregates all ports in group as one port.
> On the other hand, it's useful for rte flow to create offloading flows using representor as a regular port ID.
> 
As discussed with Jerin below, the major memory consumed by a PF or
representor is the mbufs pre-filled into its rxqs. The PMD can't assume
that all representors share the same memory pool, or share rxqs
internally in the PMD, because the user might schedule representors on
different cores. Defining a shared rxq flag and group looks like a good
direction.
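
To make the memory argument concrete, here is a rough
back-of-the-envelope sketch. The function name and every number used
with it are illustrative assumptions, not measurements from any PMD; it
counts only the data buffers pre-filled into Rx descriptors and ignores
mbuf headers and mempool caches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of mbuf memory pre-filled into Rx queues.
 * All parameters are illustrative assumptions, not measured values.
 */
static uint64_t
rxq_mbuf_bytes(uint64_t nb_ports, uint64_t nb_queues,
	       uint64_t nb_desc, uint64_t mbuf_size, int shared)
{
	/* With shared Rx queues, only one port's worth of queues is filled. */
	uint64_t filled_ports = shared ? 1 : nb_ports;

	assert(nb_ports > 0);
	return filled_ports * nb_queues * nb_desc * mbuf_size;
}
```

For example, with 1000 ports, 4 queues per port, 1024 descriptors per
queue and 2048-byte buffers, this gives 8,388,608,000 bytes without
sharing and 8,388,608 bytes with sharing, since new group members add
no pre-filled mbufs.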

> It's great if any new solution/suggestion, my head buried in PMD code :)
> 
> > 
> > > > If this offload is only useful for representor case, Can we make this
> > > > offload specific to representor the case by changing its name and scope.
> > > 
> > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > ---
> > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > 
> > > > > > > 
> > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > +
> > > > > > > +Shared Rx queue
> > > > > > > +---------------
> > > > > > > +
> > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > +
> > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > +
> > > > > > > +
> > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > 
> > > > > > >  Packet type parsing
> > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > >  Queue start/stop     =
> > > > > > >  Runtime Rx queue setup =
> > > > > > >  Runtime Tx queue setup =
> > > > > > > +Shared Rx queue      =
> > > > > > >  Burst mode info      =
> > > > > > >  Power mgmt address monitor =
> > > > > > >  MTU update           =
> > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > 
> > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > `_
> > > > > > > 
> > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > +grows,
> > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > +miss and
> > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > +PF and
> > > > > > > +  representors in same switch domain.
> > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > +the
> > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > +enable
> > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > +return
> > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > +
> > > > > > >  Basic SR-IOV
> > > > > > >  ------------
> > > > > > > 
> > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > >  };
> > > > > > > 
> > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > */
> > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > + switch domain. */
> > > > > > >         /**
> > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > +/**
> > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > +memory,
> > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > + */
> > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > 
> > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > > 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-09-15 14:45                     ` Xueming(Steven) Li
@ 2021-09-16  4:16                       ` Jerin Jacob
  2021-09-28  5:50                         ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-09-16  4:16 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Wed, Sep 15, 2021 at 8:15 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> Hi Jerin,
>
> On Mon, 2021-08-30 at 15:01 +0530, Jerin Jacob wrote:
> > On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Thursday, August 26, 2021 7:58 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > >
> > > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > > >
> > > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > > > > queue
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > > > consumption became significant. Most important, polling
> > > > > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > > >
> > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > >
> > > > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > > > Queue index is
> > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > >
> > > > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > > > >
> > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > ---
> > > > > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > > > > >
> > > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > > > >
> > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > > */
> > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > > > + index in switch domain. */
> > > > > > > > > >
> > > > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > > > How this group is created?
> > > > > > > > >
> > > > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > > > >
> > > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > > to support group other than default.
> > > > > > > > >
> > > > > > > > > To support more groups simultaneously, need to consider
> > > > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > > > It's possible to specify how many ports to increase group
> > > > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > > > >
> > > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > > > >
> > > > > > > > Ack. One group is enough in testpmd.
> > > > > > > >
> > > > > > > > My question was more about who and how this group is created,
> > > > > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> > > > or other HW.
> > > > > > > >
> > > > > > > > - Create aggregation queue group
> > > > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > > > - Pull the packets from the queue group(which internally fetch
> > > > > > > > from the Rx queues _attached_)
> > > > > > > >
> > > > > > > > Does the above kind of sequence, break your representor use case?
> > > > > > >
> > > > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > > > >
> > > > > > Which rte_flow pattern/action for this?
> > > > >
> > > > > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > > > > I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> > > >
> > > > See below.
> > > >
> > > > >
> > > > > >
> > > > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > > > >   the shared rxq group - this could be an helper API.
> > > > > > >
> > > > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > > > >
> > > > > > Are you doing this feature based on any HW support or it just pure
> > > > > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> > > > the multiple ports of same the driver.
> > > > >
> > > > > Based on HW support.
> > > >
> > > > In Marvel HW, we do some support, I will outline here and some queries on this.
> > > >
> > > > # We need to create some new HW structure for aggregation # Connect each Rxq to the new HW structure for aggregation # Use
> > > > rx_burst from the new HW structure.
> > > >
> > > > Could you outline your HW support?
> > > >
> > > > Also, I am not able to understand how this will reduce the memory, atleast in our HW need creating more memory now to deal this as
> > > > we need to deal new HW structure.
> > > >
> > > > How is in your HW it reduces the memory? Also, if memory is the constraint, why NOT reduce the number of queues.
> > > >
> > >
> > > Glad to know that Marvel is working on this, what's the status of driver implementation?
> > >
> > > In my PMD implementation, it's very similar, a new HW object shared memory pool is created to replace per rxq memory pool.
> > > Legacy rxq feed queue with allocated mbufs as number of descriptors, now shared rxqs share the same pool, no need to supply
> > > mbufs for each rxq, just feed the shared rxq.
> > >
> > > So the memory saving reflects to mbuf per rxq, even 1000 representors in shared rxq group, the mbufs consumed is one rxq.
> > > In other words, new members in shared rxq doesn’t allocate new mbufs to feed rxq, just share with existing shared rxq(HW mempool).
> > > The memory required to setup each rxq doesn't change too much, agree.
> >
> > We can ask the application to configure the same mempool for multiple
> > RQ too. RIght? If the saving is based on sharing the mempool
> > with multiple RQs.
> >
> > >
> > > > # Also, I was thinking, one way to avoid the fast path or ABI change would like.
> > > >
> > > > # Driver Initializes one more eth_dev_ops in driver as aggregator ethdev # devargs of new ethdev or specific API like
> > > > drivers/net/bonding/rte_eth_bond.h can take the argument (port, queue) tuples which needs to aggregate by new ethdev port # No
> > > > change in fastpath or ABI is required in this model.
> > > >
> > >
> > > This could be an option to access shared rxq. What's the difference of the new PMD?
> >
> > No ABI and fast change are required.
> >
> > > What's the difference of PMD driver to create the new device?
> > >
> > > Is it important in your implementation? Does it work with existing rx_burst api?
> >
> > Yes . It will work with the existing rx_burst API.
> >
>
> The aggregator ethdev required by user is a port, maybe it good to add
> a callback for PMD to prepare a complete ethdev just like creating
> representor ethdev - pmd register new port internally. If the PMD
> doens't provide the callback, ethdev api fallback to initialize an
> empty ethdev by copy rxq data(shared) and rx_burst api from source port
> and share group. Actually users can do this fallback themselves or with
> an util api.
>
> IIUC, an aggregator ethdev not a must, do you think we can continue and
> leave that design in later stage?


IMO an aggregator ethdev reduces the complexity for the application and
hence avoids any change in the test application etc. I prefer to take
that approach. I will leave the decision to the ethdev maintainers.


>
> > >
> > > >
> > > >
> > > > > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > > > > but some user might prefer grouping some hot
> > > > > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> > > > Anyway, welcome any suggestion.
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >         /**
> > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > +/**
> > > > > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > > > > +to save memory,
> > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > + */
> > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > >
> > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > > > > \
> > > > > > > > > > > --
> > > > > > > > > > > 2.25.1
> > > > > > > > > > >
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 0/8] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (2 preceding siblings ...)
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
@ 2021-09-17  8:01 ` Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 1/8] " Xueming Li
                     ` (7 more replies)
  2021-09-30 14:55 ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue Xueming Li
                   ` (12 subsequent siblings)
  16 siblings, 8 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit

In the current DPDK framework, each Rx queue is pre-loaded with mbufs for
incoming packets. When the number of representors scales out in a switch
domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces shared Rx queue. A PF and representors with the same
configuration in the same switch domain can share an Rx queue set by
specifying the shared Rx queue offload flag and a sharing group.

All ports in a share group actually share one Rx queue, and mbufs are
pre-loaded into that one queue only, which saves memory.

Polling any queue that uses the same shared Rx queue receives packets from
all member ports. The source port is identified by mbuf->port.

Multiple groups are supported by group ID. The number of queues of each
port in a shared group should be identical. Queue indexes are mapped 1:1
within a shared group. An example of polling two share groups:
  core	group	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	1	0
  5	1	1
  6	1	2
  7	1	3
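
The table above follows a simple rule: with 4 queues per group, core N
polls queue N % 4 of group N / 4. A minimal sketch of that mapping,
assuming one polling core per (group, queue) pair. `struct poll_target`
and `poll_target_for_core()` are hypothetical names used only for
illustration and are not part of this series:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the share group and queue index a core polls. */
struct poll_target {
	uint32_t group;
	uint16_t queue;
};

/*
 * Map a core index to a (group, queue) pair, assuming every group has
 * queues_per_group queues and cores are assigned group by group.
 */
static struct poll_target
poll_target_for_core(unsigned int core, uint16_t queues_per_group)
{
	struct poll_target t;

	assert(queues_per_group > 0);
	t.group = core / queues_per_group;
	t.queue = (uint16_t)(core % queues_per_group);
	return t;
}
```

Other core layouts are equally valid as long as each (group, queue) pair
is polled from exactly one core.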

A shared Rx queue must be polled from a single thread or core. If both PF0
and representor0 join the same share group, pf0rxq0 cannot be polled on
core1 while rep0rxq0 is polled on core2. In fact, polling one port within
a share group is sufficient, since polling any port in the group returns
packets for every port in the group.
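
Because one burst can now mix packets from several member ports, the
application has to look at each mbuf's port field instead of assuming the
polled port. A minimal standalone sketch of that per-packet accounting.
`struct fake_mbuf`, `MAX_PORTS` and `demux_burst()` are stand-ins invented
for illustration; a real application would use struct rte_mbuf and the
burst returned by rte_eth_rx_burst():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 8 /* illustrative limit, not a DPDK constant */

/* Stand-in for struct rte_mbuf; only the field used here is modelled. */
struct fake_mbuf {
	uint16_t port; /* source port, as the PMD would set mbuf->port */
};

/*
 * Account a burst received from a shared Rx queue to its source ports.
 * Returns the number of packets whose port was out of range.
 */
static unsigned int
demux_burst(struct fake_mbuf *const pkts[], uint16_t nb_rx,
	    uint64_t rx_packets[MAX_PORTS])
{
	unsigned int out_of_range = 0;
	uint16_t i;

	for (i = 0; i < nb_rx; i++) {
		assert(pkts[i] != NULL);
		if (pkts[i]->port >= MAX_PORTS) {
			out_of_range++;
			continue;
		}
		/* Credit the packet to its real source port. */
		rx_packets[pkts[i]->port]++;
	}
	return out_of_range;
}
```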

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding api to macro for performance, thanks Jerin.
  - save global variable accessed in forwarding to flowstream to minimize
    cache miss
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--rxq-share" parameter
  - new api to aggregate shared rxq group

Xiaoyu Min (1):
  app/testpmd: add common fwd wrapper

Xueming Li (7):
  ethdev: introduce shared Rx queue
  ethdev: new API to aggregate shared Rx queue group
  app/testpmd: dump port and queue info for each packet
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: improve forwarding cache miss
  app/testpmd: support shared Rx queue forwarding

 app/test-pmd/5tswap.c                         |  25 +---
 app/test-pmd/config.c                         | 120 +++++++++++++++++-
 app/test-pmd/csumonly.c                       |  25 +---
 app/test-pmd/flowgen.c                        |  26 ++--
 app/test-pmd/icmpecho.c                       |  30 ++---
 app/test-pmd/ieee1588fwd.c                    |  30 +++--
 app/test-pmd/iofwd.c                          |  24 +---
 app/test-pmd/macfwd.c                         |  24 +---
 app/test-pmd/macswap.c                        |  23 +---
 app/test-pmd/noisy_vnf.c                      |   2 +-
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/rxonly.c                         |  32 ++---
 app/test-pmd/testpmd.c                        |  91 ++++++++++++-
 app/test-pmd/testpmd.h                        |  47 ++++++-
 app/test-pmd/txonly.c                         |   8 +-
 app/test-pmd/util.c                           |   1 +
 doc/guides/nics/features.rst                  |  11 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  10 ++
 doc/guides/testpmd_app_ug/run_app.rst         |   5 +
 lib/ethdev/ethdev_driver.h                    |  23 +++-
 lib/ethdev/rte_ethdev.c                       |  23 ++++
 lib/ethdev/rte_ethdev.h                       |  23 ++++
 lib/ethdev/version.map                        |   3 +
 24 files changed, 432 insertions(+), 188 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 1/8] ethdev: introduce shared Rx queue
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-27 23:53     ` Ajit Khaparde
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 2/8] ethdev: new API to aggregate shared Rx queue group Xueming Li
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit

In the current DPDK framework, each Rx queue is pre-loaded with mbufs for
incoming packets. When the number of representors scales out in a switch
domain, the memory consumption becomes significant. More importantly,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces shared Rx queue. Ports with the same configuration
in a switch domain can share an Rx queue set by specifying a sharing group.
Polling any queue that uses the same shared Rx queue receives packets from
all member ports. The source port is identified by mbuf->port.

The number of queues of each port in a shared group should be identical.
Queue indexes are mapped 1:1 within a shared group.

A shared Rx queue must be polled from a single thread or core.

Multiple groups are supported by group ID.
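
As an illustration of the intended usage, the sketch below requests a
shared Rx queue only when the device advertises the capability. It is a
standalone model, not DPDK code: `struct rxconf_sketch` mirrors just the
two struct rte_eth_rxconf fields touched by this patch,
`request_shared_rxq()` is a hypothetical helper, and the flag value is
copied from the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value copied from this patch; real code gets it from rte_ethdev.h. */
#ifndef RTE_ETH_RX_OFFLOAD_SHARED_RXQ
#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000
#endif

/* Stand-in for the two struct rte_eth_rxconf fields used here. */
struct rxconf_sketch {
	uint32_t shared_group; /* share group index in the switch domain */
	uint64_t offloads;     /* per-queue Rx offload flags */
};

/*
 * Hypothetical helper: enable shared Rx queue in a queue configuration.
 * Returns 0 on success, -1 if the device lacks the capability.
 */
static int
request_shared_rxq(struct rxconf_sketch *conf, uint64_t rx_offload_capa,
		   uint32_t group)
{
	assert(conf != NULL);
	if ((rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) == 0)
		return -1; /* leave the configuration untouched */
	conf->offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	conf->shared_group = group;
	return 0;
}
```

In a real application the capability mask would come from the device info
and the filled-in configuration would be passed to the Rx queue setup call.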

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Cc: Jerin Jacob <jerinjacobk@gmail.com>
---
The Rx queue object could be used as the shared Rx queue object, so it is
important to review all queue control callback APIs that use the queue object:
  https://mails.dpdk.org/archives/dev/2021-July/215574.html
---
 doc/guides/nics/features.rst                    | 11 +++++++++++
 doc/guides/nics/features/default.ini            |  1 +
 doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
 lib/ethdev/rte_ethdev.c                         |  1 +
 lib/ethdev/rte_ethdev.h                         |  7 +++++++
 5 files changed, 30 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index a96e12d155..2e2a9b1554 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same switch domain.
+
+* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 754184ddd4..ebeb4c1851 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c80..45bf5a3a10 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- Memory usage of representors is huge when the number of representors grows,
+  because the PMD always allocates an mbuf for each descriptor of an Rx queue.
+  Polling a large number of ports brings more CPU load, cache misses and
+  latency. A shared Rx queue can be used to share an Rx queue between the PF
+  and representors in the same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
+  is present in the Rx offloading capability of device info. Set the
+  offloading flag in the device Rx mode or Rx queue configuration to enable
+  shared Rx queue. Polling any member port of a shared Rx queue can return
+  packets of all ports in the group; the port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index a7c090ce79..b3a58d5e65 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d2b27c351f..a578c9db9d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1373,6 +1374,12 @@ struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * Rx queue is shared among ports in same switch domain to save memory,
+ * avoid polling each port. Any port in group can be used to receive packets.
+ * Real source port number saved in mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 2/8] ethdev: new API to aggregate shared Rx queue group
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 1/8] " Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-26 17:54     ` Ajit Khaparde
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 3/8] app/testpmd: dump port and queue info for each packet Xueming Li
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ray Kinsella

This patch introduces a new API to aggregate the ports that belong to the
same shared Rx queue group. Only queues with the specified share group are
aggregated. The new device is expected to support Rx burst and device close.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 lib/ethdev/ethdev_driver.h | 23 ++++++++++++++++++++++-
 lib/ethdev/rte_ethdev.c    | 22 ++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h    | 16 ++++++++++++++++
 lib/ethdev/version.map     |  3 +++
 4 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 524757cf6f..72156a4153 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -786,10 +786,28 @@ typedef int (*eth_get_monitor_addr_t)(void *rxq,
  * @return
  *   Negative errno value on error, number of info entries otherwise.
  */
-
 typedef int (*eth_representor_info_get_t)(struct rte_eth_dev *dev,
 	struct rte_eth_representor_info *info);
 
+/**
+ * @internal
+ * Aggregate shared Rx queue.
+ *
+ * Create a new port used for shared Rx queue polling.
+ *
+ * Only queues with specified share group are aggregated.
+ * At least Rx burst and device close should be supported.
+ *
+ * @param dev
+ *   Ethdev handle of port.
+ * @param group
+ *   Shared Rx queue group to aggregate.
+ * @return
+ *   UINT16_MAX if failed, otherwise aggregated port number.
+ */
+typedef int (*eth_shared_rxq_aggregate_t)(struct rte_eth_dev *dev,
+					  uint32_t group);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -950,6 +968,9 @@ struct eth_dev_ops {
 
 	eth_representor_info_get_t representor_info_get;
 	/**< Get representor info. */
+
+	eth_shared_rxq_aggregate_t shared_rxq_aggregate;
+	/**< Aggregate shared Rx queue. */
 };
 
 /**
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index b3a58d5e65..9f2ef58309 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -6301,6 +6301,28 @@ rte_eth_representor_info_get(uint16_t port_id,
 	return eth_err(port_id, (*dev->dev_ops->representor_info_get)(dev, info));
 }
 
+uint16_t
+rte_eth_shared_rxq_aggregate(uint16_t port_id, uint32_t group)
+{
+	struct rte_eth_dev *dev;
+	uint64_t offloads;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, UINT16_MAX);
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->shared_rxq_aggregate,
+				UINT16_MAX);
+
+	offloads = dev->data->dev_conf.rxmode.offloads;
+	if ((offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) == 0) {
+		RTE_ETHDEV_LOG(ERR, "port_id=%u doesn't support Rx offload\n",
+			       port_id);
+		return UINT16_MAX;
+	}
+
+	return (*dev->dev_ops->shared_rxq_aggregate)(dev, group);
+}
+
 RTE_LOG_REGISTER_DEFAULT(rte_eth_dev_logtype, INFO);
 
 RTE_INIT(ethdev_init_telemetry)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index a578c9db9d..f15d2142b2 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -4895,6 +4895,22 @@ __rte_experimental
 int rte_eth_representor_info_get(uint16_t port_id,
 				 struct rte_eth_representor_info *info);
 
+/**
+ * Aggregate shared Rx queue ports to one port for polling.
+ *
+ * Only queues with specified share group are aggregated.
+ * Any operation besides Rx burst and device close is unexpected.
+ *
+ * @param port_id
+ *   The port identifier of the device from shared Rx queue group.
+ * @param group
+ *   Shared Rx queue group to aggregate.
+ * @return
+ *   UINT16_MAX if failed, otherwise aggregated port number.
+ */
+__rte_experimental
+uint16_t rte_eth_shared_rxq_aggregate(uint16_t port_id, uint32_t group);
+
 #include <rte_ethdev_core.h>
 
 /**
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 3eece75b72..97a2233508 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -249,6 +249,9 @@ EXPERIMENTAL {
 	rte_mtr_meter_policy_delete;
 	rte_mtr_meter_policy_update;
 	rte_mtr_meter_policy_validate;
+
+	# added in 21.11
+	rte_eth_shared_rxq_aggregate;
 };
 
 INTERNAL {
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 3/8] app/testpmd: dump port and queue info for each packet
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 1/8] " Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 2/8] ethdev: new API to aggregate shared Rx queue group Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 4/8] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

With a shared Rx queue, the mbufs returned from one Rx burst may carry
different port numbers.

To support shared Rx queue, this patch dumps mbuf->port and queue for
each packet.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 14a9a251fb..b85fbf75a5 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,7 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ", mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 4/8] app/testpmd: new parameter to enable shared Rx queue
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                     ` (2 preceding siblings ...)
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 3/8] app/testpmd: dump port and queue info for each packet Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 5/8] app/testpmd: force shared Rx queue polled on same core Xueming Li
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

Add the "--rxq-share" parameter to enable shared Rx queue for each Rx queue.

By default, shared Rx queue group 0 is used; Rx queues in the same switch
domain share the same Rx queue according to queue index.

Shared Rx queue is enabled only if the device supports the offloading flag
RTE_ETH_RX_OFFLOAD_SHARED_RXQ.
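
The documented rule "group number grows per X ports" can be read as an
integer division of the port index by X. A minimal sketch under that
assumption. `share_group_for_port()` is a hypothetical helper, not a
testpmd function, and UINT32_MAX stands for "--rxq-share" given without a
value, as in the option parsing below:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: share group for a port, where rxq_share is the
 * number of ports per group (UINT32_MAX when no value was given).
 */
static uint32_t
share_group_for_port(uint16_t port_index, uint32_t rxq_share)
{
	assert(rxq_share > 0); /* 0 means the feature is disabled */
	return port_index / rxq_share;
}
```

With X = 2, ports 0 and 1 land in group 0 and ports 2 and 3 in group 1;
without a value, every port falls into group 0.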

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  6 +++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 18 ++++++++++++++++++
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  5 +++++
 5 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index f5765b34f7..8ec5f87ef3 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2707,7 +2707,11 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+				printf(" share group=%u",
+				       rx_conf->shared_group);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 3f94a82e32..de0f1d28cc 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,6 +167,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: number of ports per shared rxq groups\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -607,6 +608,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 97ae52e17e..417e92ade1 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -498,6 +498,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group, 0 to disable.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -1506,6 +1511,11 @@ init_config_port_offloads(portid_t pid, uint32_t socket_id)
 		port->dev_conf.txmode.offloads &=
 			~DEV_TX_OFFLOAD_MBUF_FAST_FREE;
 
+	if (rxq_share > 0 &&
+	    (port->dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
+		port->dev_conf.rxmode.offloads |=
+				RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+
 	/* Apply Rx offloads configuration */
 	for (i = 0; i < port->dev_info.max_rx_queues; i++)
 		port->rx_conf[i].offloads = port->dev_conf.rxmode.offloads;
@@ -3401,6 +3411,14 @@ rxtx_port_config(struct rte_port *port)
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.rx_offload_capa &
+		     RTE_ETH_RX_OFFLOAD_SHARED_RXQ)) {
+			offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+			port->rx_conf[qid].shared_group = nb_ports / rxq_share;
+		}
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5863b2f43f..3dfaaad94c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 640eadeff7..1b9f715608 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -389,6 +389,11 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create all queues in shared Rx queue mode if the device supports it.
+    The group number grows every X ports; group 0 is used if X is omitted.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 5/8] app/testpmd: force shared Rx queue polled on same core
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                     ` (3 preceding siblings ...)
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 4/8] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 6/8] app/testpmd: add common fwd wrapper Xueming Li
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

Shared Rx queues share one set of Rx queues of group zero. A shared Rx
queue must be polled from one core.

Check and stop forwarding if a shared Rx queue is scheduled on multiple
cores.
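
The check can be pictured as a pairwise comparison: two streams conflict
when they use the same switch domain, share group and queue index but run
on different lcores. The standalone sketch below models that idea with a
hypothetical `struct stream_sketch`; the code in this patch walks
testpmd's fwd_lcores and fwd_streams arrays instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a testpmd forwarding stream. */
struct stream_sketch {
	uint16_t domain_id;    /* switch domain of the Rx port */
	uint32_t shared_group; /* share group of the Rx queue */
	uint16_t rx_queue;     /* Rx queue index */
	unsigned int lcore;    /* lcore the stream is scheduled on */
	bool shared;           /* shared Rx queue offload enabled */
};

/* Return true if every shared Rx queue is polled from a single lcore. */
static bool
shared_rxq_single_core(const struct stream_sketch *s, size_t n)
{
	size_t i, j;

	assert(s != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!s[i].shared)
			continue;
		for (j = i + 1; j < n; j++) {
			if (!s[j].shared)
				continue;
			if (s[i].domain_id == s[j].domain_id &&
			    s[i].shared_group == s[j].shared_group &&
			    s[i].rx_queue == s[j].rx_queue &&
			    s[i].lcore != s[j].lcore)
				return false; /* same queue, two lcores */
		}
	}
	return true;
}
```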

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 96 ++++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |  4 +-
 app/test-pmd/testpmd.h |  2 +
 3 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 8ec5f87ef3..035247c33f 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2883,6 +2883,102 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared rxq is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, portid_t src_port,
+			   queueid_t src_rxq, lcoreid_t src_lc,
+			   uint32_t shared_group)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (fs->rx_queue != src_rxq)
+				continue;
+			if (rxq_conf->shared_group != shared_group)
+				continue;
+			printf("Shared RX queue group %u can't be scheduled on different cores:\n",
+			       shared_group);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("  please use --nb-cores=%hu to limit forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, fs->rx_port,
+						       fs->rx_queue, lc_id,
+						       rxq_conf->shared_group))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 417e92ade1..cab4b36b04 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2241,10 +2241,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3dfaaad94c..f121a2da90 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -795,6 +796,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 6/8] app/testpmd: add common fwd wrapper
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                     ` (4 preceding siblings ...)
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 5/8] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-17 11:24     ` Jerin Jacob
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 7/8] app/testpmd: improve forwarding cache miss Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 8/8] app/testpmd: support shared Rx queue forwarding Xueming Li
  7 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: Xiaoyu Min, xuemingl, Jerin Jacob, Ferruh Yigit,
	Andrew Rybchenko, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add a common forwarding wrapper for all fwd engines, which does the
following for each of them:

- record core cycles
- call rte_eth_rx_burst(...,nb_pkt_per_burst)
- update received packets
- handle received mbufs with callback function

For better performance, the wrapper is defined as a macro.
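
A rough standalone sketch of the idea: a macro stamps out the shared
receive boilerplate around an engine-specific callback, so the callback
is called directly rather than through a function pointer. Everything
here is a stand-in invented for illustration (`rx_burst_stub()` replaces
rte_eth_rx_burst(), cycle recording is omitted), and the real macro added
to testpmd.h may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BURST_SIZE 32 /* illustrative burst size */

struct pkt_sketch { uint16_t port; };
struct fwd_stream_sketch { uint64_t rx_packets; uint64_t handled; };

/* Stand-in for rte_eth_rx_burst(): pretends up to 3 packets arrived. */
static uint16_t
rx_burst_stub(struct pkt_sketch **pkts, uint16_t max)
{
	static struct pkt_sketch pool[3];
	uint16_t i, n = max < 3 ? max : 3;

	for (i = 0; i < n; i++)
		pkts[i] = &pool[i];
	return n;
}

/* Define pkt_burst_fwd() as receive + bookkeeping + engine callback. */
#define PKT_BURST_FWD_SKETCH(cb)				\
static void							\
pkt_burst_fwd(struct fwd_stream_sketch *fs)			\
{								\
	struct pkt_sketch *pkts[BURST_SIZE];			\
	uint16_t nb_rx = rx_burst_stub(pkts, BURST_SIZE);	\
								\
	if (nb_rx == 0)						\
		return;						\
	fs->rx_packets += nb_rx;				\
	cb(fs, nb_rx, pkts);					\
}

/* Example engine callback: only counts the packets it was handed. */
static void
count_stream(struct fwd_stream_sketch *fs, uint16_t nb_rx,
	     struct pkt_sketch **pkts)
{
	assert(pkts != NULL);
	fs->handled += nb_rx;
}

PKT_BURST_FWD_SKETCH(count_stream)
```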

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/5tswap.c   | 25 +++++--------------------
 app/test-pmd/csumonly.c | 25 ++++++-------------------
 app/test-pmd/flowgen.c  | 20 +++++---------------
 app/test-pmd/icmpecho.c | 30 ++++++++----------------------
 app/test-pmd/iofwd.c    | 24 +++++-------------------
 app/test-pmd/macfwd.c   | 24 +++++-------------------
 app/test-pmd/macswap.c  | 23 +++++------------------
 app/test-pmd/rxonly.c   | 32 ++++++++------------------------
 app/test-pmd/testpmd.h  | 19 +++++++++++++++++++
 9 files changed, 66 insertions(+), 156 deletions(-)

diff --git a/app/test-pmd/5tswap.c b/app/test-pmd/5tswap.c
index e8cef9623b..8fe940294f 100644
--- a/app/test-pmd/5tswap.c
+++ b/app/test-pmd/5tswap.c
@@ -82,18 +82,16 @@ swap_udp(struct rte_udp_hdr *udp_hdr)
  * Parses each layer and swaps it. When the next layer doesn't match it stops.
  */
 static void
-pkt_burst_5tuple_swap(struct fwd_stream *fs)
+_5tuple_swap_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
 	struct rte_mbuf *mb;
 	uint16_t next_proto;
 	uint64_t ol_flags;
 	uint16_t proto;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-
 	int i;
 	union {
 		struct rte_ether_hdr *eth;
@@ -105,20 +103,6 @@ pkt_burst_5tuple_swap(struct fwd_stream *fs)
 		uint8_t *byte;
 	} h;
 
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 	ol_flags = ol_flags_init(txp->dev_conf.txmode.offloads);
 	vlan_qinq_set(pkts_burst, nb_rx, ol_flags,
@@ -182,12 +166,13 @@ pkt_burst_5tuple_swap(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(_5tuple_swap_stream);
+
 struct fwd_engine five_tuple_swap_fwd_engine = {
 	.fwd_mode_name  = "5tswap",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_5tuple_swap,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 38cc256533..9bfc7d10dc 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -763,7 +763,7 @@ pkt_copy_split(const struct rte_mbuf *pkt)
 }
 
 /*
- * Receive a burst of packets, and for each packet:
+ * For each packet in received mbuf:
  *  - parse packet, and try to recognize a supported packet type (1)
  *  - if it's not a supported packet type, don't touch the packet, else:
  *  - reprocess the checksum of all supported layers. This is done in SW
@@ -792,9 +792,9 @@ pkt_copy_split(const struct rte_mbuf *pkt)
  * OUTER_IP is only useful for tunnel packets.
  */
 static void
-pkt_burst_checksum_forward(struct fwd_stream *fs)
+checksum_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
 	struct rte_mbuf *gso_segments[GSO_MAX_PKT_BURST];
 	struct rte_gso_ctx *gso_ctx;
 	struct rte_mbuf **tx_pkts_burst;
@@ -805,7 +805,6 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	void **gro_ctx;
 	uint16_t gro_pkts_num;
 	uint8_t gro_enable;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_prep;
 	uint16_t i;
@@ -820,18 +819,6 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	uint16_t nb_segments = 0;
 	int ret;
 
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/* receive a burst of packet */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	rx_bad_ip_csum = 0;
 	rx_bad_l4_csum = 0;
 	rx_bad_outer_l4_csum = 0;
@@ -1138,13 +1125,13 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(tx_pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(checksum_forward_stream);
+
 struct fwd_engine csum_fwd_engine = {
 	.fwd_mode_name  = "csum",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_checksum_forward,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/flowgen.c b/app/test-pmd/flowgen.c
index 0d3664a64d..aa45948b4c 100644
--- a/app/test-pmd/flowgen.c
+++ b/app/test-pmd/flowgen.c
@@ -61,10 +61,10 @@ RTE_DEFINE_PER_LCORE(int, _next_flow);
  * still do so in order to maintain traffic statistics.
  */
 static void
-pkt_burst_flow_gen(struct fwd_stream *fs)
+flow_gen_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
 	unsigned pkt_size = tx_pkt_length - 4;	/* Adjust FCS */
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_mempool *mbp;
 	struct rte_mbuf  *pkt = NULL;
 	struct rte_ether_hdr *eth_hdr;
@@ -72,7 +72,6 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 	struct rte_udp_hdr *udp_hdr;
 	uint16_t vlan_tci, vlan_tci_outer;
 	uint64_t ol_flags = 0;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_dropped;
 	uint16_t nb_pkt;
@@ -80,17 +79,9 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 	uint16_t i;
 	uint32_t retry;
 	uint64_t tx_offloads;
-	uint64_t start_tsc = 0;
 	int next_flow = RTE_PER_LCORE(_next_flow);
 
-	get_start_cycles(&start_tsc);
-
-	/* Receive a burst of packets and discard them. */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
-	fs->rx_packets += nb_rx;
-
 	for (i = 0; i < nb_rx; i++)
 		rte_pktmbuf_free(pkts_burst[i]);
 
@@ -195,12 +186,11 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_pkt);
 	}
-
 	RTE_PER_LCORE(_next_flow) = next_flow;
-
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(flow_gen_stream);
+
 static void
 flowgen_begin(portid_t pi)
 {
@@ -211,5 +201,5 @@ struct fwd_engine flow_gen_engine = {
 	.fwd_mode_name  = "flowgen",
 	.port_fwd_begin = flowgen_begin,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_flow_gen,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/icmpecho.c b/app/test-pmd/icmpecho.c
index 8948f28eb5..467ba330aa 100644
--- a/app/test-pmd/icmpecho.c
+++ b/app/test-pmd/icmpecho.c
@@ -267,13 +267,13 @@ ipv4_hdr_cksum(struct rte_ipv4_hdr *ip_h)
 	(((rte_be_to_cpu_32((ipv4_addr)) >> 24) & 0x000000FF) == 0xE0)
 
 /*
- * Receive a burst of packets, lookup for ICMP echo requests, and, if any,
- * send back ICMP echo replies.
+ * Lookup for ICMP echo requests in received mbuf and, if any,
+ * send back ICMP echo replies to corresponding Tx port.
  */
 static void
-reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
+reply_to_icmp_echo_rqsts_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
 	struct rte_mbuf *pkt;
 	struct rte_ether_hdr *eth_h;
 	struct rte_vlan_hdr *vlan_h;
@@ -283,7 +283,6 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 	struct rte_ether_addr eth_addr;
 	uint32_t retry;
 	uint32_t ip_addr;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_replies;
 	uint16_t eth_type;
@@ -291,22 +290,9 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 	uint16_t arp_op;
 	uint16_t arp_pro;
 	uint32_t cksum;
-	uint8_t  i;
+	uint16_t  i;
 	int l2_len;
-	uint64_t start_tsc = 0;
 
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * First, receive a burst of packets.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	nb_replies = 0;
 	for (i = 0; i < nb_rx; i++) {
 		if (likely(i < nb_rx - 1))
@@ -509,13 +495,13 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 			} while (++nb_tx < nb_replies);
 		}
 	}
-
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(reply_to_icmp_echo_rqsts_stream);
+
 struct fwd_engine icmp_echo_engine = {
 	.fwd_mode_name  = "icmpecho",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = reply_to_icmp_echo_rqsts,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/iofwd.c b/app/test-pmd/iofwd.c
index 83d098adcb..dbd78167b4 100644
--- a/app/test-pmd/iofwd.c
+++ b/app/test-pmd/iofwd.c
@@ -44,25 +44,11 @@
  * to packets data.
  */
 static void
-pkt_burst_io_forward(struct fwd_stream *fs)
+io_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		  struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
-			pkts_burst, nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-	fs->rx_packets += nb_rx;
 
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue,
 			pkts_burst, nb_rx);
@@ -85,13 +71,13 @@ pkt_burst_io_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(io_forward_stream);
+
 struct fwd_engine io_fwd_engine = {
 	.fwd_mode_name  = "io",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_io_forward,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 0568ea794d..b0728c7597 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -44,32 +44,18 @@
  * before forwarding them.
  */
 static void
-pkt_burst_mac_forward(struct fwd_stream *fs)
+mac_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
 	struct rte_mbuf  *mb;
 	struct rte_ether_hdr *eth_hdr;
 	uint32_t retry;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t i;
 	uint64_t ol_flags = 0;
 	uint64_t tx_offloads;
-	uint64_t start_tsc = 0;
 
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 	tx_offloads = txp->dev_conf.txmode.offloads;
 	if (tx_offloads	& DEV_TX_OFFLOAD_VLAN_INSERT)
@@ -116,13 +102,13 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(mac_forward_stream);
+
 struct fwd_engine mac_fwd_engine = {
 	.fwd_mode_name  = "mac",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_mac_forward,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index 310bca06af..cc208944d7 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -50,27 +50,13 @@
  * addresses of packets before forwarding them.
  */
 static void
-pkt_burst_mac_swap(struct fwd_stream *fs)
+mac_swap_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-	uint64_t start_tsc = 0;
 
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 
 	do_macswap(pkts_burst, nb_rx, txp);
@@ -95,12 +81,13 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(mac_swap_stream);
+
 struct fwd_engine mac_swap_engine = {
 	.fwd_mode_name  = "macswap",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_mac_swap,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index c78fc4609a..a7354596b5 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -41,37 +41,21 @@
 #include "testpmd.h"
 
 /*
- * Received a burst of packets.
+ * Process a burst of received packets from same stream.
  */
 static void
-pkt_burst_receive(struct fwd_stream *fs)
+rxonly_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		      struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
-	uint16_t nb_rx;
-	uint16_t i;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
-	for (i = 0; i < nb_rx; i++)
-		rte_pktmbuf_free(pkts_burst[i]);
-
-	get_end_cycles(fs, start_tsc);
+	RTE_SET_USED(fs);
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
 }
 
+PKT_BURST_FWD(rxonly_forward_stream)
+
 struct fwd_engine rx_only_engine = {
 	.fwd_mode_name  = "rxonly",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_receive,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f121a2da90..4792bef03b 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1028,6 +1028,25 @@ void add_tx_dynf_callback(portid_t portid);
 void remove_tx_dynf_callback(portid_t portid);
 int update_jumbo_frame_offload(portid_t portid);
 
+#define PKT_BURST_FWD(cb)                                       \
+static void                                                     \
+pkt_burst_fwd(struct fwd_stream *fs)                            \
+{                                                               \
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];          \
+	uint16_t nb_rx;                                         \
+	uint64_t start_tsc = 0;                                 \
+								\
+	get_start_cycles(&start_tsc);                           \
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,     \
+			pkts_burst, nb_pkt_per_burst);          \
+	inc_rx_burst_stats(fs, nb_rx);                          \
+	if (unlikely(nb_rx == 0))                               \
+		return;                                         \
+	fs->rx_packets += nb_rx;                                \
+	cb(fs, nb_rx, pkts_burst);                              \
+	get_end_cycles(fs, start_tsc);                          \
+}
+
 /*
  * Work-around of a compilation error with ICC on invocations of the
  * rte_be_to_cpu_16() function.
-- 
2.33.0



* [dpdk-dev] [PATCH v3 7/8] app/testpmd: improve forwarding cache miss
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                     ` (5 preceding siblings ...)
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 6/8] app/testpmd: add common fwd wrapper Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 8/8] app/testpmd: support shared Rx queue forwarding Xueming Li
  7 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

To reduce cache misses, add the flags and burst size used in forwarding to
the stream, and replace tests of global conditions in the forwarding path
with tests of per-stream flags.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c    | 18 ++++++++++++++----
 app/test-pmd/flowgen.c   |  6 +++---
 app/test-pmd/noisy_vnf.c |  2 +-
 app/test-pmd/testpmd.h   | 21 ++++++++++++---------
 app/test-pmd/txonly.c    |  8 ++++----
 5 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 035247c33f..5cdf8fa082 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -3050,6 +3050,16 @@ fwd_topology_tx_port_get(portid_t rxp)
 	}
 }
 
+static void
+fwd_stream_set_common(struct fwd_stream *fs)
+{
+	fs->nb_pkt_per_burst = nb_pkt_per_burst;
+	fs->record_burst_stats = !!record_burst_stats;
+	fs->record_core_cycles = !!record_core_cycles;
+	fs->retry_enabled = !!retry_enabled;
+	fs->rxq_share = !!rxq_share;
+}
+
 static void
 simple_fwd_config_setup(void)
 {
@@ -3079,7 +3089,7 @@ simple_fwd_config_setup(void)
 				fwd_ports_ids[fwd_topology_tx_port_get(i)];
 		fwd_streams[i]->tx_queue  = 0;
 		fwd_streams[i]->peer_addr = fwd_streams[i]->tx_port;
-		fwd_streams[i]->retry_enabled = retry_enabled;
+		fwd_stream_set_common(fwd_streams[i]);
 	}
 }
 
@@ -3140,7 +3150,7 @@ rss_fwd_config_setup(void)
 		fs->tx_port = fwd_ports_ids[txp];
 		fs->tx_queue = rxq;
 		fs->peer_addr = fs->tx_port;
-		fs->retry_enabled = retry_enabled;
+		fwd_stream_set_common(fs);
 		rxp++;
 		if (rxp < nb_fwd_ports)
 			continue;
@@ -3255,7 +3265,7 @@ dcb_fwd_config_setup(void)
 				fs->tx_port = fwd_ports_ids[txp];
 				fs->tx_queue = txq + j % nb_tx_queue;
 				fs->peer_addr = fs->tx_port;
-				fs->retry_enabled = retry_enabled;
+				fwd_stream_set_common(fs);
 			}
 			fwd_lcores[lc_id]->stream_nb +=
 				rxp_dcb_info.tc_queue.tc_rxq[i][tc].nb_queue;
@@ -3326,7 +3336,7 @@ icmp_echo_config_setup(void)
 			fs->tx_port = fs->rx_port;
 			fs->tx_queue = rxq;
 			fs->peer_addr = fs->tx_port;
-			fs->retry_enabled = retry_enabled;
+			fwd_stream_set_common(fs);
 			if (verbose_level > 0)
 				printf("  stream=%d port=%d rxq=%d txq=%d\n",
 				       sm_id, fs->rx_port, fs->rx_queue,
diff --git a/app/test-pmd/flowgen.c b/app/test-pmd/flowgen.c
index aa45948b4c..c282f3bcb1 100644
--- a/app/test-pmd/flowgen.c
+++ b/app/test-pmd/flowgen.c
@@ -97,12 +97,12 @@ flow_gen_stream(struct fwd_stream *fs, uint16_t nb_rx,
 	if (tx_offloads	& DEV_TX_OFFLOAD_MACSEC_INSERT)
 		ol_flags |= PKT_TX_MACSEC;
 
-	for (nb_pkt = 0; nb_pkt < nb_pkt_per_burst; nb_pkt++) {
+	for (nb_pkt = 0; nb_pkt < fs->nb_pkt_per_burst; nb_pkt++) {
 		if (!nb_pkt || !nb_clones) {
 			nb_clones = nb_pkt_flowgen_clones;
 			/* Logic limitation */
-			if (nb_clones > nb_pkt_per_burst)
-				nb_clones = nb_pkt_per_burst;
+			if (nb_clones > fs->nb_pkt_per_burst)
+				nb_clones = fs->nb_pkt_per_burst;
 
 			pkt = rte_mbuf_raw_alloc(mbp);
 			if (!pkt)
diff --git a/app/test-pmd/noisy_vnf.c b/app/test-pmd/noisy_vnf.c
index 382a4c2aae..56bf6a4e70 100644
--- a/app/test-pmd/noisy_vnf.c
+++ b/app/test-pmd/noisy_vnf.c
@@ -153,7 +153,7 @@ pkt_burst_noisy_vnf(struct fwd_stream *fs)
 	uint64_t now;
 
 	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
-			pkts_burst, nb_pkt_per_burst);
+			pkts_burst, fs->nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
 	if (unlikely(nb_rx == 0))
 		goto flush;
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 4792bef03b..3b8796a7a5 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -128,12 +128,17 @@ struct fwd_stream {
 	queueid_t  tx_queue;  /**< TX queue to send forwarded packets */
 	streamid_t peer_addr; /**< index of peer ethernet address of packets */
 
-	unsigned int retry_enabled;
+	uint16_t nb_pkt_per_burst;
+	unsigned int record_burst_stats:1;
+	unsigned int record_core_cycles:1;
+	unsigned int retry_enabled:1;
+	unsigned int rxq_share:1;
 
 	/* "read-write" results */
 	uint64_t rx_packets;  /**< received packets */
 	uint64_t tx_packets;  /**< received packets transmitted */
 	uint64_t fwd_dropped; /**< received packets not forwarded */
+	uint64_t core_cycles; /**< used for RX and TX processing */
 	uint64_t rx_bad_ip_csum ; /**< received packets has bad ip checksum */
 	uint64_t rx_bad_l4_csum ; /**< received packets has bad l4 checksum */
 	uint64_t rx_bad_outer_l4_csum;
@@ -141,7 +146,6 @@ struct fwd_stream {
 	uint64_t rx_bad_outer_ip_csum;
 	/**< received packets having bad outer ip checksum */
 	unsigned int gro_times;	/**< GRO operation times */
-	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
 	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
@@ -750,28 +754,27 @@ port_pci_reg_write(struct rte_port *port, uint32_t reg_off, uint32_t reg_v)
 static inline void
 get_start_cycles(uint64_t *start_tsc)
 {
-	if (record_core_cycles)
-		*start_tsc = rte_rdtsc();
+	*start_tsc = rte_rdtsc();
 }
 
 static inline void
 get_end_cycles(struct fwd_stream *fs, uint64_t start_tsc)
 {
-	if (record_core_cycles)
+	if (unlikely(fs->record_core_cycles))
 		fs->core_cycles += rte_rdtsc() - start_tsc;
 }
 
 static inline void
 inc_rx_burst_stats(struct fwd_stream *fs, uint16_t nb_rx)
 {
-	if (record_burst_stats)
+	if (unlikely(fs->record_burst_stats))
 		fs->rx_burst_stats.pkt_burst_spread[nb_rx]++;
 }
 
 static inline void
 inc_tx_burst_stats(struct fwd_stream *fs, uint16_t nb_tx)
 {
-	if (record_burst_stats)
+	if (unlikely(fs->record_burst_stats))
 		fs->tx_burst_stats.pkt_burst_spread[nb_tx]++;
 }
 
@@ -1032,13 +1035,13 @@ int update_jumbo_frame_offload(portid_t portid);
 static void                                                     \
 pkt_burst_fwd(struct fwd_stream *fs)                            \
 {                                                               \
-	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];          \
+	struct rte_mbuf *pkts_burst[fs->nb_pkt_per_burst];      \
 	uint16_t nb_rx;                                         \
 	uint64_t start_tsc = 0;                                 \
 								\
 	get_start_cycles(&start_tsc);                           \
 	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,     \
-			pkts_burst, nb_pkt_per_burst);          \
+			pkts_burst, fs->nb_pkt_per_burst);      \
 	inc_rx_burst_stats(fs, nb_rx);                          \
 	if (unlikely(nb_rx == 0))                               \
 		return;                                         \
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index aed820f5d3..db6130421c 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -367,8 +367,8 @@ pkt_burst_transmit(struct fwd_stream *fs)
 	eth_hdr.ether_type = rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4);
 
 	if (rte_mempool_get_bulk(mbp, (void **)pkts_burst,
-				nb_pkt_per_burst) == 0) {
-		for (nb_pkt = 0; nb_pkt < nb_pkt_per_burst; nb_pkt++) {
+				fs->nb_pkt_per_burst) == 0) {
+		for (nb_pkt = 0; nb_pkt < fs->nb_pkt_per_burst; nb_pkt++) {
 			if (unlikely(!pkt_burst_prepare(pkts_burst[nb_pkt], mbp,
 							&eth_hdr, vlan_tci,
 							vlan_tci_outer,
@@ -376,12 +376,12 @@ pkt_burst_transmit(struct fwd_stream *fs)
 							nb_pkt, fs))) {
 				rte_mempool_put_bulk(mbp,
 						(void **)&pkts_burst[nb_pkt],
-						nb_pkt_per_burst - nb_pkt);
+						fs->nb_pkt_per_burst - nb_pkt);
 				break;
 			}
 		}
 	} else {
-		for (nb_pkt = 0; nb_pkt < nb_pkt_per_burst; nb_pkt++) {
+		for (nb_pkt = 0; nb_pkt < fs->nb_pkt_per_burst; nb_pkt++) {
 			pkt = rte_mbuf_raw_alloc(mbp);
 			if (pkt == NULL)
 				break;
-- 
2.33.0



* [dpdk-dev] [PATCH v3 8/8] app/testpmd: support shared Rx queue forwarding
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                     ` (6 preceding siblings ...)
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 7/8] app/testpmd: improve forwarding cache miss Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  7 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

When shared Rx queue is enabled, polling one queue returns packets from all
member ports of the same shared Rx queue.

This patch adds a common forwarding function for shared Rx queue. It finds
the source forwarding stream of each packet by searching the streams local
to the current lcore for a match on the packet source port (mbuf->port) and
queue, then invokes a callback to handle the received packets of each
source stream.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/ieee1588fwd.c | 30 +++++++++++------
 app/test-pmd/testpmd.c     | 69 ++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.h     |  9 ++++-
 3 files changed, 97 insertions(+), 11 deletions(-)

diff --git a/app/test-pmd/ieee1588fwd.c b/app/test-pmd/ieee1588fwd.c
index 034f238c34..0151d6de74 100644
--- a/app/test-pmd/ieee1588fwd.c
+++ b/app/test-pmd/ieee1588fwd.c
@@ -90,23 +90,17 @@ port_ieee1588_tx_timestamp_check(portid_t pi)
 }
 
 static void
-ieee1588_packet_fwd(struct fwd_stream *fs)
+ieee1588_fwd_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkt)
 {
-	struct rte_mbuf  *mb;
+	struct rte_mbuf *mb = (*pkt);
 	struct rte_ether_hdr *eth_hdr;
 	struct rte_ether_addr addr;
 	struct ptpv2_msg *ptp_hdr;
 	uint16_t eth_type;
 	uint32_t timesync_index;
 
-	/*
-	 * Receive 1 packet at a time.
-	 */
-	if (rte_eth_rx_burst(fs->rx_port, fs->rx_queue, &mb, 1) == 0)
-		return;
-
-	fs->rx_packets += 1;
-
+	RTE_SET_USED(nb_rx);
 	/*
 	 * Check that the received packet is a PTP packet that was detected
 	 * by the hardware.
@@ -198,6 +192,22 @@ ieee1588_packet_fwd(struct fwd_stream *fs)
 	port_ieee1588_tx_timestamp_check(fs->rx_port);
 }
 
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+ieee1588_packet_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *mb;
+
+	if (rte_eth_rx_burst(fs->rx_port, fs->rx_queue, &mb, 1) == 0)
+		return;
+	if (unlikely(fs->rxq_share > 0))
+		forward_shared_rxq(fs, 1, &mb, ieee1588_fwd_stream);
+	else
+		ieee1588_fwd_stream(fs, 1, &mb);
+}
+
 static void
 port_ieee1588_fwd_begin(portid_t pi)
 {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index cab4b36b04..1d82397831 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2106,6 +2106,75 @@ flush_fwd_rx_queues(void)
 	}
 }
 
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_by_port(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		struct rte_mbuf **pkts, packet_fwd_cb fwd)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		fwd(fs, nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * The source port of each packet is identified by mbuf->port.
+ */
+void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)
+{
+	uint16_t i, nb_fs_rx = 1, port;
+
+	/* Locate real source fs according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i + 1 == nb_rx || pkts_burst[i + 1]->port != port) {
+			/* Forward packets with same source port. */
+			forward_by_port(fs, port, nb_fs_rx,
+					&pkts_burst[i + 1 - nb_fs_rx], fwd);
+			nb_fs_rx = 1;
+		} else {
+			nb_fs_rx++;
+		}
+	}
+}
+
 static void
 run_pkt_fwd_on_lcore(struct fwd_lcore *fc, packet_fwd_t pkt_fwd)
 {
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3b8796a7a5..7869f61f74 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -276,6 +276,8 @@ struct fwd_lcore {
 typedef void (*port_fwd_begin_t)(portid_t pi);
 typedef void (*port_fwd_end_t)(portid_t pi);
 typedef void (*packet_fwd_t)(struct fwd_stream *fs);
+typedef void (*packet_fwd_cb)(struct fwd_stream *fs, uint16_t nb_rx,
+			      struct rte_mbuf **pkts);
 
 struct fwd_engine {
 	const char       *fwd_mode_name; /**< Forwarding mode name. */
@@ -910,6 +912,8 @@ char *list_pkt_forwarding_modes(void);
 char *list_pkt_forwarding_retry_modes(void);
 void set_pkt_forwarding_mode(const char *fwd_mode);
 void start_packet_forwarding(int with_tx_first);
+void forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+			struct rte_mbuf **pkts_burst, packet_fwd_cb fwd);
 void fwd_stats_display(void);
 void fwd_stats_reset(void);
 void stop_packet_forwarding(void);
@@ -1046,7 +1050,10 @@ pkt_burst_fwd(struct fwd_stream *fs)                            \
 	if (unlikely(nb_rx == 0))                               \
 		return;                                         \
 	fs->rx_packets += nb_rx;                                \
-	cb(fs, nb_rx, pkts_burst);                              \
+	if (fs->rxq_share)                                      \
+		forward_shared_rxq(fs, nb_rx, pkts_burst, cb);  \
+	else                                                    \
+		cb(fs, nb_rx, pkts_burst);                      \
 	get_end_cycles(fs, start_tsc);                          \
 }
 
-- 
2.33.0
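
The grouping step of forward_shared_rxq() in the patch above can be sketched
without DPDK. This is an illustration under assumptions: struct pkt stands in
for struct rte_mbuf, and dispatch_by_port(), run_cb, struct run_log and
log_run() are hypothetical names. Unlike the patch, which passes
pkts_burst[i + 1] to rte_prefetch0() on every iteration and so appears to read
one slot past the last received packet on the final one, the sketch looks at
the next element only when it exists.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only the source port matters. */
struct pkt {
	uint16_t port;
};

typedef void (*run_cb)(uint16_t port, uint16_t nb, struct pkt **pkts,
		       void *arg);

/*
 * Split a burst into maximal runs of consecutive packets sharing the same
 * source port and hand each run to cb. The next element is inspected only
 * when i + 1 < nb_rx, so nothing past the end of the burst is read.
 */
static void
dispatch_by_port(struct pkt **burst, uint16_t nb_rx, run_cb cb, void *arg)
{
	uint16_t i, start = 0;

	for (i = 0; i < nb_rx; i++) {
		int last = (i + 1 == nb_rx);

		if (last || burst[i + 1]->port != burst[i]->port) {
			cb(burst[i]->port, (uint16_t)(i + 1 - start),
			   &burst[start], arg);
			start = (uint16_t)(i + 1);
		}
	}
}

/* Example callback: record each run so the grouping can be inspected. */
struct run_log {
	uint16_t port[8];
	uint16_t len[8];
	unsigned int n;
};

static void
log_run(uint16_t port, uint16_t nb, struct pkt **pkts, void *arg)
{
	struct run_log *rl = arg;

	(void)pkts;
	assert(rl->n < 8);
	rl->port[rl->n] = port;
	rl->len[rl->n] = nb;
	rl->n++;
}
```

A burst whose source ports are 1, 1, 2, 2, 2, 1 is delivered as three runs of
length 2, 3 and 1. Packets from the same port that are not adjacent in the
burst end up in separate callback invocations, as in the patch.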



* Re: [dpdk-dev] [PATCH v3 6/8] app/testpmd: add common fwd wrapper
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 6/8] app/testpmd: add common fwd wrapper Xueming Li
@ 2021-09-17 11:24     ` Jerin Jacob
  0 siblings, 0 replies; 266+ messages in thread
From: Jerin Jacob @ 2021-09-17 11:24 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Xiaoyu Min, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

On Fri, Sep 17, 2021 at 1:33 PM Xueming Li <xuemingl@nvidia.com> wrote:
>
> From: Xiaoyu Min <jackmin@nvidia.com>
>
> Added common forwarding wrapper function for all fwd engines
> which do the following in common:
>
> - record core cycles
> - call rte_eth_rx_burst(...,nb_pkt_per_burst)
> - update received packets
> - handle received mbufs with callback function
>
> For better performance, the function is defined as macro.
>
> Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  app/test-pmd/5tswap.c   | 25 +++++--------------------
>  app/test-pmd/csumonly.c | 25 ++++++-------------------
>  app/test-pmd/flowgen.c  | 20 +++++---------------
>  app/test-pmd/icmpecho.c | 30 ++++++++----------------------
>  app/test-pmd/iofwd.c    | 24 +++++-------------------
>  app/test-pmd/macfwd.c   | 24 +++++-------------------
>  app/test-pmd/macswap.c  | 23 +++++------------------
>  app/test-pmd/rxonly.c   | 32 ++++++++------------------------
>  app/test-pmd/testpmd.h  | 19 +++++++++++++++++++
>  9 files changed, 66 insertions(+), 156 deletions(-)
>
> diff --git a/app/test-pmd/5tswap.c b/app/test-pmd/5tswap.c
> index e8cef9623b..8fe940294f 100644
> --- a/app/test-pmd/5tswap.c
> +++ b/app/test-pmd/5tswap.c
> @@ -82,18 +82,16 @@ swap_udp(struct rte_udp_hdr *udp_hdr)
>   * Parses each layer and swaps it. When the next layer doesn't match it stops.
>   */

> +PKT_BURST_FWD(_5tuple_swap_stream);

Please make _5tuple_swap_stream (the "cb" argument) an inline function, to
make sure the compiler doesn't generate yet another function pointer.

>  struct fwd_engine mac_swap_engine = {
>         .fwd_mode_name  = "macswap",
>         .port_fwd_begin = NULL,
>         .port_fwd_end   = NULL,
> -       .packet_fwd     = pkt_burst_mac_swap,

See below

> +       .packet_fwd     = pkt_burst_fwd,
>
> +#define PKT_BURST_FWD(cb)                                       \

The macro could probably take a prefix too, like PKT_BURST_FWD(cb, prefix),
to generate a uniquely named function, e.g.
PKT_BURST_FWD(_5tuple_swap_stream, mac_swap), for better readability
and to avoid the diff in the section above.


> +static void                                                     \
> +pkt_burst_fwd(struct fwd_stream *fs)                            \

pkt_burst_fwd##prefix(struct fwd_stream *fs)

> +{                                                               \
> +       struct rte_mbuf *pkts_burst[nb_pkt_per_burst];          \
> +       uint16_t nb_rx;                                         \
> +       uint64_t start_tsc = 0;                                 \
> +                                                               \
> +       get_start_cycles(&start_tsc);                           \
> +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,     \
> +                       pkts_burst, nb_pkt_per_burst);          \
> +       inc_rx_burst_stats(fs, nb_rx);                          \
> +       if (unlikely(nb_rx == 0))                               \
> +               return;                                         \
> +       fs->rx_packets += nb_rx;                                \
> +       cb(fs, nb_rx, pkts_burst);                              \
> +       get_end_cycles(fs, start_tsc);                          \
> +}
> +
>  /*
>   * Work-around of a compilation error with ICC on invocations of the
>   * rte_be_to_cpu_16() function.
> --
> 2.33.0
>
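
The two review suggestions in this reply (an inline callback and a name prefix
on the generated wrapper) can be sketched together. This is a mock-up under
assumptions, not testpmd code: struct pkt, struct stream and mock_rx_burst()
are hypothetical stand-ins for struct rte_mbuf, struct fwd_stream and
rte_eth_rx_burst(), and the underscore in pkt_burst_fwd_##prefix is a choice
made here rather than something taken from the review.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BURST 32

/* Hypothetical stand-ins for struct rte_mbuf and struct fwd_stream. */
struct pkt {
	uint16_t port;
};

struct stream {
	uint16_t nb_pkt_per_burst;
	uint64_t rx_packets;
	uint64_t handled;
};

/* Mock receive: pretend the queue always delivers a full burst. */
static uint16_t
mock_rx_burst(struct stream *s, struct pkt **pkts, uint16_t n)
{
	static struct pkt dummy;
	uint16_t i;

	(void)s;
	for (i = 0; i < n; i++)
		pkts[i] = &dummy;
	return n;
}

/*
 * Generate one uniquely named wrapper per engine. cb is pasted in as a
 * direct call to a static inline function, so the compiler is free to
 * inline it instead of calling through a pointer.
 */
#define PKT_BURST_FWD(cb, prefix)                                \
static void                                                      \
pkt_burst_fwd_##prefix(struct stream *s)                         \
{                                                                \
	struct pkt *pkts[MAX_BURST];                             \
	uint16_t nb_rx;                                          \
                                                                 \
	assert(s->nb_pkt_per_burst <= MAX_BURST);                \
	nb_rx = mock_rx_burst(s, pkts, s->nb_pkt_per_burst);     \
	if (nb_rx == 0)                                          \
		return;                                          \
	s->rx_packets += nb_rx;                                  \
	cb(s, nb_rx, pkts);                                      \
}

static inline void
mac_swap_stream(struct stream *s, uint16_t nb_rx, struct pkt **pkts)
{
	(void)pkts;
	s->handled += nb_rx;
}

static inline void
rxonly_stream(struct stream *s, uint16_t nb_rx, struct pkt **pkts)
{
	(void)s;
	(void)nb_rx;
	(void)pkts;
}

PKT_BURST_FWD(mac_swap_stream, mac_swap)
PKT_BURST_FWD(rxonly_stream, rxonly)
```

Each engine would then set .packet_fwd to its own wrapper, for example
pkt_burst_fwd_mac_swap, so the wrapper name identifies the engine in a
debugger or profiler.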


* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11 12:04           ` Ferruh Yigit
  2021-08-11 12:59             ` Xueming(Steven) Li
@ 2021-09-26  5:35             ` Xueming(Steven) Li
  2021-09-28  9:35               ` Jerin Jacob
  1 sibling, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-26  5:35 UTC (permalink / raw)
  To: jerinjacobk, ferruh.yigit
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev

On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > 
> > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > 
> > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > for incoming packets. When number of representors scale out in a
> > > > > > switch domain, the memory consumption became significant. Most
> > > > > > important, polling all ports leads to high cache miss, high
> > > > > > latency and low throughput.
> > > > > > 
> > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > 
> > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > index is
> > > > > > 1:1 mapped in shared group.
> > > > > > 
> > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > 
> > > > > > Multiple groups is supported by group ID.
> > > > > 
> > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > 
> > > > Yes, PF and representor in switch domain could take advantage.
> > > > 
> > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > 
> > > > Not quite sure that I understood your question. The control path is
> > > > almost the same as before: PF and representor ports are still needed, and rte flows are not impacted.
> > > > Queues are still needed for each member port; descriptors (mbufs) will be
> > > > supplied from the shared Rx queue in my PMD implementation.
> > > 
> > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload and multiple ethdev receive queues land in the same
> > > receive queue, how is the flow order maintained for the respective receive queues?
> > 
> > I guess the question is about the testpmd forwarding stream? The forwarding logic has to change slightly for a shared rxq.
> > Basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > Packets from the same source port can be grouped into a small burst for processing, which improves performance if traffic comes from
> > a limited number of ports. I'll introduce a common API for shared rxq forwarding that is called with a packet-handling callback, so it suits
> > all forwarding engines. Will send patches soon.
> > 
> 
> All ports will put the packets into the same queue (the shared queue), right? Does
> this mean only a single core will poll? What will happen if there are
> multiple cores polling, won't it cause problems?
> 
> And if this requires specific changes in the application, I am not sure about
> the solution. Can't this work in a way that is transparent to the application?

After discussing with Jerin, v3 2/8 introduces a new API that aggregates
the ports in the same group into one new port. Users can schedule polling
on the aggregated port instead of on all member ports.
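
A minimal sketch of what the application side could look like after one
poll of the aggregated port: the burst is split into runs of packets that
share the same source port, and each run is handed to a callback. This is
plain C with stand-in types, not DPDK code; struct pkt, sub_burst_cb,
dispatch_by_port() and count_cb() are hypothetical names used only for
illustration, and the port field stands in for mbuf->port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf: only the source port matters. */
struct pkt {
	uint16_t port; /* source port, as mbuf->port would report it */
};

/* Hypothetical handler, called once per run of same-port packets. */
typedef void (*sub_burst_cb)(uint16_t port, struct pkt **pkts,
			     uint16_t n, void *ctx);

/*
 * Walk one burst received from a shared Rx queue and hand each run of
 * consecutive packets from the same source port to the callback.
 * Returns the number of sub-bursts dispatched.
 */
static uint16_t
dispatch_by_port(struct pkt **pkts, uint16_t nb_rx, sub_burst_cb cb, void *ctx)
{
	uint16_t start = 0, i, nb_sub = 0;

	assert(pkts != NULL || nb_rx == 0);
	assert(cb != NULL);
	for (i = 1; i <= nb_rx; i++) {
		/* Close the current run at the end of the burst or on a port change. */
		if (i == nb_rx || pkts[i]->port != pkts[start]->port) {
			cb(pkts[start]->port, &pkts[start],
			   (uint16_t)(i - start), ctx);
			nb_sub++;
			start = i;
		}
	}
	return nb_sub;
}

/* Example callback: count packets per source port. */
#define MAX_PORTS 8
static void
count_cb(uint16_t port, struct pkt **pkts, uint16_t n, void *ctx)
{
	uint32_t *counts = ctx; /* array of MAX_PORTS counters */

	(void)pkts;
	if (port < MAX_PORTS)
		counts[port] += n;
}
```

Grouping by runs rather than sorting keeps the packet order within each
source port unchanged, which is the property the flow-ordering question
above is concerned with.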

> 
> Overall, is this for optimizing memory for the port representors? If so, can't we
> have a port-representor-specific solution? Reducing the scope can reduce the
> complexity it brings.
> 
> > > If this offload is only useful for the representor case, can we make it specific to the representor case by changing its name and
> > > scope?
> > 
> > It works for both PF and representors in the same switch domain. For applications like OVS, only a few changes are needed.
> > 
> > > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > ---
> > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > >  5 files changed, 30 insertions(+)
> > > > > > 
> > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > --- a/doc/guides/nics/features.rst
> > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > 
> > > > > > 
> > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > +
> > > > > > +Shared Rx queue
> > > > > > +---------------
> > > > > > +
> > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > +
> > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > +
> > > > > > +
> > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > 
> > > > > >  Packet type parsing
> > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > b/doc/guides/nics/features/default.ini
> > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > >  Queue start/stop     =
> > > > > >  Runtime Rx queue setup =
> > > > > >  Runtime Tx queue setup =
> > > > > > +Shared Rx queue      =
> > > > > >  Burst mode info      =
> > > > > >  Power mgmt address monitor =
> > > > > >  MTU update           =
> > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > 
> > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > `_
> > > > > > 
> > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > +
> > > > > >  Basic SR-IOV
> > > > > >  ------------
> > > > > > 
> > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > >  };
> > > > > > 
> > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > >         /**
> > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > +/**
> > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > + */
> > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > 
> > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > --
> > > > > > 2.25.1
> > > > > > 
> 



* Re: [dpdk-dev] [PATCH v3 2/8] ethdev: new API to aggregate shared Rx queue group
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 2/8] ethdev: new API to aggregate shared Rx queue group Xueming Li
@ 2021-09-26 17:54     ` Ajit Khaparde
  0 siblings, 0 replies; 266+ messages in thread
From: Ajit Khaparde @ 2021-09-26 17:54 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ray Kinsella


On Fri, Sep 17, 2021 at 1:02 AM Xueming Li <xuemingl@nvidia.com> wrote:
>
> This patch introduces new api to aggreated ports among same shared Rx
s/aggreated/aggregate

> queue group.  Only queues with specified share group is aggregated.
s/is/are

> Rx burst and device close are expected to be supported by new device.
>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Minor nits - typos actually!

> ---
>  lib/ethdev/ethdev_driver.h | 23 ++++++++++++++++++++++-
>  lib/ethdev/rte_ethdev.c    | 22 ++++++++++++++++++++++
>  lib/ethdev/rte_ethdev.h    | 16 ++++++++++++++++
>  lib/ethdev/version.map     |  3 +++
>  4 files changed, 63 insertions(+), 1 deletion(-)
>
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 524757cf6f..72156a4153 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -786,10 +786,28 @@ typedef int (*eth_get_monitor_addr_t)(void *rxq,
>   * @return
>   *   Negative errno value on error, number of info entries otherwise.
>   */
> -
>  typedef int (*eth_representor_info_get_t)(struct rte_eth_dev *dev,
>         struct rte_eth_representor_info *info);
>
> +/**
> + * @internal
> + * Aggregate shared Rx queue.
> + *
> + * Create a new port used for shared Rx queue polling.
> + *
> + * Only queues with specified share group are aggregated.
> + * At least Rx burst and device close should be supported.
> + *
> + * @param dev
> + *   Ethdev handle of port.
> + * @param group
> + *   Shared Rx queue group to aggregate.
> + * @return
> + *   UINT16_MAX if failed, otherwise aggregated port number.
> + */
> +typedef int (*eth_shared_rxq_aggregate_t)(struct rte_eth_dev *dev,
> +                                         uint32_t group);
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -950,6 +968,9 @@ struct eth_dev_ops {
>
>         eth_representor_info_get_t representor_info_get;
>         /**< Get representor info. */
> +
> +       eth_shared_rxq_aggregate_t shared_rxq_aggregate;
> +       /**< Aggregate shared Rx queue. */
>  };
>
>  /**
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index b3a58d5e65..9f2ef58309 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -6301,6 +6301,28 @@ rte_eth_representor_info_get(uint16_t port_id,
>         return eth_err(port_id, (*dev->dev_ops->representor_info_get)(dev, info));
>  }
>
> +uint16_t
> +rte_eth_shared_rxq_aggregate(uint16_t port_id, uint32_t group)
> +{
> +       struct rte_eth_dev *dev;
> +       uint64_t offloads;
> +
> +       RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +       dev = &rte_eth_devices[port_id];
> +
> +       RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->shared_rxq_aggregate,
> +                               UINT16_MAX);
> +
> +       offloads = dev->data->dev_conf.rxmode.offloads;
> +       if ((offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) == 0) {
> +               RTE_ETHDEV_LOG(ERR, "port_id=%u doesn't support Rx offload\n",
> +                              port_id);
> +               return UINT16_MAX;
> +       }
> +
> +       return (*dev->dev_ops->shared_rxq_aggregate)(dev, group);
> +}
> +
>  RTE_LOG_REGISTER_DEFAULT(rte_eth_dev_logtype, INFO);
>
>  RTE_INIT(ethdev_init_telemetry)
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index a578c9db9d..f15d2142b2 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -4895,6 +4895,22 @@ __rte_experimental
>  int rte_eth_representor_info_get(uint16_t port_id,
>                                  struct rte_eth_representor_info *info);
>
> +/**
> + * Aggregate shared Rx queue ports to one port for polling.
> + *
> + * Only queues with specified share group is aggregated.
s/is/are

> + * Any operation besides Rx burst and device close is unexpected.
> + *
> + * @param port_id
> + *   The port identifier of the device from shared Rx queue group.
> + * @param group
> + *   Shared Rx queue group to aggregate.
> + * @return
> + *   UINT16_MAX if failed, otherwise aggregated port number.
> + */
> +__rte_experimental
> +uint16_t rte_eth_shared_rxq_aggregate(uint16_t port_id, uint32_t group);
> +
>  #include <rte_ethdev_core.h>
>
>  /**
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 3eece75b72..97a2233508 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -249,6 +249,9 @@ EXPERIMENTAL {
>         rte_mtr_meter_policy_delete;
>         rte_mtr_meter_policy_update;
>         rte_mtr_meter_policy_validate;
> +
> +       # added in 21.11
> +       rte_eth_shared_rxq_aggregate;
>  };
>
>  INTERNAL {
> --
> 2.33.0
>


* Re: [dpdk-dev] [PATCH v3 1/8] ethdev: introduce shared Rx queue
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 1/8] " Xueming Li
@ 2021-09-27 23:53     ` Ajit Khaparde
  2021-09-28 14:24       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Ajit Khaparde @ 2021-09-27 23:53 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit


On Fri, Sep 17, 2021 at 1:02 AM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In current DPDK framework, each RX queue is pre-loaded with mbufs for
> incoming packets. When number of representors scale out in a switch
> domain, the memory consumption became significant. Most important,
> polling all ports leads to high cache miss, high latency and low
> throughput.
>
> This patch introduces shared RX queue. Ports with same configuration in
> a switch domain could share RX queue set by specifying sharing group.
> Polling any queue using same shared RX queue receives packets from all
> member ports. Source port is identified by mbuf->port.
>
> Port queue number in a shared group should be identical. Queue index is
> 1:1 mapped in shared group.
>
> Share RX queue must be polled on single thread or core.
>
> Multiple groups is supported by group ID.
Can you clarify this a little more?

Apologies if this was already covered:
* Can't we do this for Tx also?

Couple of nits inline. Thanks

>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Cc: Jerin Jacob <jerinjacobk@gmail.com>
> ---
> Rx queue object could be used as shared Rx queue object, it's important
> to clear all queue control callback api that using queue object:
>   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> ---
>  doc/guides/nics/features.rst                    | 11 +++++++++++
>  doc/guides/nics/features/default.ini            |  1 +
>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
>  lib/ethdev/rte_ethdev.c                         |  1 +
>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
>  5 files changed, 30 insertions(+)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index a96e12d155..2e2a9b1554 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>
>
> +.. _nic_features_shared_rx_queue:
> +
> +Shared Rx queue
> +---------------
> +
> +Supports shared Rx queue for ports in same switch domain.
> +
> +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> +* **[provides] mbuf**: ``mbuf.port``.
> +
> +
>  .. _nic_features_packet_type_parsing:
>
>  Packet type parsing
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index 754184ddd4..ebeb4c1851 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>  Queue start/stop     =
>  Runtime Rx queue setup =
>  Runtime Tx queue setup =
> +Shared Rx queue      =
>  Burst mode info      =
>  Power mgmt address monitor =
>  MTU update           =
> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c80..45bf5a3a10 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>
> +- Memory usage of representors is huge when number of representor grows,
> +  because PMD always allocate mbuf for each descriptor of Rx queue.
> +  Polling the large number of ports brings more CPU load, cache miss and
> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``

"in the same switch"

> +  is present in Rx offloading capability of device info. Setting the
> +  offloading flag in device Rx mode or Rx queue configuration to enable
> +  shared Rx queue. Polling any member port of shared Rx queue can return
"of the shared Rx queue.."

> +  packets of all ports in group, port ID is saved in ``mbuf.port``.

"ports in the group, "

> +
>  Basic SR-IOV
>  ------------
>
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index a7c090ce79..b3a58d5e65 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -127,6 +127,7 @@ static const struct {
>         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
>         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
>  };
>
>  #undef RTE_RX_OFFLOAD_BIT2STR
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index d2b27c351f..a578c9db9d 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +       uint32_t shared_group; /**< Shared port group index in switch domain. */
>         /**
>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> +/**
> + * Rx queue is shared among ports in same switch domain to save memory,
> + * avoid polling each port. Any port in group can be used to receive packets.

"Any port in the group can"


> + * Real source port number saved in mbuf->port field.
> + */
> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
>
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> --
> 2.33.0
>


* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-09-16  4:16                       ` Jerin Jacob
@ 2021-09-28  5:50                         ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28  5:50 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Thu, 2021-09-16 at 09:46 +0530, Jerin Jacob wrote:
> On Wed, Sep 15, 2021 at 8:15 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > Hi Jerin,
> > 
> > On Mon, 2021-08-30 at 15:01 +0530, Jerin Jacob wrote:
> > > On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Thursday, August 26, 2021 7:58 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > 
> > > > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > > 
> > > > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > > > > 
> > > > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > > > > > queue
> > > > > > > > > > > 
> > > > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > > > > consumption became significant. Most important, polling
> > > > > > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > > > > 
> > > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > > > 
> > > > > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > > > > Queue index is
> > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > 
> > > > > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > > > > > 
> > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > > > > > > 
> > > > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > > > > > 
> > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > > > */
> > > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > > > > + index in switch domain. */
> > > > > > > > > > > 
> > > > > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > > > > How this group is created?
> > > > > > > > > > 
> > > > > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > > > > > 
> > > > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > > > to support group other than default.
> > > > > > > > > > 
> > > > > > > > > > To support more groups simultaneously, need to consider
> > > > > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > > > > It's possible to specify how many ports to increase group
> > > > > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > > > > > 
> > > > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > > > > > 
> > > > > > > > > Ack. One group is enough in testpmd.
> > > > > > > > > 
> > > > > > > > > > My question was more about who creates this group and how.
> > > > > > > > > > Shouldn't we have an API to create shared_group? If we do the following, I can at least see how it could be implemented in SW
> > > > > > > > > > or other HW.
> > > > > > > > > 
> > > > > > > > > - Create aggregation queue group
> > > > > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > > > > - Pull the packets from the queue group(which internally fetch
> > > > > > > > > from the Rx queues _attached_)
> > > > > > > > > 
> > > > > > > > > Does the above kind of sequence, break your representor use case?
> > > > > > > > 
> > > > > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > > > > > 
> > > > > > > Which rte_flow pattern/action for this?
> > > > > > 
> > > > > > > No rte_flow for this, I just recalled that a group in rte_flow is created along with the flow, not via an API.
> > > > > > > I don't see anything else to create along with the group, so I just doubt whether it is valuable to introduce a new API set to manage groups.
> > > > > 
> > > > > See below.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > > > > >   the shared rxq group - this could be an helper API.
> > > > > > > > 
> > > > > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > > > > > 
> > > > > > > > Is this feature based on HW support, or is it a pure
> > > > > > > > SW thing? If it is SW, it is better to just have a new vdev like drivers/net/bonding/. That way we can aggregate multiple Rxqs across
> > > > > > > > multiple ports of the same driver.
> > > > > > 
> > > > > > Based on HW support.
> > > > > 
> > > > > > In Marvell HW, we have some support for this. I will outline it here, along with some queries.
> > > > > > 
> > > > > > # We need to create some new HW structure for aggregation
> > > > > > # Connect each Rxq to the new HW structure for aggregation
> > > > > > # Use rx_burst from the new HW structure.
> > > > > 
> > > > > Could you outline your HW support?
> > > > > 
> > > > > > Also, I am not able to understand how this will reduce memory. At least in our HW, we need to allocate more memory to handle this, as
> > > > > > we need to deal with the new HW structure.
> > > > > > 
> > > > > > How does it reduce memory in your HW? Also, if memory is the constraint, why NOT reduce the number of queues?
> > > > > 
> > > > 
> > > > > Glad to know that Marvell is working on this, what's the status of the driver implementation?
> > > > > 
> > > > > My PMD implementation is very similar: a new HW object, a shared memory pool, is created to replace the per-rxq memory pool.
> > > > > A legacy rxq is fed with as many allocated mbufs as it has descriptors. Now shared rxqs share the same pool, so there is no need to supply
> > > > > mbufs for each rxq; only the shared rxq is fed.
> > > > > 
> > > > > So the memory saving is in mbufs per rxq: even with 1000 representors in a shared rxq group, the mbufs consumed are those of one rxq.
> > > > > In other words, new members of a shared rxq don't allocate new mbufs to feed their rxq; they share the existing shared rxq (HW mempool).
> > > > > I agree that the memory required to set up each rxq doesn't change much.
> > > 
> > > We can ask the application to configure the same mempool for multiple
> > > RQs too, right? If the saving is based on sharing the mempool
> > > with multiple RQs.
> > > 
> > > > 
> > > > > # Also, I was thinking, one way to avoid the fast path or ABI change would like.
> > > > > 
> > > > > # Driver Initializes one more eth_dev_ops in driver as aggregator ethdev
> > > > > # devargs of new ethdev or specific API like drivers/net/bonding/rte_eth_bond.h can take the argument (port, queue) tuples which needs to aggregate by new ethdev port
> > > > > # No change in fastpath or ABI is required in this model.
> > > > > 
> > > > 
> > > > This could be an option to access shared rxq. What's the difference of the new PMD?
> > > 
> > > No ABI or fastpath change is required.
> > > 
> > > > What's the difference of PMD driver to create the new device?
> > > > 
> > > > Is it important in your implementation? Does it work with existing rx_burst api?
> > > 
> > > Yes . It will work with the existing rx_burst API.
> > > 
> > 
> > The aggregator ethdev required by the user is a port, so maybe it is good to add
> > a callback for the PMD to prepare a complete ethdev, just like creating a
> > representor ethdev - the PMD registers a new port internally. If the PMD
> > doesn't provide the callback, the ethdev api falls back to initializing an
> > empty ethdev by copying rxq data(shared) and rx_burst api from the source port
> > and share group. Actually users can do this fallback themselves or with
> > a util api.
> > 
> > IIUC, an aggregator ethdev is not a must; do you think we can continue and
> > leave that design to a later stage?
> 
> 
> IMO aggregator ethdev reduces the complexity for application hence
> avoid any change in
> test application etc. IMO, I prefer to take that. I will leave the
> decision to ethdev maintainers.

Hi Jerin, a new API for the aggregator has been added as the last patch in v3, thanks!
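
As a rough illustration of the mbuf saving discussed earlier in this thread, the sketch below compares how many mbufs get pre-loaded when every port feeds its own Rx queues versus when all ports in a group feed one shared Rx queue set. The helper names and the example numbers are assumptions for illustration only; they are not part of the patch and not measured data.

```c
#include <stdint.h>

/*
 * Hypothetical helper: mbufs pre-loaded when every port in the group
 * feeds its own Rx queues (legacy model).
 */
static inline uint64_t
toy_mbufs_per_port_rxq(uint64_t nb_ports, uint64_t nb_queues, uint64_t nb_desc)
{
	return nb_ports * nb_queues * nb_desc;
}

/*
 * Hypothetical helper: mbufs pre-loaded when all ports in the group
 * share one Rx queue set, so only the shared queues are fed.
 */
static inline uint64_t
toy_mbufs_shared_rxq(uint64_t nb_queues, uint64_t nb_desc)
{
	return nb_queues * nb_desc;
}
```

For example, with an assumed 1000 representors, 1 queue per port and 1024 descriptors per queue, the first helper gives 1024000 mbufs and the second gives 1024. Per-rxq setup memory is not modelled here; as noted above, it doesn't change much.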

> 
> 
> > 
> > > > 
> > > > > 
> > > > > 
> > > > > > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > > > > > but some user might prefer grouping some hot
> > > > > > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> > > > > Anyway, welcome any suggestion.
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > >         /**
> > > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > +/**
> > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > > > > > +to save memory,
> > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > + */
> > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > 
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > > > > > \
> > > > > > > > > > > > --
> > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > 
> > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-09-01 14:44                 ` Xueming(Steven) Li
@ 2021-09-28  5:54                   ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28  5:54 UTC (permalink / raw)
  To: jerinjacobk; +Cc: xiaoyun.li, Jack Min, dev

On Wed, 2021-09-01 at 14:44 +0000, Xueming(Steven) Li wrote:
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Xueming(Steven) Li
> > Sent: Sunday, August 29, 2021 3:08 PM
> > To: Jerin Jacob <jerinjacobk@gmail.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 26, 2021 7:28 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li
> > > <xiaoyun.li@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > wrapper function
> > > 
> > > On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Wednesday, August 18, 2021 7:48 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common
> > > > > fwd wrapper function
> > > > > 
> > > > > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add
> > > > > > > common fwd wrapper function
> > > > > > > 
> > > > > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > > > 
> > > > > > > > Added an inline common wrapper function for all fwd engines
> > > > > > > > which do the following in common:
> > > > > > > > 
> > > > > > > > 1. get_start_cycles
> > > > > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > > > > > get_end_cycle
> > > > > > > > 
> > > > > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > > > ---
> > > > > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > > > > >  1 file changed, 24 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> > > > > > > > index
> > > > > > > > 13141dfed9..b685ac48d6 100644
> > > > > > > > --- a/app/test-pmd/testpmd.h
> > > > > > > > +++ b/app/test-pmd/testpmd.h
> > > > > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t
> > > > > > > > portid); void remove_tx_dynf_callback(portid_t portid);  int
> > > > > > > > update_jumbo_frame_offload(portid_t portid);
> > > > > > > > 
> > > > > > > > +static inline void
> > > > > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > > > > +       uint16_t nb_rx;
> > > > > > > > +       uint64_t start_tsc = 0;
> > > > > > > > +
> > > > > > > > +       get_start_cycles(&start_tsc);
> > > > > > > > +
> > > > > > > > +       /*
> > > > > > > > +        * Receive a burst of packets and forward them.
> > > > > > > > +        */
> > > > > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > > > > +       if (unlikely(nb_rx == 0))
> > > > > > > > +               return;
> > > > > > > > +       if (unlikely(rxq_share > 0))
> > > > > > > 
> > > > > > > See below. It reads a global memory.
> > > > > > > 
> > > > > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > > > > +       else
> > > > > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > > > > > 
> > > > > > > New function pointer in fastpath.
> > > > > > > 
> > > > > > > IMO, We should not create performance regression for the existing forward engine.
> > > > > > > Can we have a new forward engine just for shared memory testing?
> > > > > > 
> > > > > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > > > > Based on test data, the impacts almost invisible in legacy mode.
> > > > > 
> > > > > Are you saying there is zero % regression? If not, could you share the data?
> > > > 
> > > > Almost zero, here is a quick single core result of rxonly:
> > > >         32.2Mpps, 58.9cycles/packet
> > > > Revert the patch to rxonly.c:
> > > >         32.1Mpps 59.9cycles/packet
> > > > The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
> > > >         32.2Mpps, 58.9cycles/packet
> > > > There were small digit jumps between testpmd restart, I picked the best one.
> > > > The result is almost same, seems the cost of each packet is small enough.
> > > > BTW, I'm testing with default burst size and queue depth.
> > > 
> > > I tested this on octeontx2 with iofwd on a single core at 100Gbps:
> > > Without this patch - 73.5 Mpps
> > > With this patch - 72.8 Mpps
> > > 
> > > If we are taking the shared queue as a runtime option without a separate fwd engine,
> > > and want zero performance impact and no compile time flag, then I think the only way is to have a function template.
> > > Example change to outline the function template principle:
> > > 
> > > static inline void
> > > __pkt_burst_io_forward(struct fwd_stream *fs, const uint64_t flags)
> > > {
> > >         /* Introduce new checks under: */
> > >         if (flags & SHARED_QUEUE) {
> > >                 /* shared Rx queue handling */
> > >         }
> > > }
> > > 
> > > Have two versions of io_fwd_engine.packet_fwd per engine.
> > > 
> > > - First version
> > > static void
> > > pkt_burst_io_forward(struct fwd_stream *fs)
> > > {
> > >         __pkt_burst_io_forward(fs, 0);
> > > }
> > > 
> > > - Second version
> > > static void
> > > pkt_burst_io_forward_shared_queue(struct fwd_stream *fs)
> > > {
> > >         __pkt_burst_io_forward(fs, SHARED_QUEUE);
> > > }
> > > 
> > > 
> > > Update io_fwd_engine.packet_fwd in slowpath to respective version based on offload.
> > > 
> > > If the shared offload is not selected, pkt_burst_io_forward() will be
> > > selected, and __pkt_burst_io_forward() will be a compile-time version
> > > of !SHARED_QUEUE, i.e. the same as the existing code.
> > 
> > Thanks for testing and suggestion. So the only difference here in above code is access to rxq_shared changed to function parameter,
> > right? Have you tested this performance? If not, I could verify.
> 
> The performance result looks better after removing this wrapper and hiding the global variable access as you suggested, thanks!
> I also tried adding an rxq_share bit field in struct fwd_stream: same result as the static function selection, with fewer changes.

These changes are reflected in v3, which also consolidates the patches for each
forwarding engine into one; please check.
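
For readers following the thread, the function-template idea suggested above can be sketched roughly as below. This is only an illustration with stand-in types: struct toy_stream, FWD_F_SHARED_RXQ and all toy_* names are hypothetical and are not testpmd or DPDK definitions, and it is not the code that went into v3.

```c
#include <stdint.h>

/* Hypothetical flag for illustration, not a DPDK define. */
#define FWD_F_SHARED_RXQ (UINT64_C(1) << 0)

/* Stand-in for testpmd's struct fwd_stream; only counters are modelled. */
struct toy_stream {
	uint64_t fwd_direct; /* packets handled on the legacy path */
	uint64_t fwd_shared; /* packets handled on the shared rxq path */
};

/*
 * Template body. 'flags' is a compile-time constant in each wrapper
 * below, so once inlined the compiler can drop the untaken branch.
 */
static inline void
toy_burst_fwd(struct toy_stream *fs, uint16_t nb_rx, const uint64_t flags)
{
	if (flags & FWD_F_SHARED_RXQ)
		fs->fwd_shared += nb_rx;
	else
		fs->fwd_direct += nb_rx;
}

/* First version: legacy path, no shared rxq check left at runtime. */
static void
toy_fwd_legacy(struct toy_stream *fs, uint16_t nb_rx)
{
	toy_burst_fwd(fs, nb_rx, 0);
}

/* Second version: shared rxq path. */
static void
toy_fwd_shared(struct toy_stream *fs, uint16_t nb_rx)
{
	toy_burst_fwd(fs, nb_rx, FWD_F_SHARED_RXQ);
}

typedef void (*toy_fwd_t)(struct toy_stream *fs, uint16_t nb_rx);

/* Slow path: pick the specialised version once, based on configuration. */
static inline toy_fwd_t
toy_select_fwd(int rxq_share)
{
	return rxq_share ? toy_fwd_shared : toy_fwd_legacy;
}
```

The selection happens once in the slow path, so the legacy forwarding function never reads a global or tests a flag per burst.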

> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > From test perspective, better to have all forward engine to
> > > > > > verify shared rxq, test team want to run the regression with
> > > > > > less impacts. Hope to have a solution to utilize all forwarding
> > > > > > engines
> > > > > seamlessly.
> > > > > 
> > > > > Yes, it is a good goal. testpmd forward performance is used as a synthetic benchmark by everyone.
> > > > > I think, we are aligned to not have any regression for the generic forward engine.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > > > > +
> > > > > > > >  /*
> > > > > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > > > > >   * rte_be_to_cpu_16() function.
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-26  5:35             ` Xueming(Steven) Li
@ 2021-09-28  9:35               ` Jerin Jacob
  2021-09-28 11:36                 ` Xueming(Steven) Li
                                   ` (3 more replies)
  0 siblings, 4 replies; 266+ messages in thread
From: Jerin Jacob @ 2021-09-28  9:35 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: ferruh.yigit, NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev

On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > >
> > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > latency and low throughput.
> > > > > > >
> > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > >
> > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > index is
> > > > > > > 1:1 mapped in shared group.
> > > > > > >
> > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > >
> > > > > > > Multiple groups is supported by group ID.
> > > > > >
> > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > >
> > > > > Yes, PF and representor in switch domain could take advantage.
> > > > >
> > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > >
> > > > > Not quite sure that I understood your question. The control path of is
> > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > supplied from shared Rx queue in my PMD implementation.
> > > >
> > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > >
> > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > > Packets from the same source port could be grouped as a small burst to process; this will accelerate the performance if traffic comes from
> > > a limited number of ports. I'll introduce some common api to do shared rxq forwarding, called with a packet handling callback, so it suits
> > > all forwarding engines. Will send patches soon.
> > >
> >
> > All ports will put the packets into the same queue (shared queue), right? Does
> > this mean only a single core will poll? What will happen if there are
> > multiple cores polling, won't it cause problems?
> >
> > And if this requires specific changes in the application, I am not sure about
> > the solution, can't this work in a transparent way to the application?
>
> Discussed with Jerin, a new API is introduced in v3 2/8 that aggregates ports
> in the same group into one new port. Users could schedule polling on the
> aggregated port instead of all member ports.

v3 still has testpmd changes in the fastpath, right? IMO, for this
feature we should not change the fastpath of the testpmd application.
Instead, testpmd can use aggregated ports, probably as a separate
fwd_engine, to show how to use this feature.
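
For context, the per-burst demultiplexing described earlier for shared rxq (look up the source by mbuf->port and group packets from the same port into small bursts) might look roughly like the sketch below. It uses stand-in types only: struct toy_mbuf models just the port field of an mbuf, and the toy_* names and TOY_MAX_PORTS are hypothetical, not the testpmd patch.

```c
#include <stdint.h>

#define TOY_MAX_PORTS 8 /* hypothetical limit for the example callback */

/* Stand-in for struct rte_mbuf; only the source port is modelled. */
struct toy_mbuf {
	uint16_t port;
};

typedef void (*toy_sub_burst_cb)(uint16_t src_port, struct toy_mbuf **pkts,
				 uint16_t nb, void *arg);

/*
 * Split a received burst into runs of consecutive packets that share a
 * source port and hand each run to the callback.
 * Returns the number of runs.
 */
static inline uint16_t
toy_forward_shared_rxq(struct toy_mbuf **pkts, uint16_t nb_rx,
		       toy_sub_burst_cb cb, void *arg)
{
	uint16_t start = 0, runs = 0;
	uint16_t i;

	for (i = 1; i <= nb_rx; i++) {
		/* Close the current run at the end of burst or on a port change. */
		if (i == nb_rx || pkts[i]->port != pkts[start]->port) {
			cb(pkts[start]->port, &pkts[start],
			   (uint16_t)(i - start), arg);
			runs++;
			start = i;
		}
	}
	return runs;
}

/* Example callback: count packets per source port into a uint32_t array. */
static inline void
toy_count_cb(uint16_t src_port, struct toy_mbuf **pkts, uint16_t nb, void *arg)
{
	uint32_t *per_port = arg;

	(void)pkts;
	if (src_port < TOY_MAX_PORTS)
		per_port[src_port] += nb;
}
```

The fewer source ports a burst mixes, the fewer callback invocations it needs, which is why this path is cheapest when traffic comes from a limited number of ports.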

>
> >
> > Overall, is this for optimizing memory for the port representors? If so can't we
> > have a port representor specific solution, reducing scope can reduce the
> > complexity it brings?
> >
> > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > scope.
> > >
> > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > ---
> > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > >  5 files changed, 30 insertions(+)
> > > > > > >
> > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > >
> > > > > > >
> > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > +
> > > > > > > +Shared Rx queue
> > > > > > > +---------------
> > > > > > > +
> > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > +
> > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > +
> > > > > > > +
> > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > >
> > > > > > >  Packet type parsing
> > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > >  Queue start/stop     =
> > > > > > >  Runtime Rx queue setup =
> > > > > > >  Runtime Tx queue setup =
> > > > > > > +Shared Rx queue      =
> > > > > > >  Burst mode info      =
> > > > > > >  Power mgmt address monitor =
> > > > > > >  MTU update           =
> > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > >
> > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > `_
> > > > > > >
> > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > +grows,
> > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > +miss and
> > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > +PF and
> > > > > > > +  representors in same switch domain.
> > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > +the
> > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > +enable
> > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > +return
> > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > +
> > > > > > >  Basic SR-IOV
> > > > > > >  ------------
> > > > > > >
> > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > >  };
> > > > > > >
> > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > */
> > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > + switch domain. */
> > > > > > >         /**
> > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > +/**
> > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > +memory,
> > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > + */
> > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > >
> > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > >
> >
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28  9:35               ` Jerin Jacob
@ 2021-09-28 11:36                 ` Xueming(Steven) Li
  2021-09-28 11:37                 ` Xueming(Steven) Li
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 11:36 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:

On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com<mailto:xuemingl@nvidia.com>> wrote:


On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:

On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:



-----Original Message-----

From: Jerin Jacob <jerinjacobk@gmail.com<mailto:jerinjacobk@gmail.com>>

Sent: Wednesday, August 11, 2021 4:03 PM

To: Xueming(Steven) Li <xuemingl@nvidia.com<mailto:xuemingl@nvidia.com>>

Cc: dpdk-dev <dev@dpdk.org<mailto:dev@dpdk.org>>; Ferruh Yigit <ferruh.yigit@intel.com<mailto:ferruh.yigit@intel.com>>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net<mailto:thomas@monjalon.net>>;

Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru<mailto:andrew.rybchenko@oktetlabs.ru>>

Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue


On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com<mailto:xuemingl@nvidia.com>> wrote:


Hi,


-----Original Message-----

From: Jerin Jacob <jerinjacobk@gmail.com<mailto:jerinjacobk@gmail.com>>

Sent: Monday, August 9, 2021 9:51 PM

To: Xueming(Steven) Li <xuemingl@nvidia.com<mailto:xuemingl@nvidia.com>>

Cc: dpdk-dev <dev@dpdk.org<mailto:dev@dpdk.org>>; Ferruh Yigit <ferruh.yigit@intel.com<mailto:ferruh.yigit@intel.com>>;

NBU-Contact-Thomas Monjalon <thomas@monjalon.net<mailto:thomas@monjalon.net>>; Andrew Rybchenko

<andrew.rybchenko@oktetlabs.ru<mailto:andrew.rybchenko@oktetlabs.ru>>

Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue


On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com<mailto:xuemingl@nvidia.com>> wrote:


In current DPDK framework, each RX queue is pre-loaded with mbufs

for incoming packets. When number of representors scale out in a

switch domain, the memory consumption became significant. Most

important, polling all ports leads to high cache miss, high

latency and low throughput.


This patch introduces shared RX queue. Ports with same

configuration in a switch domain could share RX queue set by specifying sharing group.

Polling any queue using same shared RX queue receives packets from

all member ports. Source port is identified by mbuf->port.


Port queue number in a shared group should be identical. Queue

index is

1:1 mapped in shared group.


Share RX queue is supposed to be polled on same thread.


Multiple groups is supported by group ID.


Is this offload specific to the representor? If so can this name be changed specifically to representor?


Yes, PF and representor in switch domain could take advantage.


If it is for a generic case, how the flow ordering will be maintained?


Not quite sure that I understood your question. The control path of is

almost same as before, PF and representor port still needed, rte flows not impacted.

Queues still needed for each member port, descriptors(mbuf) will be

supplied from shared Rx queue in my PMD implementation.


My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same

receive queue, In that case, how the flow order is maintained for respective receive queues.


I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.

basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.

Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from

limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for

all forwarding engine. Will sent patches soon.



All ports will put the packets in to the same queue (share queue), right? Does

this means only single core will poll only, what will happen if there are

multiple cores polling, won't it cause problem?


And if this requires specific changes in the application, I am not sure about

the solution, can't this work in a transparent way to the application?


Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports

in same group into one new port. Users could schedule polling on the

aggregated port instead of all member ports.


> The v3 still has testpmd changes in fastpath. Right? IMO, For this
> feature, we should not change fastpath of testpmd
> application. Instead, testpmd can use aggregated ports probably as
> separate fwd_engine to show how to use this feature.

Good point to discuss :) There are two strategies for polling a shared
Rxq:

1. Polling each member port
   All forwarding engines can be reused to work as before.
   My testpmd patches are efforts in this direction.
   Does your PMD support this?

2. Polling the aggregated port
   Besides a forwarding engine, more work is needed to demo it.
   This is an optional API, not supported by my PMD yet.
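
To make the two strategies concrete, here is a toy model, assuming only what the patch description states: polling any member port of a shared Rx queue returns packets of all ports in the group. The ring, the toy_* names and TOY_RING_SZ are hypothetical stand-ins; nothing here is the ethdev API or a PMD implementation.

```c
#include <stdint.h>

#define TOY_RING_SZ 64 /* hypothetical ring size */

/* One Rx queue shared by all member ports of a group. */
struct toy_shared_rxq {
	uint16_t src_port[TOY_RING_SZ]; /* source port of each pending packet */
	uint32_t head;                  /* consumer index */
	uint32_t tail;                  /* producer index */
	uint32_t nb_polls;              /* number of rx_burst-like calls made */
};

/* Model of a packet arriving from 'port'. Returns -1 if the ring is full. */
static inline int
toy_enqueue(struct toy_shared_rxq *q, uint16_t port)
{
	if (q->tail - q->head >= TOY_RING_SZ)
		return -1;
	q->src_port[q->tail++ % TOY_RING_SZ] = port;
	return 0;
}

/* Model of rx_burst on any member port: all members drain the same queue. */
static inline uint16_t
toy_rx_burst(struct toy_shared_rxq *q, uint16_t *ports_out, uint16_t n)
{
	uint16_t nb = 0;

	q->nb_polls++;
	while (nb < n && q->head != q->tail)
		ports_out[nb++] = q->src_port[q->head++ % TOY_RING_SZ];
	return nb;
}

/* Strategy 1: poll every member port; each call hits the same shared queue. */
static inline uint32_t
toy_poll_members(struct toy_shared_rxq *q, uint16_t nb_members, uint16_t burst)
{
	uint16_t out[TOY_RING_SZ];
	uint32_t total = 0;
	uint16_t m;

	if (burst > TOY_RING_SZ)
		burst = TOY_RING_SZ;
	for (m = 0; m < nb_members; m++)
		total += toy_rx_burst(q, out, burst);
	return total;
}

/* Strategy 2: poll once through a single aggregated handle. */
static inline uint32_t
toy_poll_aggregated(struct toy_shared_rxq *q, uint16_t burst)
{
	uint16_t out[TOY_RING_SZ];

	if (burst > TOY_RING_SZ)
		burst = TOY_RING_SZ;
	return toy_rx_burst(q, out, burst);
}
```

In this model both strategies receive the same packets; the difference is how many polls a loop iteration issues, one per member port versus one per aggregated port.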






Overall, is this for optimizing memory for the port represontors? If so can't we

have a port representor specific solution, reducing scope can reduce the

complexity it brings?


If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and

scope.


It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.







Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 doc/guides/nics/features.rst                    | 11 +++++++++++
 doc/guides/nics/features/default.ini            |  1 +
 doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
 lib/ethdev/rte_ethdev.c                         |  1 +
 lib/ethdev/rte_ethdev.h                         |  7 +++++++
 5 files changed, 30 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index a96e12d155..2e2a9b1554 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.


+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same switch domain.
+
+* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:

 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 754184ddd4..ebeb4c1851 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c80..45bf5a3a10 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_

+- Memory usage of representors is huge when number of representor grows,
+  because PMD always allocate mbuf for each descriptor of Rx queue.
+  Polling the large number of ports brings more CPU load, cache miss and
+  latency. Shared Rx queue can be used to share Rx queue between PF and
+  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
+  is present in Rx offloading capability of device info. Setting the
+  offloading flag in device Rx mode or Rx queue configuration to enable
+  shared Rx queue. Polling any member port of shared Rx queue can return
+  packets of all ports in group, port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------

diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 9d95cd11e1..1361ff759a 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const struct {
        RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
        RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
        RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };

 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d2b27c351f..a578c9db9d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
        uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
        uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
        uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+       uint32_t shared_group; /**< Shared port group index in switch domain. */
        /**
         * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
         * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1373,6 +1374,12 @@ struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * Rx queue is shared among ports in same switch domain to save memory,
+ * avoid polling each port. Any port in group can be used to receive packets.
+ * Real source port number saved in mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000

 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
                                 DEV_RX_OFFLOAD_UDP_CKSUM | \
--
2.25.1
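
For readers following the documentation hunk in this patch, here is a minimal sketch of the "check the capability, then set the flag and group" flow it describes. It is only an illustration: the flag value and the shared_group field come from the patch, while struct demo_dev_info, struct demo_rxconf and demo_enable_shared_rxq() are hypothetical, trimmed-down stand-ins for the real ethdev structures so that the example compiles without DPDK headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag value as proposed in this patch. */
#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000ULL

/* Hypothetical stand-in for the capability part of rte_eth_dev_info. */
struct demo_dev_info {
	uint64_t rx_offload_capa;
};

/* Hypothetical stand-in for the fields of rte_eth_rxconf touched here. */
struct demo_rxconf {
	uint64_t offloads;
	uint32_t shared_group; /* group index within the switch domain */
};

/*
 * Request a shared Rx queue only if the device advertises the capability.
 * Returns true and updates rxconf on success; returns false and leaves
 * rxconf untouched when the capability bit is missing.
 */
bool
demo_enable_shared_rxq(const struct demo_dev_info *info,
		       struct demo_rxconf *rxconf, uint32_t group)
{
	assert(info != NULL && rxconf != NULL);

	if ((info->rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) == 0)
		return false;

	rxconf->offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rxconf->shared_group = group;
	return true;
}
```

In a real application the capability would presumably be read from the device info of each member port, and every port that should share the queue would be configured with the same group index and the same number of queues.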






^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28  9:35               ` Jerin Jacob
  2021-09-28 11:36                 ` Xueming(Steven) Li
  2021-09-28 11:37                 ` Xueming(Steven) Li
@ 2021-09-28 11:37                 ` Xueming(Steven) Li
  2021-09-28 12:58                   ` Jerin Jacob
  2021-09-28 12:59                 ` Xueming(Steven) Li
  3 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 11:37 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > 
> > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > 
> > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > for incoming packets. When the number of representors scales out in a
> > > > > > > > > switch domain, the memory consumption becomes significant. Most
> > > > > > > > > importantly, polling all ports leads to high cache miss rates, high
> > > > > > > > > latency and low throughput.
> > > > > > > > > 
> > > > > > > > > This patch introduces shared RX queue. Ports with the same
> > > > > > > > > configuration in a switch domain can share an RX queue set by specifying a sharing group.
> > > > > > > > > Polling any queue that uses the same shared RX queue receives packets from
> > > > > > > > > all member ports. The source port is identified by mbuf->port.
> > > > > > > > > 
> > > > > > > > > The number of queues of each port in a shared group should be identical. Queue
> > > > > > > > > index is
> > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > 
> > > > > > > > > A shared RX queue is supposed to be polled on the same thread.
> > > > > > > > > 
> > > > > > > > > Multiple groups are supported by group ID.
> > > > > > > 
> > > > > > > > Is this offload specific to the representor? If so, can the name be changed to be representor-specific?
> > > > > > > 
> > > > > > > Yes, the PF and representors in a switch domain can take advantage of it.
> > > > > > > 
> > > > > > > > If it is for a generic case, how will the flow ordering be maintained?
> > > > > > > 
> > > > > > > Not quite sure that I understood your question. The control path is
> > > > > > > almost the same as before: PF and representor ports are still needed, and rte flows are not impacted.
> > > > > > > Queues are still needed for each member port; descriptors (mbufs) will be
> > > > > > > supplied from the shared Rx queue in my PMD implementation.
> > > > > 
> > > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land in the same
> > > > > > receive queue. In that case, how is the flow order maintained for the respective receive queues?
> > > > > 
> > > > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > > > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > > > > Packets from the same source port could be grouped as a small burst to process; this will improve performance if traffic comes from
> > > > > a limited number of ports. I'll introduce a common API to do shared rxq forwarding, called with a packet handling callback, so it suits
> > > > > all forwarding engines. Will send patches soon.
> > > > 
> > > 
> > > All ports will put the packets into the same queue (shared queue), right? Does
> > > this mean only a single core will poll? What will happen if there are
> > > multiple cores polling, won't it cause problems?
> > > 
> > > And if this requires specific changes in the application, I am not sure about
> > > the solution. Can't this work in a way that is transparent to the application?
> > 
> > Discussed with Jerin: a new API is introduced in v3 2/8 that aggregates ports
> > in the same group into one new port. Users could schedule polling on the
> > aggregated port instead of all member ports.
> 
> The v3 still has testpmd changes in the fastpath, right? IMO, for this
> feature, we should not change the fastpath of the testpmd
> application. Instead, testpmd can use aggregated ports, probably as a
> separate fwd_engine, to show how to use this feature.

Good point to discuss :) There are two strategies for polling a shared
Rxq:
1. Polling each member port
   All forwarding engines can be reused and work as before.
   My testpmd patches are an effort in this direction.
   Does your PMD support this?
2. Polling the aggregated port
   Besides a forwarding engine, more work is needed to demo it.
   This is an optional API, not supported by my PMD yet.
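
To illustrate strategy 1, here is a rough sketch of how an application might split a burst received from one member port into per-source-port sub-bursts using mbuf->port, as described earlier in the thread. It is not the testpmd patch itself: struct demo_mbuf, demo_burst_cb, demo_shared_rxq_dispatch() and demo_count_cb() are hypothetical names, and the mbuf is reduced to the single field used here so that the example builds without DPDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only the field used here. */
struct demo_mbuf {
	uint16_t port; /* source port, filled in by the PMD */
};

/* Hypothetical handler: receives packets that all share one source port. */
typedef void (*demo_burst_cb)(uint16_t src_port, struct demo_mbuf **pkts,
			      uint16_t nb_pkts, void *ctx);

/*
 * Walk a burst obtained by polling one member port of a shared Rx queue
 * and pass each run of consecutive packets with the same source port to cb.
 * Arrival order is preserved, so per-port ordering is unchanged.
 * Returns the number of sub-bursts delivered.
 */
uint16_t
demo_shared_rxq_dispatch(struct demo_mbuf **pkts, uint16_t nb_pkts,
			 demo_burst_cb cb, void *ctx)
{
	uint16_t start = 0;
	uint16_t runs = 0;

	assert(cb != NULL);

	while (start < nb_pkts) {
		uint16_t end = (uint16_t)(start + 1);

		/* Extend the run while the source port stays the same. */
		while (end < nb_pkts && pkts[end]->port == pkts[start]->port)
			end++;

		cb(pkts[start]->port, &pkts[start], (uint16_t)(end - start), ctx);
		runs++;
		start = end;
	}
	return runs;
}

/* Example handler: ctx points to an array of per-port packet counters. */
void
demo_count_cb(uint16_t src_port, struct demo_mbuf **pkts, uint16_t nb_pkts,
	      void *ctx)
{
	uint32_t *counters = ctx;

	(void)pkts;
	counters[src_port] += nb_pkts;
}
```

Grouping only consecutive packets keeps the loop cheap and keeps packets in arrival order; a forwarding engine would replace demo_count_cb() with its own per-stream handling.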


> 
> > 
> > > 
> > > Overall, is this for optimizing memory for the port representors? If so, can't we
> > > have a port representor specific solution? Reducing the scope would reduce the
> > > complexity it brings.
> > > 
> > > > > If this offload is only useful for the representor case, can we make it specific to the representor case by changing its name and
> > > > > scope?
> > > > 
> > > > It works for both PF and representors in the same switch domain; for an application like OVS, few changes are needed.
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > ---
> > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > 
> > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > +
> > > > > > > > +Shared Rx queue
> > > > > > > > +---------------
> > > > > > > > +
> > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > +
> > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > +
> > > > > > > > +
> > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > 
> > > > > > > >  Packet type parsing
> > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > >  Queue start/stop     =
> > > > > > > >  Runtime Rx queue setup =
> > > > > > > >  Runtime Tx queue setup =
> > > > > > > > +Shared Rx queue      =
> > > > > > > >  Burst mode info      =
> > > > > > > >  Power mgmt address monitor =
> > > > > > > >  MTU update           =
> > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > 
> > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > `_
> > > > > > > > 
> > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > +grows,
> > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > > +miss and
> > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > +PF and
> > > > > > > > +  representors in same switch domain.
> > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > > +the
> > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > +enable
> > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > +return
> > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > +
> > > > > > > >  Basic SR-IOV
> > > > > > > >  ------------
> > > > > > > > 
> > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > >  };
> > > > > > > > 
> > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > >         /**
> > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > +/**
> > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > + */
> > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > 
> > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > > 
> > > 
> > 



^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 11:37                 ` Xueming(Steven) Li
@ 2021-09-28 12:58                   ` Jerin Jacob
  2021-09-28 13:25                     ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-09-28 12:58 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > >
> > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > for incoming packets. When the number of representors scales out in a
> > > > > > > > > > switch domain, the memory consumption becomes significant. Most
> > > > > > > > > > importantly, polling all ports leads to high cache miss rates, high
> > > > > > > > > > latency and low throughput.
> > > > > > > > > >
> > > > > > > > > > This patch introduces shared RX queue. Ports with the same
> > > > > > > > > > configuration in a switch domain can share an RX queue set by specifying a sharing group.
> > > > > > > > > > Polling any queue that uses the same shared RX queue receives packets from
> > > > > > > > > > all member ports. The source port is identified by mbuf->port.
> > > > > > > > > >
> > > > > > > > > > The number of queues of each port in a shared group should be identical. Queue
> > > > > > > > > > index is
> > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > >
> > > > > > > > > > A shared RX queue is supposed to be polled on the same thread.
> > > > > > > > > >
> > > > > > > > > > Multiple groups are supported by group ID.
> > > > > > > >
> > > > > > > > > Is this offload specific to the representor? If so, can the name be changed to be representor-specific?
> > > > > > > >
> > > > > > > > Yes, the PF and representors in a switch domain can take advantage of it.
> > > > > > > >
> > > > > > > > > If it is for a generic case, how will the flow ordering be maintained?
> > > > > > > >
> > > > > > > > Not quite sure that I understood your question. The control path is
> > > > > > > > almost the same as before: PF and representor ports are still needed, and rte flows are not impacted.
> > > > > > > > Queues are still needed for each member port; descriptors (mbufs) will be
> > > > > > > > supplied from the shared Rx queue in my PMD implementation.
> > > > > >
> > > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > >
> > > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > > all forwarding engine. Will sent patches soon.
> > > > >
> > > >
> > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > this means only single core will poll only, what will happen if there are
> > > > multiple cores polling, won't it cause problem?
> > > >
> > > > And if this requires specific changes in the application, I am not sure about
> > > > the solution, can't this work in a transparent way to the application?
> > >
> > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > in same group into one new port. Users could schedule polling on the
> > > aggregated port instead of all member ports.
> >
> > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > feature, we should not change fastpath of testpmd
> > application. Instead, testpmd can use aggregated ports probably as
> > separate fwd_engine to show how to use this feature.
>
> Good point to discuss :) There are two strategies to polling a shared
> Rxq:
> 1. polling each member port
>    All forwarding engines can be reused to work as before.
>    My testpmd patches are efforts towards this direction.
>    Does your PMD support this?

Unfortunately not. Moreover, every application would need to change
to support this model.

> 2. polling aggregated port
>    Besides forwarding engine, need more work to to demo it.
>    This is an optional API, not supported by my PMD yet.

We are thinking of implementing this in the PMD when it comes to it, i.e.
without application changes in the fastpath logic.

>
>
> >
> > >
> > > >
> > > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > > have a port representor specific solution, reducing scope can reduce the
> > > > complexity it brings?
> > > >
> > > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > > scope.
> > > > >
> > > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > ---
> > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > +
> > > > > > > > > +Shared Rx queue
> > > > > > > > > +---------------
> > > > > > > > > +
> > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > +
> > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > >
> > > > > > > > >  Packet type parsing
> > > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > >  Queue start/stop     =
> > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > +Shared Rx queue      =
> > > > > > > > >  Burst mode info      =
> > > > > > > > >  Power mgmt address monitor =
> > > > > > > > >  MTU update           =
> > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > >
> > > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > > `_
> > > > > > > > >
> > > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > > +grows,
> > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > > > +miss and
> > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > > +PF and
> > > > > > > > > +  representors in same switch domain.
> > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > > > +the
> > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > > +enable
> > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > > +return
> > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > +
> > > > > > > > >  Basic SR-IOV
> > > > > > > > >  ------------
> > > > > > > > >
> > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > */
> > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > > > + switch domain. */
> > > > > > > > >         /**
> > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > +/**
> > > > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > > > +memory,
> > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > + */
> > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > >
> > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > --
> > > > > > > > > 2.25.1
> > > > > > > > >
> > > >
> > >
>
>

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28  9:35               ` Jerin Jacob
                                   ` (2 preceding siblings ...)
  2021-09-28 11:37                 ` Xueming(Steven) Li
@ 2021-09-28 12:59                 ` Xueming(Steven) Li
  3 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 12:59 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > 
> > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > 
> > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > latency and low throughput.
> > > > > > > > 
> > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > 
> > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > index is
> > > > > > > > 1:1 mapped in shared group.
> > > > > > > > 
> > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > 
> > > > > > > > Multiple groups is supported by group ID.
> > > > > > > 
> > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > 
> > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > 
> > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > 
> > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > 
> > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > 
> > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > all forwarding engine. Will sent patches soon.
> > > > 
> > > 
> > > All ports will put the packets in to the same queue (share queue), right? Does
> > > this means only single core will poll only, what will happen if there are
> > > multiple cores polling, won't it cause problem?
> > > 
> > > And if this requires specific changes in the application, I am not sure about
> > > the solution, can't this work in a transparent way to the application?
> > 
> > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > in same group into one new port. Users could schedule polling on the
> > aggregated port instead of all member ports.
> 
> The v3 still has testpmd changes in fastpath. Right? IMO, For this
> feature, we should not change fastpath of testpmd
> application. Instead, testpmd can use aggregated ports probably as
> separate fwd_engine to show how to use this feature.

Good point to discuss :) There are two strategies for polling a shared
Rxq:
1. Polling each member port
   All forwarding engines can be reused and work as before.
   My testpmd patches are an effort in this direction.
   Does your PMD support this?
2. Polling the aggregated port
   Besides the forwarding engine, more work is needed to demo it.
   This is an optional API, not supported by my PMD yet.
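
To illustrate what strategy 1 asks of the application, here is a minimal
sketch, not the actual testpmd patch. It assumes that a burst received from
any member port may carry packets of other member ports, and that the
source is recovered from mbuf->port. The names struct pkt, struct stream
and dispatch_by_port() are hypothetical stand-ins so that the example
compiles without DPDK headers; a real application would operate on
struct rte_mbuf returned by rte_eth_rx_burst().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only the field used here. */
struct pkt {
	uint16_t port; /* source port, as mbuf->port would carry it */
};

/* Hypothetical stand-in for a per-port forwarding stream. */
struct stream {
	uint64_t rx_packets;
};

/*
 * Account each packet of a burst to the stream of its source port.
 * The burst may mix packets from several member ports of the group.
 * Returns the number of packets whose port has no stream.
 */
static unsigned int
dispatch_by_port(struct pkt *const *burst, unsigned int n,
		 struct stream *streams, unsigned int nb_streams)
{
	unsigned int i, unknown = 0;

	assert(burst != NULL && streams != NULL);
	for (i = 0; i < n; i++) {
		uint16_t port = burst[i]->port;

		if (port >= nb_streams) {
			unknown++; /* not a port this application handles */
			continue;
		}
		streams[port].rx_packets++;
	}
	return unknown;
}
```

The per-packet lookup is the only change relative to a non-shared queue,
where every packet of a burst could be assumed to belong to the polled port.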


> 
> > 
> > > 
> > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > have a port representor specific solution, reducing scope can reduce the
> > > complexity it brings?
> > > 
> > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > scope.
> > > > 
> > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > ---
> > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > +
> > > > > > > > +Shared Rx queue
> > > > > > > > +---------------
> > > > > > > > +
> > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > +
> > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > +
> > > > > > > > +
> > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > 
> > > > > > > >  Packet type parsing
> > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > >  Queue start/stop     =
> > > > > > > >  Runtime Rx queue setup =
> > > > > > > >  Runtime Tx queue setup =
> > > > > > > > +Shared Rx queue      =
> > > > > > > >  Burst mode info      =
> > > > > > > >  Power mgmt address monitor =
> > > > > > > >  MTU update           =
> > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > 
> > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > `_
> > > > > > > > 
> > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > +grows,
> > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > > +miss and
> > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > +PF and
> > > > > > > > +  representors in same switch domain.
> > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > > +the
> > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > +enable
> > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > +return
> > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > +
> > > > > > > >  Basic SR-IOV
> > > > > > > >  ------------
> > > > > > > > 
> > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > >  };
> > > > > > > > 
> > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > */
> > > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > > + switch domain. */
> > > > > > > >         /**
> > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > +/**
> > > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > > +memory,
> > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > + */
> > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > 
> > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > > 
> > > 
> > 


* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 12:58                   ` Jerin Jacob
@ 2021-09-28 13:25                     ` Xueming(Steven) Li
  2021-09-28 13:38                       ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 13:25 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > 
> > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > 
> > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > latency and low throughput.
> > > > > > > > > > 
> > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > 
> > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > index is
> > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > 
> > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > 
> > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > 
> > > > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > > > 
> > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > 
> > > > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > > > 
> > > > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > > > 
> > > > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > > > 
> > > > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > > > all forwarding engine. Will sent patches soon.
> > > > > > 
> > > > > 
> > > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > > this means only single core will poll only, what will happen if there are
> > > > > multiple cores polling, won't it cause problem?
> > > > > 
> > > > > And if this requires specific changes in the application, I am not sure about
> > > > > the solution, can't this work in a transparent way to the application?
> > > > 
> > > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > > in same group into one new port. Users could schedule polling on the
> > > > aggregated port instead of all member ports.
> > > 
> > > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > > feature, we should not change fastpath of testpmd
> > > application. Instead, testpmd can use aggregated ports probably as
> > > separate fwd_engine to show how to use this feature.
> > 
> > Good point to discuss :) There are two strategies to polling a shared
> > Rxq:
> > 1. polling each member port
> >    All forwarding engines can be reused to work as before.
> >    My testpmd patches are efforts towards this direction.
> >    Does your PMD support this?
> 
> Not unfortunately. More than that, every application needs to change
> to support this model.

Both strategies need the user application to resolve the port ID from the
mbuf and process the packet accordingly.
This one doesn't require an aggregated port, so the polling schedule
doesn't change.

> 
> > 2. polling aggregated port
> >    Besides forwarding engine, need more work to to demo it.
> >    This is an optional API, not supported by my PMD yet.
> 
> We are thinking of implementing this PMD when it comes to it, ie.
> without application change in fastpath
> logic.

The fastpath has to resolve the port ID anyway and forward according to its
logic. Forwarding engines need to adapt to support shared Rxq.
Fortunately, in testpmd, this can be done with an abstract API.

Let's defer part 2 until some PMD really supports it and it has been
tested. What do you think?
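
As a rough sketch of what such an abstract API could look like (an
assumption for illustration, not the code from the testpmd patches): split
a received burst into runs of consecutive packets that share the same
mbuf->port and hand each run to a callback supplied by the forwarding
engine. The names struct pkt, sub_burst_cb, foreach_port_run() and
count_packets() are hypothetical; struct pkt stands in for struct rte_mbuf
so the example compiles without DPDK headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only the field used here. */
struct pkt {
	uint16_t port; /* source port, as mbuf->port would carry it */
};

/* Callback invoked once per run of packets from the same source port. */
typedef void (*sub_burst_cb)(uint16_t port, struct pkt **pkts,
			     unsigned int n, void *ctx);

/*
 * Walk a burst and call cb for every run of consecutive packets that
 * share the same source port. Returns the number of runs found.
 */
static unsigned int
foreach_port_run(struct pkt **burst, unsigned int n,
		 sub_burst_cb cb, void *ctx)
{
	unsigned int start = 0, i, runs = 0;

	assert(cb != NULL);
	for (i = 1; i <= n; i++) {
		/* Close the current run at the end or on a port change. */
		if (i == n || burst[i]->port != burst[start]->port) {
			cb(burst[start]->port, &burst[start], i - start, ctx);
			runs++;
			start = i;
		}
	}
	return runs;
}

/* Example callback: ctx points to an array of per-port packet counters. */
static void
count_packets(uint16_t port, struct pkt **pkts, unsigned int n, void *ctx)
{
	uint64_t *counters = ctx;

	(void)pkts; /* a real engine would forward these packets */
	counters[port] += n;
}
```

When traffic comes from a limited number of ports, the runs are long and
the per-run callback cost is amortized over many packets; in the worst
case of fully interleaved traffic every run has length one.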

> 
> > 
> > 
> > > 
> > > > 
> > > > > 
> > > > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > > > have a port representor specific solution, reducing scope can reduce the
> > > > > complexity it brings?
> > > > > 
> > > > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > > > scope.
> > > > > > 
> > > > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > ---
> > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > +
> > > > > > > > > > +Shared Rx queue
> > > > > > > > > > +---------------
> > > > > > > > > > +
> > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > +
> > > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > +
> > > > > > > > > > +
> > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > 
> > > > > > > > > >  Packet type parsing
> > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > >  Queue start/stop     =
> > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > >  Burst mode info      =
> > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > >  MTU update           =
> > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > 
> > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > +
> > > > > > > > > >  Basic SR-IOV
> > > > > > > > > >  ------------
> > > > > > > > > > 
> > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > >  };
> > > > > > > > > > 
> > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > >         /**
> > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > +/**
> > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > + */
> > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > 
> > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > > 
> > > > > 
> > > > 
> > 
> > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 13:25                     ` Xueming(Steven) Li
@ 2021-09-28 13:38                       ` Jerin Jacob
  2021-09-28 13:59                         ` Ananyev, Konstantin
  2021-09-28 14:51                         ` Xueming(Steven) Li
  0 siblings, 2 replies; 266+ messages in thread
From: Jerin Jacob @ 2021-09-28 13:38 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > >
> > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > > latency and low throughput.
> > > > > > > > > > >
> > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > >
> > > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > > index is
> > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > >
> > > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > >
> > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > >
> > > > > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > > > >
> > > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > >
> > > > > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > > > >
> > > > > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > > > >
> > > > > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > > > >
> > > > > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > > > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > >
> > > > > >
> > > > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > > > this means only single core will poll only, what will happen if there are
> > > > > > multiple cores polling, won't it cause problem?
> > > > > >
> > > > > > And if this requires specific changes in the application, I am not sure about
> > > > > > the solution, can't this work in a transparent way to the application?
> > > > >
> > > > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > > > in same group into one new port. Users could schedule polling on the
> > > > > aggregated port instead of all member ports.
> > > >
> > > > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > > > feature, we should not change fastpath of testpmd
> > > > application. Instead, testpmd can use aggregated ports probably as
> > > > separate fwd_engine to show how to use this feature.
> > >
> > > Good point to discuss :) There are two strategies to polling a shared
> > > Rxq:
> > > 1. polling each member port
> > >    All forwarding engines can be reused to work as before.
> > >    My testpmd patches are efforts towards this direction.
> > >    Does your PMD support this?
> >
> > Not unfortunately. More than that, every application needs to change
> > to support this model.
>
> Both strategies need user application to resolve port ID from mbuf and
> process accordingly.
> This one doesn't demand aggregated port, no polling schedule change.

I was thinking the mbuf would be updated by the driver/aggregator port by the
time it reaches the application.

>
> >
> > > 2. polling aggregated port
> > >    Besides forwarding engine, need more work to demo it.
> > >    This is an optional API, not supported by my PMD yet.
> >
> > We are thinking of implementing this PMD when it comes to it, ie.
> > without application change in fastpath
> > logic.
>
> Fastpath has to resolve the port ID anyway and forward according to
> logic. Forwarding engines need to adapt to support shared Rxq.
> Fortunately, in testpmd, this can be done with an abstract API.
>
> Let's defer part 2 until some PMD really supports it and it is tested. What do
> you think?

We are not planning to use this feature, so either way is OK with me.
I leave it to the ethdev maintainers to decide between 1 and 2.

I do have a strong opinion against changing the testpmd basic forwarding engines
for this feature. I would like to keep them simple and fastpath-optimized, and would
like to add a separate forwarding engine as a means to verify this feature.
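
To make the idea concrete, here is a rough sketch of what such a separate
forwarding engine would have to do with a burst polled from a shared Rx queue:
split it into runs of packets that carry the same source port and hand each run
to a per-port handler. This is only an illustration under stated assumptions,
not testpmd or ethdev code. "struct pkt" is a hypothetical stand-in for
struct rte_mbuf (only its port field is modeled), and dispatch_by_port(),
sub_burst_cb and count_cb() are made-up names.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only the port field is modeled. */
struct pkt {
	uint16_t port; /* source port, as mbuf->port would carry it */
};

/* Hypothetical handler, called once per run of consecutive same-port packets. */
typedef void (*sub_burst_cb)(uint16_t port, struct pkt **pkts, uint16_t nb,
			     void *arg);

/*
 * Walk one burst received from a shared Rx queue and pass each run of
 * consecutive packets with the same source port to the callback.
 * Returns the number of sub-bursts dispatched.
 */
static uint16_t
dispatch_by_port(struct pkt **pkts, uint16_t nb_rx, sub_burst_cb cb, void *arg)
{
	uint16_t start = 0, runs = 0, i;

	for (i = 1; i <= nb_rx; i++) {
		/* Close the current run at the end of the burst or on a port change. */
		if (i == nb_rx || pkts[i]->port != pkts[start]->port) {
			cb(pkts[start]->port, &pkts[start],
			   (uint16_t)(i - start), arg);
			runs++;
			start = i;
		}
	}
	return runs;
}

/*
 * Example handler: accumulate a per-port packet count.
 * Assumes arg points to an array large enough to be indexed by port.
 */
static void
count_cb(uint16_t port, struct pkt **pkts, uint16_t nb, void *arg)
{
	unsigned int *counts = arg;

	(void)pkts;
	counts[port] += nb;
}
```

A real engine would replace count_cb() with its per-stream forwarding logic.
Keeping the split in one helper is what would let the existing engines stay
untouched.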




^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 13:38                       ` Jerin Jacob
@ 2021-09-28 13:59                         ` Ananyev, Konstantin
  2021-09-28 14:40                           ` Xueming(Steven) Li
  2021-09-28 14:51                         ` Xueming(Steven) Li
  1 sibling, 1 reply; 266+ messages in thread
From: Ananyev, Konstantin @ 2021-09-28 13:59 UTC (permalink / raw)
  To: Jerin Jacob, Xueming(Steven) Li
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, Yigit, Ferruh

> 
> On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > >
> > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > >
> > > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > > >
> > > > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > > > index is
> > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > >
> > > > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > > >
> > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > >
> > > > > > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > > > > >
> > > > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > > >
> > > > > > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > > > > >
> > > > > > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > > > > >
> > > > > > > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > > > > >
> > > > > > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > > > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > > > > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > > > > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > >
> > > > > > >
> > > > > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > > > > this means only single core will poll only, what will happen if there are
> > > > > > > multiple cores polling, won't it cause problem?
> > > > > > >
> > > > > > > And if this requires specific changes in the application, I am not sure about
> > > > > > > the solution, can't this work in a transparent way to the application?
> > > > > >
> > > > > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > > > > in same group into one new port. Users could schedule polling on the
> > > > > > aggregated port instead of all member ports.
> > > > >
> > > > > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > > > > feature, we should not change fastpath of testpmd
> > > > > application. Instead, testpmd can use aggregated ports probably as
> > > > > separate fwd_engine to show how to use this feature.
> > > >
> > > > Good point to discuss :) There are two strategies to polling a shared
> > > > Rxq:
> > > > 1. polling each member port
> > > >    All forwarding engines can be reused to work as before.
> > > >    My testpmd patches are efforts towards this direction.
> > > >    Does your PMD support this?
> > >
> > > Not unfortunately. More than that, every application needs to change
> > > to support this model.
> >
> > Both strategies need user application to resolve port ID from mbuf and
> > process accordingly.
> > This one doesn't demand aggregated port, no polling schedule change.
> 
> I was thinking the mbuf would be updated by the driver/aggregator port by the
> time it reaches the application.
> 
> >
> > >
> > > > 2. polling aggregated port
> > > >    Besides forwarding engine, need more work to demo it.
> > > >    This is an optional API, not supported by my PMD yet.
> > >
> > > We are thinking of implementing this PMD when it comes to it, ie.
> > > without application change in fastpath
> > > logic.
> >
> > Fastpath has to resolve the port ID anyway and forward according to
> > logic. Forwarding engines need to adapt to support shared Rxq.
> > Fortunately, in testpmd, this can be done with an abstract API.
> >
> > Let's defer part 2 until some PMD really supports it and it is tested. What do
> > you think?
> 
> We are not planning to use this feature, so either way is OK with me.
> I leave it to the ethdev maintainers to decide between 1 and 2.
> 
> I do have a strong opinion against changing the testpmd basic forwarding engines
> for this feature. I would like to keep them simple and fastpath-optimized, and would
> like to add a separate forwarding engine as a means to verify this feature.

+1 to that.
I don't think it is a 'common' feature.
So a separate FWD mode seems like the best choice to me.

> 
> 
> 
> >
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > > > > > have a port representor specific solution, reducing scope can reduce the
> > > > > > > complexity it brings?
> > > > > > >
> > > > > > > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > > > > > scope.
> > > > > > > >
> > > > > > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > +
> > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > +---------------
> > > > > > > > > > > > +
> > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > +
> > > > > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > > +
> > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > >
> > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > >
> > > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > >  ------------
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > >  };
> > > > > > > > > > > >
> > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > > >         /**
> > > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > +/**
> > > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > + */
> > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > >
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > --
> > > > > > > > > > > > 2.25.1
> > > > > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > > >
> >


* Re: [dpdk-dev] [PATCH v3 1/8] ethdev: introduce shared Rx queue
  2021-09-27 23:53     ` Ajit Khaparde
@ 2021-09-28 14:24       ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 14:24 UTC (permalink / raw)
  To: ajit.khaparde
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev,
	Slava Ovsiienko, ferruh.yigit, Lior Margalit

On Mon, 2021-09-27 at 16:53 -0700, Ajit Khaparde wrote:
> On Fri, Sep 17, 2021 at 1:02 AM Xueming Li <xuemingl@nvidia.com> wrote:
> > 
> > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > incoming packets. When number of representors scale out in a switch
> > domain, the memory consumption became significant. Most important,
> > polling all ports leads to high cache miss, high latency and low
> > throughput.
> > 
> > This patch introduces shared RX queue. Ports with same configuration in
> > a switch domain could share RX queue set by specifying sharing group.
> > Polling any queue using same shared RX queue receives packets from all
> > member ports. Source port is identified by mbuf->port.
> > 
> > Port queue number in a shared group should be identical. Queue index is
> > 1:1 mapped in shared group.
> > 
> > Share RX queue must be polled on single thread or core.
> > 
> > Multiple groups is supported by group ID.
> Can you clarify this a little more?

Thanks for the review!

With group IDs, the user can specify, for example:
 group 0: ports 0-3, 2 queues per port, polled on cores 0 and 1
 group 1: ports 4-127, 1 queue per port, polled on core 1
This is normally used for QoS and load balancing.
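
To make the grouping concrete, here is a minimal sketch of how an
application might derive the per-port Rx queue settings for the two
groups above. It is not code from the patch: struct rxconf_sketch is a
hypothetical stand-in for the two rte_eth_rxconf fields involved, and
the sketch_* helper names are made up for illustration. Only the bit
value of the offload flag is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit value taken from the patch under review. */
#define SKETCH_RX_OFFLOAD_SHARED_RXQ 0x00200000ULL

/* Hypothetical stand-in for the two rte_eth_rxconf fields involved. */
struct rxconf_sketch {
	uint64_t offloads;     /* per-queue Rx offload flags */
	uint32_t shared_group; /* shared port group index in switch domain */
};

/* Policy from the example: ports 0-3 -> group 0, ports 4-127 -> group 1. */
static uint32_t
sketch_group_of_port(uint16_t port_id)
{
	assert(port_id <= 127);
	return port_id <= 3 ? 0 : 1;
}

/* All member ports of a group use the same number of Rx queues. */
static uint16_t
sketch_nb_rxq_of_group(uint32_t group)
{
	return group == 0 ? 2 : 1;
}

/* Build the Rx queue configuration that one member port would use. */
static struct rxconf_sketch
sketch_make_rxconf(uint16_t port_id)
{
	struct rxconf_sketch conf = {
		.offloads = SKETCH_RX_OFFLOAD_SHARED_RXQ,
		.shared_group = sketch_group_of_port(port_id),
	};
	return conf;
}
```

In a real application the resulting values would be copied into the
rxconf passed at Rx queue setup time for each queue of the port; which
core polls which group remains an application scheduling decision.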

> 
> Apologies if this was already covered:
> * Can't we do this for Tx also?

Each Rx queue is pre-filled with mbufs, which consumes a huge number of
mbufs by default even though most of the queues are less active. Saving
memory is the primary motivation for this feature.
A Tx queue doesn't consume any mbufs until transmission starts, so there
is no strong reason to share Tx queues so far.
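
As a rough back-of-the-envelope illustration of the memory argument,
the helper below estimates the mbuf data pre-filled into Rx rings. It
assumes one mbuf per descriptor and that a shared group pre-fills a
single set of queues for all member ports; the function name and any
figures passed to it are hypothetical, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate bytes of mbuf data pre-filled into Rx rings.
 * nb_queue_sets: number of independently filled queue sets
 *                (one per port without sharing, one per group with sharing).
 */
static uint64_t
sketch_rx_prefill_bytes(uint32_t nb_queue_sets, uint32_t nb_rxq,
			uint32_t nb_desc, uint32_t mbuf_bytes)
{
	assert(nb_desc > 0 && mbuf_bytes > 0);
	return (uint64_t)nb_queue_sets * nb_rxq * nb_desc * mbuf_bytes;
}
```

For instance, with hypothetical values of 128 ports, 1 queue per port,
512 descriptors and 2048-byte buffers, compare
sketch_rx_prefill_bytes(128, 1, 512, 2048) without sharing against
sketch_rx_prefill_bytes(1, 1, 512, 2048) for a single shared group.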

> 
> Couple of nits inline. Thanks
> 
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > ---
> > Rx queue object could be used as shared Rx queue object, it's important
> > to clear all queue control callback api that using queue object:
> >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > ---
> >  doc/guides/nics/features.rst                    | 11 +++++++++++
> >  doc/guides/nics/features/default.ini            |  1 +
> >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> >  lib/ethdev/rte_ethdev.c                         |  1 +
> >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> >  5 files changed, 30 insertions(+)
> > 
> > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > index a96e12d155..2e2a9b1554 100644
> > --- a/doc/guides/nics/features.rst
> > +++ b/doc/guides/nics/features.rst
> > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > 
> > 
> > +.. _nic_features_shared_rx_queue:
> > +
> > +Shared Rx queue
> > +---------------
> > +
> > +Supports shared Rx queue for ports in same switch domain.
> > +
> > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > +* **[provides] mbuf**: ``mbuf.port``.
> > +
> > +
> >  .. _nic_features_packet_type_parsing:
> > 
> >  Packet type parsing
> > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > index 754184ddd4..ebeb4c1851 100644
> > --- a/doc/guides/nics/features/default.ini
> > +++ b/doc/guides/nics/features/default.ini
> > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> >  Queue start/stop     =
> >  Runtime Rx queue setup =
> >  Runtime Tx queue setup =
> > +Shared Rx queue      =
> >  Burst mode info      =
> >  Power mgmt address monitor =
> >  MTU update           =
> > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > index ff6aa91c80..45bf5a3a10 100644
> > --- a/doc/guides/prog_guide/switch_representation.rst
> > +++ b/doc/guides/prog_guide/switch_representation.rst
> > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> >  .. [1] `Ethernet switch device driver model (switchdev)
> >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > 
> > +- Memory usage of representors is huge when number of representor grows,
> > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > +  Polling the large number of ports brings more CPU load, cache miss and
> > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> 
> "in the same switch"
> 
> > +  is present in Rx offloading capability of device info. Setting the
> > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > +  shared Rx queue. Polling any member port of shared Rx queue can return
> "of the shared Rx queue.."
> 
> > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> 
> "ports in the group, "
> 
> > +
> >  Basic SR-IOV
> >  ------------
> > 
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > index a7c090ce79..b3a58d5e65 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -127,6 +127,7 @@ static const struct {
> >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> >  };
> > 
> >  #undef RTE_RX_OFFLOAD_BIT2STR
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > index d2b27c351f..a578c9db9d 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> >         /**
> >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > +/**
> > + * Rx queue is shared among ports in same switch domain to save memory,
> > + * avoid polling each port. Any port in group can be used to receive packets.
> 
> "Any port in the group can"
> 
> 
> > + * Real source port number saved in mbuf->port field.
> > + */
> > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > 
> >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > --
> > 2.33.0
> > 



* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 13:59                         ` Ananyev, Konstantin
@ 2021-09-28 14:40                           ` Xueming(Steven) Li
  2021-09-28 14:59                             ` Jerin Jacob
  2021-09-29  0:26                             ` Ananyev, Konstantin
  0 siblings, 2 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 14:40 UTC (permalink / raw)
  To: jerinjacobk, konstantin.ananyev
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > 
> > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > <xuemingl@nvidia.com> wrote:
> > > 
> > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > <xuemingl@nvidia.com> wrote:
> > > > > 
> > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > 
> > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > > 
> > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > Hi,
> > > > > > > > > > > 
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > 
> > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > with same
> > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > index is
> > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > 
> > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > representor?
> > > > > > > > > > > 
> > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > take advantage.
> > > > > > > > > > > 
> > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > 
> > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > The control path of is
> > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > implementation.
> > > > > > > > > > 
> > > > > > > > > > My question was if create a generic
> > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > ethdev receive queues land into the same
> > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > 
> > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > of shared rxq.
> > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > target fs.
> > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > performance if traffic come from
> > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > callback, so it suites for
> > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > (share queue), right? Does
> > > > > > > > this means only single core will poll only, what will
> > > > > > > > happen if there are
> > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > 
> > > > > > > > And if this requires specific changes in the
> > > > > > > > application, I am not sure about
> > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > the application?
> > > > > > > 
> > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > aggregate ports
> > > > > > > in same group into one new port. Users could schedule
> > > > > > > polling on the
> > > > > > > aggregated port instead of all member ports.
> > > > > > 
> > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > For this
> > > > > > feature, we should not change fastpath of testpmd
> > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > probably as
> > > > > > separate fwd_engine to show how to use this feature.
> > > > > 
> > > > > Good point to discuss :) There are two strategies to polling
> > > > > a shared
> > > > > Rxq:
> > > > > 1. polling each member port
> > > > >    All forwarding engines can be reused to work as before.
> > > > >    My testpmd patches are efforts towards this direction.
> > > > >    Does your PMD support this?
> > > > 
> > > > Not unfortunately. More than that, every application needs to
> > > > change
> > > > to support this model.
> > > 
> > > Both strategies need user application to resolve port ID from
> > > mbuf and
> > > process accordingly.
> > > This one doesn't demand aggregated port, no polling schedule
> > > change.
> > 
> > I was thinking, mbuf will be updated from driver/aggregator port as
> > when it
> > comes to application.
> > 
> > > 
> > > > 
> > > > > 2. polling aggregated port
> > > > >    Besides forwarding engine, need more work to to demo it.
> > > > >    This is an optional API, not supported by my PMD yet.
> > > > 
> > > > We are thinking of implementing this PMD when it comes to it,
> > > > ie.
> > > > without application change in fastpath
> > > > logic.
> > > 
> > > Fastpath have to resolve port ID anyway and forwarding according
> > > to
> > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > Fortunately, in testpmd, this can be done with an abstract API.
> > > 
> > > Let's defer part 2 until some PMD really support it and tested,
> > > how do
> > > you think?
> > 
> > We are not planning to use this feature so either way it is OK to
> > me.
> > I leave to ethdev maintainers decide between 1 vs 2.
> > 
> > I do have a strong opinion not changing the testpmd basic forward
> > engines
> > for this feature.I would like to keep it simple as fastpath
> > optimized and would
> > like to add a separate Forwarding engine as means to verify this
> > feature.
> 
> +1 to that.
> I don't think it a 'common' feature.
> So separate FWD mode seems like a best choice to me.

-1 :)
Our test team has an internal requirement to verify that all features,
such as packet content, RSS, VLAN, checksum and rte_flow, work on top of
a shared Rx queue. With the patch as it stands, I believe the impact has
been minimized.
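
For reference, the per-packet handling discussed earlier in this thread
(look up the source by mbuf->port, and process packets from the same
source port as a small burst) could look roughly like the sketch below.
It is a standalone illustration, not the testpmd patch: struct
pkt_sketch stands in for rte_mbuf and carries only the port field, and
all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_mbuf; only the source port matters here. */
struct pkt_sketch {
	uint16_t port;
};

/* Callback invoked once per run of packets sharing a source port. */
typedef void (*sketch_sub_burst_cb)(uint16_t src_port,
				    struct pkt_sketch **pkts,
				    uint16_t nb_pkts, void *ctx);

/*
 * Split one burst received from a shared Rx queue into sub-bursts of
 * consecutive packets from the same source port, preserving packet order.
 * Returns the number of sub-bursts handed to the callback.
 */
static uint16_t
sketch_demux_shared_burst(struct pkt_sketch **pkts, uint16_t nb_rx,
			  sketch_sub_burst_cb cb, void *ctx)
{
	uint16_t start = 0;
	uint16_t nb_runs = 0;

	assert(cb != NULL);
	while (start < nb_rx) {
		uint16_t end = start + 1;

		/* Extend the run while the source port stays the same. */
		while (end < nb_rx && pkts[end]->port == pkts[start]->port)
			end++;
		cb(pkts[start]->port, &pkts[start],
		   (uint16_t)(end - start), ctx);
		nb_runs++;
		start = end;
	}
	return nb_runs;
}

/* Example callback: accumulate a packet count per source port. */
static void
sketch_count_per_port(uint16_t src_port, struct pkt_sketch **pkts,
		      uint16_t nb_pkts, void *ctx)
{
	uint32_t *counters = ctx;

	(void)pkts;
	counters[src_port] += nb_pkts;
}
```

An existing forwarding routine could be plugged in as the callback, so
the per-engine logic stays unchanged and only the dispatch by source
port is new. When traffic comes from few ports, the runs are long and
the extra cost per packet is a single comparison.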

> 
> > 
> > 
> > 
> > > 
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > represontors? If so can't we
> > > > > > > > have a port representor specific solution, reducing
> > > > > > > > scope can reduce the
> > > > > > > > complexity it brings?
> > > > > > > > 
> > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > representor the case by changing its name and
> > > > > > > > > > scope.
> > > > > > > > > 
> > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > apply.
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +
> > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > 
> > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > > > 
> > > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > > +
> > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > 
> > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > >  };
> > > > > > > > > > > > > 
> > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > > > >         /**
> > > > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > > +/**
> > > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > > + */
> > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > >
> > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > > --
> > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > 
> > > > > 
> > > 



* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 13:38                       ` Jerin Jacob
  2021-09-28 13:59                         ` Ananyev, Konstantin
@ 2021-09-28 14:51                         ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 14:51 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, 2021-09-28 at 19:08 +0530, Jerin Jacob wrote:
> On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > 
> > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > > > 
> > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > 
> > > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > > > 
> > > > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > > > index is
> > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > 
> > > > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > > > 
> > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > 
> > > > > > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > > > > > 
> > > > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > > > 
> > > > > > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > > > > > 
> > > > > > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > > > > > 
> > > > > > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > > > > > 
> > > > > > > > I guess the question is about testpmd forward streams? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > > > > > Basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > > > > > > > Packets from the same source port could be grouped as a small burst to process; this will accelerate performance if traffic comes from
> > > > > > > > a limited number of ports. I'll introduce a common API to do shared rxq forwarding, called with a packet handling callback, so it suits
> > > > > > > > all forwarding engines. Will send patches soon.
> > > > > > > > 
> > > > > > > 
> > > > > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > > > > this means only single core will poll only, what will happen if there are
> > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > 
> > > > > > > And if this requires specific changes in the application, I am not sure about
> > > > > > > the solution, can't this work in a transparent way to the application?
> > > > > > 
> > > > > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > > > > in same group into one new port. Users could schedule polling on the
> > > > > > aggregated port instead of all member ports.
> > > > > 
> > > > > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > > > > feature, we should not change fastpath of testpmd
> > > > > application. Instead, testpmd can use aggregated ports probably as
> > > > > separate fwd_engine to show how to use this feature.
> > > > 
> > > > Good point to discuss :) There are two strategies to polling a shared
> > > > Rxq:
> > > > 1. polling each member port
> > > >    All forwarding engines can be reused to work as before.
> > > >    My testpmd patches are efforts towards this direction.
> > > >    Does your PMD support this?
> > > 
> > > Unfortunately not. More than that, every application needs to change
> > > to support this model.
> > 
> > Both strategies need the user application to resolve the port ID from the mbuf and
> > process packets accordingly.
> > This one doesn't require an aggregated port or any polling schedule change.
> 
> I was thinking the mbuf would be updated by the driver/aggregator port by the time it
> reaches the application.
> 
> > 
> > > 
> > > > 2. polling aggregated port
> > > >    Besides the forwarding engine, more work is needed to demo it.
> > > >    This is an optional API, not supported by my PMD yet.
> > > 
> > > We are thinking of implementing this PMD when it comes to it, ie.
> > > without application change in fastpath
> > > logic.
> > 
> > The fastpath has to resolve the port ID anyway and forward according to its
> > logic. Forwarding engines need to adapt to support shared Rxq.
> > Fortunately, in testpmd, this can be done with an abstract API.
> > 
> > Let's defer part 2 until some PMD really supports it and it has been tested. What do
> > you think?
> 
> We are not planning to use this feature, so either way is OK with me.
> I leave it to the ethdev maintainers to decide between 1 and 2.

Ideally a driver would support both, but a specific driver could choose
either one. Option 1 requires fewer application changes; option 2 gives
better performance at the cost of additional steps.
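
To illustrate the port resolution step that both options need, here is a
minimal sketch of grouping one received burst by source port. It is only
an illustration under stated assumptions: "struct pkt" is a stand-in for
struct rte_mbuf, and dispatch_by_source_port(), port_burst_cb and
count_per_port() are hypothetical names, not part of ethdev or of the
testpmd patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct rte_mbuf: only the source port field matters here. */
struct pkt {
	uint16_t port; /* source port ID, as mbuf->port would carry it */
};

/* Hypothetical handler: receives packets that all share one source port. */
typedef void (*port_burst_cb)(uint16_t port, struct pkt **pkts,
			      uint16_t nb, void *ctx);

/*
 * Walk one burst returned by polling a shared Rx queue and hand each run
 * of consecutive packets from the same source port to the callback.
 * Returns the number of sub-bursts dispatched.
 */
static uint16_t
dispatch_by_source_port(struct pkt **pkts, uint16_t nb_rx,
			port_burst_cb cb, void *ctx)
{
	uint16_t start = 0, runs = 0;

	assert(cb != NULL);
	while (start < nb_rx) {
		uint16_t end = start + 1;

		/* Extend the run while the next packet has the same port. */
		while (end < nb_rx && pkts[end]->port == pkts[start]->port)
			end++;
		cb(pkts[start]->port, &pkts[start],
		   (uint16_t)(end - start), ctx);
		runs++;
		start = end;
	}
	return runs;
}

/* Example handler: count packets per source port into a uint32_t array. */
static void
count_per_port(uint16_t port, struct pkt **pkts, uint16_t nb, void *ctx)
{
	uint32_t *counters = ctx;

	(void)pkts;
	counters[port] += nb;
}
```

When traffic comes from only a few ports, most of a burst collapses into
one or two callback invocations, which is where the grouping is expected
to help. A real forwarding engine would replace count_per_port() with its
own per-stream handling.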

> 
> I do have a strong opinion against changing the testpmd basic forward engines
> for this feature. I would like to keep them simple and fastpath-optimized, and would
> like to add a separate forwarding engine as a means to verify this feature.
> 
> 
> 
> > 
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Overall, is this for optimizing memory for the port representors? If so, can't we
> > > > > > > have a port representor specific solution? Reducing scope can reduce the
> > > > > > > complexity it brings.
> > > > > > > 
> > > > > > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > > > > > scope.
> > > > > > > > 
> > > > > > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > +
> > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > +---------------
> > > > > > > > > > > > +
> > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > +
> > > > > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > > +
> > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > 
> > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > 
> > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > > 
> > > > > > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > > > > > +grows,
> > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > > > > > > +miss and
> > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > > > > > +PF and
> > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > > > > > > +the
> > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > > > > > +enable
> > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > > > > > +return
> > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > >  ------------
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > >  };
> > > > > > > > > > > > 
> > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > > > */
> > > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > >         /**
> > > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > +/**
> > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > > > > > > +memory,
> > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > + */
> > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > 
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > --
> > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > 
> > > > 
> > 



* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 14:40                           ` Xueming(Steven) Li
@ 2021-09-28 14:59                             ` Jerin Jacob
  2021-09-29  7:41                               ` Xueming(Steven) Li
  2021-09-29  0:26                             ` Ananyev, Konstantin
  1 sibling, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-09-28 14:59 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: konstantin.ananyev, NBU-Contact-Thomas Monjalon,
	andrew.rybchenko, dev, ferruh.yigit

On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > >
> > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > <xuemingl@nvidia.com> wrote:
> > > >
> > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > Monjalon
> > > <thomas@monjalon.net>;
> > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > representor?
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > take advantage.
> > > > > > > > > > > >
> > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > >
> > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > The control path of is
> > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > implementation.
> > > > > > > > > > >
> > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > ethdev receive queues land into
> > > the same
> > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > >
> > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > of shared rxq.
> > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > target fs.
> > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > performance if traffic
> > > come from
> > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > callback, so it suites for
> > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > (share queue), right? Does
> > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > happen if there are
> > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > >
> > > > > > > > > And if this requires specific changes in the
> > > > > > > > > application, I am not sure about
> > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > the application?
> > > > > > > >
> > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > aggregate ports
> > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > polling on the
> > > > > > > > aggregated port instead of all member ports.
> > > > > > >
> > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > For this
> > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > probably as
> > > > > > > separate fwd_engine to show how to use this feature.
> > > > > >
> > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > a shared
> > > > > > Rxq:
> > > > > > 1. polling each member port
> > > > > >    All forwarding engines can be reused to work as before.
> > > > > >    My testpmd patches are efforts towards this direction.
> > > > > >    Does your PMD support this?
> > > > >
> > > > > Not unfortunately. More than that, every application needs to
> > > > > change
> > > > > to support this model.
> > > >
> > > > Both strategies need user application to resolve port ID from
> > > > mbuf and
> > > > process accordingly.
> > > > This one doesn't demand aggregated port, no polling schedule
> > > > change.
> > >
> > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > when it
> > > comes to application.
> > >
> > > >
> > > > >
> > > > > > 2. polling aggregated port
> > > > > >    Besides forwarding engine, need more work to to demo it.
> > > > > >    This is an optional API, not supported by my PMD yet.
> > > > >
> > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > ie.
> > > > > without application change in fastpath
> > > > > logic.
> > > >
> > > > Fastpath have to resolve port ID anyway and forwarding according
> > > > to
> > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > >
> > > > Let's defer part 2 until some PMD really support it and tested,
> > > > how do
> > > > you think?
> > >
> > > We are not planning to use this feature so either way it is OK to
> > > me.
> > > I leave to ethdev maintainers decide between 1 vs 2.
> > >
> > > I do have a strong opinion against changing the testpmd basic forward
> > > engines
> > > for this feature. I would like to keep them simple and fastpath-optimized,
> > > and would
> > > like to add a separate forwarding engine as a means to verify this
> > > feature.
> >
> > +1 to that.
> > I don't think it is a 'common' feature.
> > So a separate FWD mode seems like the best choice to me.
>
> -1 :)
> There was an internal requirement from the test team: they need to verify

Internal QA requirements may not be the driving factor :-)

> all features such as packet content, RSS, VLAN, checksum, rte_flow...
> work with shared Rx queue. Based on the patch, I believe the
> impact has been minimized.


>
> >
> > >
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > represontors? If so can't we
> > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > scope can reduce the
> > > > > > > > > complexity it brings?
> > > > > > > > >
> > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > representor the case by changing its
> > > name and
> > > > > > > > > > > scope.
> > > > > > > > > >
> > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > apply.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > +  because PMD always allocate mbuf for each
> > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > +  Polling the large number of ports brings
> > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > +  is present in Rx offloading capability of
> > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > +  shared Rx queue. Polling any member port
> > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > +  packets of all ports in group, port ID is
> > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > M),
> > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > >  };
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > */
> > > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > >          * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > >port field.
> > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > >                                  DEV_RX_OFFLO
> > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > >
>


* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 14:40                           ` Xueming(Steven) Li
  2021-09-28 14:59                             ` Jerin Jacob
@ 2021-09-29  0:26                             ` Ananyev, Konstantin
  2021-09-29  8:40                               ` Xueming(Steven) Li
  2021-09-29  9:12                               ` Xueming(Steven) Li
  1 sibling, 2 replies; 266+ messages in thread
From: Ananyev, Konstantin @ 2021-09-29  0:26 UTC (permalink / raw)
  To: Xueming(Steven) Li, jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, Yigit, Ferruh


> > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > representor?
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > take advantage.
> > > > > > > > > > > >
> > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > >
> > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > The control path of is
> > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > implementation.
> > > > > > > > > > >
> > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > ethdev receive queues land into
> > > the same
> > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > >
> > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > of shared rxq.
> > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > target fs.
> > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > performance if traffic
> > > come from
> > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > callback, so it suites for
> > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > (share queue), right? Does
> > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > happen if there are
> > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > >
> > > > > > > > > And if this requires specific changes in the
> > > > > > > > > application, I am not sure about
> > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > the application?
> > > > > > > >
> > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > aggregate ports
> > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > polling on the
> > > > > > > > aggregated port instead of all member ports.
> > > > > > >
> > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > For this
> > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > probably as
> > > > > > > separate fwd_engine to show how to use this feature.
> > > > > >
> > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > a shared
> > > > > > Rxq:
> > > > > > 1. polling each member port
> > > > > >    All forwarding engines can be reused to work as before.
> > > > > >    My testpmd patches are efforts towards this direction.
> > > > > >    Does your PMD support this?
> > > > >
> > > > > Not unfortunately. More than that, every application needs to
> > > > > change
> > > > > to support this model.
> > > >
> > > > Both strategies need user application to resolve port ID from
> > > > mbuf and
> > > > process accordingly.
> > > > This one doesn't demand aggregated port, no polling schedule
> > > > change.
> > >
> > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > when it
> > > comes to application.
> > >
> > > >
> > > > >
> > > > > > 2. polling aggregated port
> > > > > >    Besides forwarding engine, need more work to to demo it.
> > > > > >    This is an optional API, not supported by my PMD yet.
> > > > >
> > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > ie.
> > > > > without application change in fastpath
> > > > > logic.
> > > >
> > > > Fastpath have to resolve port ID anyway and forwarding according
> > > > to
> > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > >
> > > > Let's defer part 2 until some PMD really support it and tested,
> > > > how do
> > > > you think?
> > >
> > > We are not planning to use this feature so either way it is OK to
> > > me.
> > > I leave to ethdev maintainers decide between 1 vs 2.
> > >
> > > I do have a strong opinion not changing the testpmd basic forward
> > > engines
> > > for this feature.I would like to keep it simple as fastpath
> > > optimized and would
> > > like to add a separate Forwarding engine as means to verify this
> > > feature.
> >
> > +1 to that.
> > I don't think it a 'common' feature.
> > So separate FWD mode seems like a best choice to me.
> 
> -1 :)
> There was some internal requirement from test team, they need to verify
> all features like packet content, rss, vlan, checksum, rte_flow... to
> be working based on shared rx queue.

Then I suppose you'll need to write a really comprehensive fwd-engine
to satisfy your test team :)
Speaking seriously, I still don't understand why you need all
available fwd-engines to verify this feature.
From what I understand, the main purpose of your changes to testpmd is to
allow forwarding a packet through a different fwd_stream (TX through a different HW queue).
In theory, if implemented in a generic and extendable way, that
might be a useful add-on to testpmd fwd functionality.
But the current implementation looks very case specific.
And as I don't think it is a common case, I don't see much point in polluting
the basic fwd cases with it.

BTW, as a side note, the code below looks bogus to me:
+void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)	
+{
+	uint16_t i, nb_fs_rx = 1, port;
+
+	/* Locate real source fs according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);

On the last iteration (i == nb_rx - 1), pkts_burst[i + 1] is read past the
last received packet, possibly beyond the array boundaries,
and you also ask the CPU to prefetch some unknown and possibly invalid address.

> Based on the patch, I believe the
> impact has been minimized.
> 
> >
> > >
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > represontors? If so can't we
> > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > scope can reduce the
> > > > > > > > > complexity it brings?
> > > > > > > > >
> > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > representor the case by changing its
> > > name and
> > > > > > > > > > > scope.
> > > > > > > > > >
> > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > apply.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > | 11 +++++++++++
> > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > |  1 +
> > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > |  1 +
> > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > |  7 +++++++
> > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > +  because PMD always allocate mbuf for each
> > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > +  Polling the large number of ports brings
> > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > +  is present in Rx offloading capability of
> > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > +  shared Rx queue. Polling any member port
> > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > +  packets of all ports in group, port ID is
> > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > M),
> > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > >  };
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > */
> > > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > >          * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > >port field.
> > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > >                                  DEV_RX_OFFLO
> > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > >



* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 14:59                             ` Jerin Jacob
@ 2021-09-29  7:41                               ` Xueming(Steven) Li
  2021-09-29  8:05                                 ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-29  7:41 UTC (permalink / raw)
  To: jerinjacobk, Raslan Darawsheh
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko,
	konstantin.ananyev, dev, ferruh.yigit

On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > 
> > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > <xuemingl@nvidia.com> wrote:
> > > > > 
> > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > 
> > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > 
> > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > Monjalon
> > > > <thomas@monjalon.net>;
> > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > 
> > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > implementation.
> > > > > > > > > > > > 
> > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > ethdev receive queues land into
> > > > the same
> > > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > 
> > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > of shared rxq.
> > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > target fs.
> > > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > performance if traffic
> > > > come from
> > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > (share queue), right? Does
> > > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > > happen if there are
> > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > 
> > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > application, I am not sure about
> > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > the application?
> > > > > > > > > 
> > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > aggregate ports
> > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > polling on the
> > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > 
> > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > For this
> > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > probably as
> > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > 
> > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > a shared
> > > > > > > Rxq:
> > > > > > > 1. polling each member port
> > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > >    Does your PMD support this?
> > > > > > 
> > > > > > Not unfortunately. More than that, every application needs to
> > > > > > change
> > > > > > to support this model.
> > > > > 
> > > > > Both strategies need user application to resolve port ID from
> > > > > mbuf and
> > > > > process accordingly.
> > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > change.
> > > > 
> > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > when it
> > > > comes to application.
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 2. polling aggregated port
> > > > > > >    Besides forwarding engine, need more work to to demo it.
> > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > 
> > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > ie.
> > > > > > without application change in fastpath
> > > > > > logic.
> > > > > 
> > > > > The fastpath has to resolve the port ID anyway and forward
> > > > > according to its logic. Forwarding engines need to adapt to
> > > > > support shared Rxq.
> > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > 
> > > > > Let's defer part 2 until some PMD really supports it and it has
> > > > > been tested. What do you think?
> > > > 
> > > > We are not planning to use this feature so either way it is OK to
> > > > me.
> > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > 
> > > > I do have a strong opinion against changing the testpmd basic
> > > > forward engines for this feature. I would like to keep them simple
> > > > and fastpath-optimized, and would like to add a separate forwarding
> > > > engine as a means to verify this feature.
> > > 
> > > +1 to that.
> > > I don't think it is a 'common' feature.
> > > So a separate FWD mode seems like the best choice to me.
> > 
> > -1 :)
> > There was some internal requirement from test team, they need to verify
> 
> Internal QA requirements may not be the driving factor :-)

It will be a test requirement that any driver has to face, not just an
internal one. The performance difference is almost zero in v3: only an
"unlikely if" test on each burst. Shared Rxq is a low-level feature, so
reusing all current FWD engines to verify the driver's high-level
features is important IMHO.
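
For illustration, here is a minimal sketch of the member-port polling
strategy discussed above: resolve the source from mbuf->port and hand
runs of same-port packets to a handling callback. It is not the testpmd
patch. struct pkt is a hypothetical stand-in for struct rte_mbuf, and
shared_rxq_dispatch(), run_cb_t, count_cb() and MAX_PORTS are names
invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 8 /* illustrative limit, not a DPDK constant */

/* Hypothetical stand-in for struct rte_mbuf; only the source port is modeled. */
struct pkt {
	uint16_t port; /* plays the role of mbuf->port */
};

/* Hypothetical handler invoked once per run of same-port packets. */
typedef void (*run_cb_t)(uint16_t src_port, struct pkt **pkts,
			 uint16_t nb_pkts, void *arg);

/*
 * Split a burst into runs of consecutive packets that share a source
 * port and pass each run to cb. Returns the number of runs.
 */
static unsigned int
shared_rxq_dispatch(struct pkt **pkts, uint16_t nb_rx, run_cb_t cb, void *arg)
{
	unsigned int i, start = 0, runs = 0;

	assert(cb != NULL);
	for (i = 1; i <= nb_rx; i++) {
		/* Extend the current run while the source port is unchanged. */
		if (i < nb_rx && pkts[i]->port == pkts[start]->port)
			continue;
		cb(pkts[start]->port, &pkts[start], (uint16_t)(i - start), arg);
		runs++;
		start = i;
	}
	return runs;
}

/* Example handler: count packets per source port into a uint32_t array. */
static void
count_cb(uint16_t src_port, struct pkt **pkts, uint16_t nb_pkts, void *arg)
{
	uint32_t *counts = arg;

	(void)pkts;
	if (src_port < MAX_PORTS)
		counts[src_port] += nb_pkts;
}
```

When a burst comes from a single port, the callback runs once for the
whole burst, so an existing forwarding engine body could in principle be
reused as the callback.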

> 
> > all features like packet content, rss, vlan, checksum, rte_flow... to
> > be working based on shared rx queue. Based on the patch, I believe the
> > impact has been minimized.
> 
> 
> > 
> > > 
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > representors? If so can't we
> > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > scope can reduce the
> > > > > > > > > > complexity it brings?
> > > > > > > > > > 
> > > > > > > > > > > > If this offload is only useful for the representor
> > > > > > > > > > > > case, can we make this offload specific to
> > > > > > > > > > > > the representor case by changing its name and
> > > > > > > > > > > > scope?
> > > > > > > > > > > 
> > > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > apply.
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > >  7 +++++++
> > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each
> > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > +  Polling the large number of ports brings
> > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > +  is present in Rx offloading capability of
> > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port
> > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > +  packets of all ports in group, port ID is
> > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > > >          * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > >                                  DEV_RX_OFFLO
> > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > 
> > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29  7:41                               ` Xueming(Steven) Li
@ 2021-09-29  8:05                                 ` Jerin Jacob
  2021-10-08  8:26                                   ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-09-29  8:05 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: Raslan Darawsheh, NBU-Contact-Thomas Monjalon, andrew.rybchenko,
	konstantin.ananyev, dev, ferruh.yigit

On Wed, Sep 29, 2021 at 1:11 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> > On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > >
> > > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > > <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > > Monjalon
> > > > > <thomas@monjalon.net>;
> > > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > ethdev receive queues land into
> > > > > the same
> > > > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > >
> > > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > target fs.
> > > > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > > > small burst to process, this will accelerate the
> > > > > > > > > > > > performance if traffic comes from
> > > > > > > > > > > > limited ports. I'll introduce some common API to do
> > > > > > > > > > > > shared rxq forwarding, called with a packet handling
> > > > > > > > > > > > callback, so it suits
> > > > > > > > > > > > all forwarding engines. Will send patches soon.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > > > happen if there are
> > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > >
> > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > > the application?
> > > > > > > > > >
> > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > > aggregate ports
> > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > polling on the
> > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > >
> > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > > For this
> > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > probably as
> > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > >
> > > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > > a shared
> > > > > > > > Rxq:
> > > > > > > > 1. polling each member port
> > > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > >    Does your PMD support this?
> > > > > > >
> > > > > > > Not unfortunately. More than that, every application needs to
> > > > > > > change
> > > > > > > to support this model.
> > > > > >
> > > > > > Both strategies need user application to resolve port ID from
> > > > > > mbuf and
> > > > > > process accordingly.
> > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > change.
> > > > >
> > > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > > when it
> > > > > comes to application.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > 2. polling aggregated port
> > > > > > > >    Besides the forwarding engine, more work is needed to demo it.
> > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > >
> > > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > > ie.
> > > > > > > without application change in fastpath
> > > > > > > logic.
> > > > > >
> > > > > > The fastpath has to resolve the port ID anyway and forward
> > > > > > according to its logic. Forwarding engines need to adapt to
> > > > > > support shared Rxq.
> > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > >
> > > > > > Let's defer part 2 until some PMD really supports it and it has
> > > > > > been tested. What do you think?
> > > > >
> > > > > We are not planning to use this feature so either way it is OK to
> > > > > me.
> > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > >
> > > > > I do have a strong opinion against changing the testpmd basic
> > > > > forward engines for this feature. I would like to keep them simple
> > > > > and fastpath-optimized, and would like to add a separate forwarding
> > > > > engine as a means to verify this feature.
> > > >
> > > > +1 to that.
> > > > I don't think it is a 'common' feature.
> > > > So a separate FWD mode seems like the best choice to me.
> > >
> > > -1 :)
> > > There was some internal requirement from test team, they need to verify
> >


> > Internal QA requirements may not be the driving factor :-)
>
> It will be a test requirement that any driver has to face, not just an
> internal one. The performance difference is almost zero in v3: only an
> "unlikely if" test on each burst. Shared Rxq is a low-level feature, so
> reusing all current FWD engines to verify the driver's high-level
> features is important IMHO.

In addition to the extra if check, the real concern is polluting the
common forward engines for an uncommon feature.

If you really want to reuse the existing application without any
application change, I think you need to hook this into eventdev:
http://code.dpdk.org/dpdk/latest/source/lib/eventdev/rte_eventdev.h#L34

Eventdev drivers do this in addition to other features, i.e. an eventdev
has ports (which are a kind of aggregator) and can receive packets from
any queue, with mbuf->port set to the port that actually received them.
In terms of mapping:
- the event queue will be a dummy; it will be the same as the Rx queue
- the Rx adapter will also be a dummy
- event ports aggregate multiple queues and connect to a core
- on receiving a packet, mbuf->port will be the actual port on which it
  was received
app/test-eventdev is written to use this model.
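
To make the mapping above concrete, here is a toy model of one event
port aggregating several Rx queues, where each dequeued packet still
carries its real source port. It is a sketch only and does not use the
eventdev API: struct pkt, struct rxq, struct ev_port, rxq_push() and
ev_port_dequeue() are hypothetical names, and the round-robin visiting
order is an assumption of this example, not how an eventdev driver
schedules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 4 /* illustrative ring size */

/* Hypothetical stand-in for struct rte_mbuf. */
struct pkt {
	uint16_t port; /* plays the role of mbuf->port */
};

/* Toy Rx queue: a small ring of packet pointers. */
struct rxq {
	struct pkt *ring[QUEUE_DEPTH];
	uint16_t head;
	uint16_t count;
};

/* Toy event port: aggregates several Rx queues behind one poll point. */
struct ev_port {
	struct rxq **queues;
	uint16_t nb_queues;
	uint16_t next; /* round-robin cursor */
};

/* Append a packet to a queue; returns 0 on success, -1 if the ring is full. */
static int
rxq_push(struct rxq *q, struct pkt *p)
{
	if (q->count == QUEUE_DEPTH)
		return -1;
	q->ring[(q->head + q->count) % QUEUE_DEPTH] = p;
	q->count++;
	return 0;
}

/*
 * Dequeue up to max packets, visiting the linked queues round-robin.
 * Stops early once every queue has been found empty in a row.
 */
static uint16_t
ev_port_dequeue(struct ev_port *ep, struct pkt **out, uint16_t max)
{
	uint16_t n = 0, idle = 0;

	assert(ep != NULL && out != NULL);
	while (n < max && idle < ep->nb_queues) {
		struct rxq *q = ep->queues[ep->next];

		ep->next = (uint16_t)((ep->next + 1) % ep->nb_queues);
		if (q->count == 0) {
			idle++;
			continue;
		}
		out[n++] = q->ring[q->head];
		q->head = (uint16_t)((q->head + 1) % QUEUE_DEPTH);
		q->count--;
		idle = 0;
	}
	return n;
}
```

The worker polls only the event port; the application logic after the
dequeue keys off the port field of each packet, as it would off
mbuf->port.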



>
> >
> > > all features like packet content, rss, vlan, checksum, rte_flow... to
> > > be working based on shared rx queue. Based on the patch, I believe the
> > > impact has been minimized.
> >
> >
> > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > representors? If so can't we
> > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > scope can reduce the
> > > > > > > > > > > complexity it brings?
> > > > > > > > > > >
> > > > > > > > > > > > > If this offload is only useful for the representor
> > > > > > > > > > > > > case, can we make this offload specific to
> > > > > > > > > > > > > the representor case by changing its name and
> > > > > > > > > > > > > scope?
> > > > > > > > > > > >
> > > > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > > apply.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > >  7 +++++++
> > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each
> > > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > > +  Polling the large number of ports brings
> > > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > +  is present in Rx offloading capability of
> > > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port
> > > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is
> > > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > > > >          * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > >                                  DEV_RX_OFFLO
> > > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> > >
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29  0:26                             ` Ananyev, Konstantin
@ 2021-09-29  8:40                               ` Xueming(Steven) Li
  2021-09-29 10:20                                 ` Ananyev, Konstantin
  2021-09-29  9:12                               ` Xueming(Steven) Li
  1 sibling, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-29  8:40 UTC (permalink / raw)
  To: jerinjacobk, Raslan Darawsheh, konstantin.ananyev
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > could
> > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > question.
> > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > port
> > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > implementation.
> > > > > > > > > > > > 
> > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > ethdev receive queues land into
> > > > the same
> > > > > > > > > > > > receive queue, In that case, how the flow order
> > > > > > > > > > > > is
> > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > 
> > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > The
> > > > > > > > > > > forwarding logic has to be changed slightly in
> > > > > > > > > > > case
> > > > > > > > > > > of shared rxq.
> > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > lookup
> > > > > > > > > > > source stream according to mbuf->port, forwarding
> > > > > > > > > > > to
> > > > > > > > > > > target fs.
> > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > a
> > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > performance if traffic
> > > > come from
> > > > > > > > > > > limited ports. I'll introduce some common api to
> > > > > > > > > > > do
> > > > > > > > > > > shard rxq forwarding, call it with packets
> > > > > > > > > > > handling
> > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > (share queue), right? Does
> > > > > > > > > > this means only single core will poll only, what
> > > > > > > > > > will
> > > > > > > > > > happen if there are
> > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > 
> > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > application, I am not sure about
> > > > > > > > > > the solution, can't this work in a transparent way
> > > > > > > > > > to
> > > > > > > > > > the application?
> > > > > > > > > 
> > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8
> > > > > > > > > that
> > > > > > > > > aggregate ports
> > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > polling on the
> > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > 
> > > > > > > > The v3 still has testpmd changes in fastpath. Right?
> > > > > > > > IMO,
> > > > > > > > For this
> > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > probably as
> > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > 
> > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > polling
> > > > > > > a shared
> > > > > > > Rxq:
> > > > > > > 1. polling each member port
> > > > > > >    All forwarding engines can be reused to work as
> > > > > > > before.
> > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > >    Does your PMD support this?
> > > > > > 
> > > > > > Not unfortunately. More than that, every application needs
> > > > > > to
> > > > > > change
> > > > > > to support this model.
> > > > > 
> > > > > Both strategies need user application to resolve port ID from
> > > > > mbuf and
> > > > > process accordingly.
> > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > change.
> > > > 
> > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > port as
> > > > when it
> > > > comes to application.
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 2. polling aggregated port
> > > > > > >    Besides forwarding engine, need more work to to demo
> > > > > > > it.
> > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > 
> > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > it,
> > > > > > ie.
> > > > > > without application change in fastpath
> > > > > > logic.
> > > > > 
> > > > > Fastpath have to resolve port ID anyway and forwarding
> > > > > according
> > > > > to
> > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > API.
> > > > > 
> > > > > Let's defer part 2 until some PMD really support it and
> > > > > tested,
> > > > > how do
> > > > > you think?
> > > > 
> > > > We are not planning to use this feature so either way it is OK
> > > > to
> > > > me.
> > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > 
> > > > I do have a strong opinion not changing the testpmd basic
> > > > forward
> > > > engines
> > > > for this feature.I would like to keep it simple as fastpath
> > > > optimized and would
> > > > like to add a separate Forwarding engine as means to verify
> > > > this
> > > > feature.
> > > 
> > > +1 to that.
> > > I don't think it a 'common' feature.
> > > So separate FWD mode seems like a best choice to me.
> > 
> > -1 :)
> > There was some internal requirement from test team, they need to
> > verify
> > all features like packet content, rss, vlan, checksum, rte_flow...
> > to
> > be working based on shared rx queue.
> 
> Then I suppose you'll need to write really comprehensive fwd-engine 
> to satisfy your test team :)
> Speaking seriously, I still don't understand why do you need all
> available fwd-engines to verify this feature.

Shared Rxq is a low-level feature, so we need to make sure the driver's
higher-level features still work on top of it. A fwd-engine like csum
checks the input packet and enables L3/L4 checksum and tunnel offloads
accordingly; other engines do their own feature verification. With all
engines supported seamlessly, the existing test automation can be reused.

> From what I understand, main purpose of your changes to test-pmd:
> allow to fwd packet though different fwd_stream (TX through different
> HW queue).

Yes, each mbuf in a burst may come from a different port, and testpmd's
current fwd-engines rely heavily on the source forwarding stream. That's
why the patch divides the burst result into sub-bursts and uses the
original fwd-engine callback to handle each one. How packets are handled
is not changed.
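
To illustrate the sub-burst idea, here is a minimal standalone sketch; it
is not the code from the patch. It assumes a stand-in `struct pkt` that
carries only the source port (as mbuf->port does), and the names
`split_by_port`, `sub_burst_cb` and `count_pkts` are hypothetical. Runs of
consecutive packets from the same port are handed to the callback as one
sub-burst.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct rte_mbuf: only the field this sketch needs. */
struct pkt {
	uint16_t port; /* source port ID, like mbuf->port */
};

/* Hypothetical per-sub-burst handler, playing the role of a fwd-engine callback. */
typedef void (*sub_burst_cb)(uint16_t port, struct pkt **pkts, uint16_t n,
			     void *arg);

/*
 * Split a burst into runs of consecutive packets that share a source port
 * and pass each run to cb. Returns the number of sub-bursts dispatched.
 */
uint16_t
split_by_port(struct pkt **pkts, uint16_t nb_rx, sub_burst_cb cb, void *arg)
{
	uint16_t start = 0, i, nb_sub = 0;

	for (i = 1; i <= nb_rx; ++i) {
		/* i == nb_rx is tested first, so pkts[nb_rx] is never read. */
		if (i == nb_rx || pkts[i]->port != pkts[start]->port) {
			cb(pkts[start]->port, &pkts[start],
			   (uint16_t)(i - start), arg);
			nb_sub++;
			start = i;
		}
	}
	return nb_sub;
}

/* Example handler: accumulate the number of packets seen into *arg. */
void
count_pkts(uint16_t port, struct pkt **pkts, uint16_t n, void *arg)
{
	(void)port;
	(void)pkts;
	*(uint32_t *)arg += n;
}
```

With this shape, the cost grows with the number of port changes inside a
burst, so traffic coming from a few ports yields few, large sub-bursts.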

> In theory, if implemented in generic and extendable way - that
> might be a useful add-on to tespmd fwd functionality.
> But current implementation looks very case specific.
> And as I don't think it is a common case, I don't see much point to
> pollute
> basic fwd cases with it.

Shared Rxq is an ethdev feature that impacts how packets get handled,
so it's natural to update the forwarding engines to keep them from
breaking. The new macro is introduced to minimize performance impact;
I'm also wondering whether there is a more elegant solution :) The
current performance penalty is one "if unlikely" per burst.

Looking at it from the other direction: if we don't update the
fwd-engines here, all of them malfunction when shared rxq is enabled and
users can't verify driver features. Is that what you expect?

> 
> BTW, as a side note, the code below looks bogus to me:
> +void
> +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> +		   struct rte_mbuf **pkts_burst, packet_fwd_cb
> fwd)	
> +{
> +	uint16_t i, nb_fs_rx = 1, port;
> +
> +	/* Locate real source fs according to mbuf->port. */
> +	for (i = 0; i < nb_rx; ++i) {
> +		rte_prefetch0(pkts_burst[i + 1]);
> 
> you access pkt_burst[] beyond array boundaries,
> also you ask cpu to prefetch some unknown and possibly invalid
> address.
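
One possible way to keep the prefetch inside the array is to guard it on
the index, sketched below. This is a standalone illustration, not the
patch code: `prefetch0` is a hypothetical stand-in for rte_prefetch0()
built on the GCC/Clang `__builtin_prefetch` builtin, and `walk_burst` only
counts how many prefetches it issues so the bound is easy to check.

```c
#include <stdint.h>

/* Stand-in for rte_prefetch0(); assumes a GCC- or Clang-compatible compiler. */
static inline void
prefetch0(const void *p)
{
	__builtin_prefetch(p, 0, 3);
}

/*
 * Walk a burst, prefetching the next element only while it exists.
 * Returns the number of prefetches issued.
 */
uint16_t
walk_burst(void **pkts, uint16_t nb_rx)
{
	uint16_t i, nb_prefetch = 0;

	for (i = 0; i < nb_rx; ++i) {
		/* The last iteration has no successor, so skip the prefetch. */
		if (i + 1 < nb_rx) {
			prefetch0(pkts[i + 1]);
			nb_prefetch++;
		}
		/* ... process pkts[i] here ... */
	}
	return nb_prefetch;
}
```

The extra compare is per packet rather than per burst, so whether it is
acceptable in the fastpath would need measuring.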
> 
> > Based on the patch, I believe the
> > impact has been minimized.
> > 
> > > 
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > represontors? If so can't we
> > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > scope can reduce the
> > > > > > > > > > complexity it brings?
> > > > > > > > > > 
> > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > representor the case by changing its
> > > > name and
> > > > > > > > > > > > scope.
> > > > > > > > > > > 
> > > > > > > > > > > It works for both PF and representors in same
> > > > > > > > > > > switch
> > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > apply.
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representat
> > > > > > > > > > > > > > > ion.
> > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > >  7 +++++++
> > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner
> > > > > > > > > > > > > > > packet L4
> > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_cap
> > > > > > > > > > > > > > > a:DE
> > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Supports shared Rx queue for ports in
> > > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ`
> > > > > > > > > > > > > > > `.
> > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand
> > > > > > > > > > > > > > > =
> > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a
> > > > > > > > > > > > > > > software
> > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver
> > > > > > > > > > > > > > > model
> > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation
> > > > > > > > > > > > > > > /net
> > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +- Memory usage of representors is huge
> > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > +  because PMD always allocate mbuf for
> > > > > > > > > > > > > > > each
> > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > +  Polling the large number of ports
> > > > > > > > > > > > > > > brings
> > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > +  is present in Rx offloading capability
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > +  offloading flag in device Rx mode or
> > > > > > > > > > > > > > > Rx
> > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > +  shared Rx queue. Polling any member
> > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > +  packets of all ports in group, port ID
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_
> > > > > > > > > > > > > > > CKSU
> > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER
> > > > > > > > > > > > > > > _SPL
> > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct
> > > > > > > > > > > > > > > rte_eth_rxconf {
> > > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop
> > > > > > > > > > > > > > > packets
> > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**<
> > > > > > > > > > > > > > > Do
> > > > > > > > > > > > > > > not start queue with rte_eth_dev_start().
> > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > +       uint32_t shared_group; /**<
> > > > > > > > > > > > > > > Shared
> > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > > >          * Per-queue Rx offloads to be
> > > > > > > > > > > > > > > set
> > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@
> > > > > > > > > > > > > > > struct
> > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > + * Rx queue is shared among ports in
> > > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > + * Real source port number saved in
> > > > > > > > > > > > > > > mbuf-
> > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > >                                  DEV_RX_O
> > > > > > > > > > > > > > > FFLO
> > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29  0:26                             ` Ananyev, Konstantin
  2021-09-29  8:40                               ` Xueming(Steven) Li
@ 2021-09-29  9:12                               ` Xueming(Steven) Li
  2021-09-29  9:52                                 ` Ananyev, Konstantin
  1 sibling, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-29  9:12 UTC (permalink / raw)
  To: jerinjacobk, konstantin.ananyev
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > could
> > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > question.
> > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > port
> > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > implementation.
> > > > > > > > > > > > 
> > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > ethdev receive queues land into
> > > > the same
> > > > > > > > > > > > receive queue, In that case, how the flow order
> > > > > > > > > > > > is
> > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > 
> > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > The
> > > > > > > > > > > forwarding logic has to be changed slightly in
> > > > > > > > > > > case
> > > > > > > > > > > of shared rxq.
> > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > lookup
> > > > > > > > > > > source stream according to mbuf->port, forwarding
> > > > > > > > > > > to
> > > > > > > > > > > target fs.
> > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > a
> > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > performance if traffic
> > > > come from
> > > > > > > > > > > limited ports. I'll introduce some common api to
> > > > > > > > > > > do
> > > > > > > > > > > shard rxq forwarding, call it with packets
> > > > > > > > > > > handling
> > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > (share queue), right? Does
> > > > > > > > > > this means only single core will poll only, what
> > > > > > > > > > will
> > > > > > > > > > happen if there are
> > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > 
> > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > application, I am not sure about the solution. Can't
> > > > > > > > > > this work in a way that is transparent to the
> > > > > > > > > > application?
> > > > > > > > > 
> > > > > > > > > Discussed with Jerin: a new API introduced in v3 2/8
> > > > > > > > > aggregates the ports of a group into one new port.
> > > > > > > > > Users can schedule polling on the aggregated port
> > > > > > > > > instead of on all member ports.
> > > > > > > > 
> > > > > > > > The v3 still has testpmd changes in the fastpath, right?
> > > > > > > > IMO, for this feature we should not change the fastpath
> > > > > > > > of the testpmd application. Instead, testpmd can use
> > > > > > > > aggregated ports, probably as a separate fwd_engine, to
> > > > > > > > show how to use this feature.
> > > > > > > 
> > > > > > > Good point to discuss :) There are two strategies for
> > > > > > > polling a shared Rxq:
> > > > > > > 1. polling each member port
> > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > >    Does your PMD support this?
> > > > > > 
> > > > > > Unfortunately not. More than that, every application needs
> > > > > > to change to support this model.
> > > > > 
> > > > > Both strategies need the user application to resolve the port
> > > > > ID from the mbuf and process accordingly.
> > > > > This one doesn't demand an aggregated port and needs no
> > > > > polling schedule change.
> > > > 
> > > > I was thinking the mbuf would be updated by the driver/aggregator
> > > > port as and when it comes to the application.
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 2. polling aggregated port
> > > > > > >    Besides the forwarding engine, more work is needed to
> > > > > > >    demo it.
> > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > 
> > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > it, i.e. without application changes in the fastpath logic.
> > > > > 
> > > > > The fastpath has to resolve the port ID anyway and forward
> > > > > according to its logic. Forwarding engines need to adapt to
> > > > > support shared Rxq. Fortunately, in testpmd, this can be done
> > > > > with an abstract API.
> > > > > 
> > > > > Let's defer part 2 until some PMD really supports it and it
> > > > > is tested, what do you think?
> > > > 
> > > > We are not planning to use this feature, so either way is OK
> > > > with me.
> > > > I leave it to the ethdev maintainers to decide between 1 and 2.
> > > > 
> > > > I do have a strong opinion against changing the testpmd basic
> > > > forward engines for this feature. I would like to keep them
> > > > simple and fastpath-optimized, and would like to add a separate
> > > > forwarding engine as a means to verify this feature.
> > > 
> > > +1 to that.
> > > I don't think it is a 'common' feature.
> > > So a separate FWD mode seems like the best choice to me.
> > 
> > -1 :)
> > There was an internal requirement from the test team: they need to
> > verify that all features such as packet content, RSS, VLAN,
> > checksum, rte_flow... work on top of a shared Rx queue.
> 
> Then I suppose you'll need to write a really comprehensive fwd-engine
> to satisfy your test team :)
> Speaking seriously, I still don't understand why you need all
> available fwd-engines to verify this feature.
> From what I understand, the main purpose of your changes to testpmd
> is to allow forwarding packets through a different fwd_stream (TX
> through a different HW queue).
> In theory, if implemented in a generic and extendable way, that
> might be a useful add-on to testpmd fwd functionality.
> But the current implementation looks very case specific.
> And as I don't think it is a common case, I don't see much point in
> polluting the basic fwd cases with it.
> 
> BTW, as a side note, the code below looks bogus to me:
> +void
> +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> +		   struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)
> +{
> +	uint16_t i, nb_fs_rx = 1, port;
> +
> +	/* Locate real source fs according to mbuf->port. */
> +	for (i = 0; i < nb_rx; ++i) {
> +		rte_prefetch0(pkts_burst[i + 1]);
> 
> You access pkts_burst[] beyond the array boundaries, and you also
> ask the CPU to prefetch some unknown and possibly invalid address.

Sorry, I forgot this topic. It's too late to prefetch the current
packet, so prefetching the next one is better. Prefetching an invalid
address at the end of a loop doesn't hurt; it's common in DPDK.
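
For illustration only, here is a minimal standalone sketch of the per-port
grouping discussed above, with the next-packet prefetch guarded so that it
never indexes past the burst. It is not the testpmd code: struct pkt,
run_cb, record_run() and split_burst_by_port() are hypothetical names
standing in for rte_mbuf and the forwarding callback, and
__builtin_prefetch (a GCC/Clang builtin) stands in for rte_prefetch0().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only the source port matters. */
struct pkt {
	uint16_t port;
};

/* Hypothetical callback: handles a run of packets from one source port. */
typedef void (*run_cb)(uint16_t port, struct pkt **pkts, uint16_t nb,
		       void *arg);

/* Example callback state: counts runs and packets, remembers the last port. */
struct run_stats {
	uint16_t runs;
	uint16_t pkts;
	uint16_t last_port;
};

static void
record_run(uint16_t port, struct pkt **pkts, uint16_t nb, void *arg)
{
	struct run_stats *st = arg;

	(void)pkts;
	st->runs++;
	st->pkts += nb;
	st->last_port = port;
}

/*
 * Split a burst into runs of consecutive packets with the same source port
 * and hand each run to cb. Returns the number of runs.
 */
static uint16_t
split_burst_by_port(struct pkt **pkts, uint16_t nb_rx, run_cb cb, void *arg)
{
	uint16_t i, start = 0, runs = 0;

	assert(cb != NULL);
	for (i = 0; i < nb_rx; i++) {
		/* Prefetch the next packet only if it is inside the burst. */
		if (i + 1 < nb_rx)
			__builtin_prefetch(pkts[i + 1]);
		if (pkts[i]->port != pkts[start]->port) {
			/* Source port changed: flush the run [start, i). */
			cb(pkts[start]->port, &pkts[start],
			   (uint16_t)(i - start), arg);
			runs++;
			start = i;
		}
	}
	if (nb_rx > 0) {
		/* Flush the final run [start, nb_rx). */
		cb(pkts[start]->port, &pkts[start],
		   (uint16_t)(nb_rx - start), arg);
		runs++;
	}
	return runs;
}
```

The guard costs one compare per packet; whether that is measurable
against an unguarded prefetch would have to be benchmarked.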

> 
> > Based on the patch, I believe the
> > impact has been minimized.
> > 
> > > 
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > representors? If so, can't we have a solution
> > > > > > > > > > specific to port representors? Reducing the scope can
> > > > > > > > > > reduce the complexity it brings.
> > > > > > > > > > 
> > > > > > > > > > > > If this offload is only useful for the representor
> > > > > > > > > > > > case, can we make it specific to representors by
> > > > > > > > > > > > changing its name and scope?
> > > > > > > > > > > 
> > > > > > > > > > > It works for both PF and representors in the same
> > > > > > > > > > > switch domain; for applications like OVS, there are
> > > > > > > > > > > few changes to apply.
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > > > >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH         0x00080000
> > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > >