From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side
Date: Wed, 22 Mar 2023 15:04:24 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D877E4@smartserver.smartshare.dk>
References: <20220420081650.2043183-1-feifei.wang2@arm.com> <20230104073043.1120168-1-feifei.wang2@arm.com> <98CBD80474FA8B44BF855DF32C47DC35D877E3@smartserver.smartshare.dk>
From: Morten Brørup
To: "Honnappa Nagarahalli", "Feifei Wang"
Cc: "nd", "Yuying Zhang", "Beilei Xing", "Ruifeng Wang"
List-Id: DPDK patches and discussions

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 22 March 2023 14.42
>
> > From: Morten Brørup
> > Sent: Wednesday, March 22, 2023 7:57 AM
> >
> > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > Sent: Wednesday, 4 January 2023 08.31
> > >
> > > Currently, the transmit side
frees the buffers into the lcore cache
> > > and the receive side allocates buffers from the lcore cache. The
> > > transmit side typically frees 32 buffers resulting in 32*8=256B of
> > > stores to lcore cache. The receive side allocates 32 buffers and
> > > stores them in the receive side software ring, resulting in 32*8=256B
> > > of stores and 256B of loads from the lcore cache.
> > >
> > > This patch proposes a mechanism to avoid freeing to/allocating from
> > > the lcore cache, i.e. the receive side will free the buffers from
> > > transmit side directly into its software ring. This will avoid the
> > > 256B of loads and stores introduced by the lcore cache. It also frees
> > > up the cache lines used by the lcore cache.
> >
> > I am starting to wonder if we have been adding unnecessary feature creep in
> > order to make this feature too generic.
> Can you please elaborate on the feature creep you are thinking of? The features
> have been the same since the first implementation, but it is made more
> generic.

Maybe not "features" as such; but the API has evolved, and perhaps we could simplify both the API and the implementation if we narrowed the scope.

I'm not saying that what we have is bad or too complex; I'm only asking to consider if there are opportunities to take a step back and simplify some things.

>
> >
> > Could you please describe some of the most important high-volume use cases
> > from real life? It would help setting the scope correctly.
> The use cases have been discussed several times already.

Yes, but they should be mentioned in the patch cover letter - and later on in the documentation.

It will help limit the scope while developing this feature. And it will make it easier for application developers to relate to the feature and determine if it is relevant for their application.

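As a back-of-envelope aid for that kind of relevance check, the per-burst pointer traffic quoted in the cover letter can be written down as a small model. This is only a sketch of the arithmetic (32 buffers, 8-byte pointers); the struct and function names are hypothetical and not part of the patch or of any DPDK API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping struct: pointer traffic per burst, in bytes. */
struct burst_traffic {
    size_t lcore_cache_stores; /* Tx free path: pointers written to the lcore cache */
    size_t lcore_cache_loads;  /* Rx alloc path: pointers read from the lcore cache */
    size_t sw_ring_stores;     /* Rx path: pointers written to the Rx software ring */
};

/* Baseline described in the cover letter: buffers pass through the lcore cache. */
static inline struct burst_traffic
traffic_via_lcore_cache(size_t n_bufs, size_t ptr_size)
{
    struct burst_traffic t = {
        .lcore_cache_stores = n_bufs * ptr_size,
        .lcore_cache_loads = n_bufs * ptr_size,
        .sw_ring_stores = n_bufs * ptr_size,
    };
    return t;
}

/* Direct rearm: the lcore cache is bypassed, only the sw ring stores remain. */
static inline struct burst_traffic
traffic_direct_rearm(size_t n_bufs, size_t ptr_size)
{
    struct burst_traffic t = {
        .lcore_cache_stores = 0,
        .lcore_cache_loads = 0,
        .sw_ring_stores = n_bufs * ptr_size,
    };
    return t;
}
```

With n_bufs = 32 and ptr_size = 8, the baseline gives 256B for each of the three fields, and the direct-rearm variant keeps only the 256B of software ring stores.
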
>
> >
> > >
> > > However, this solution poses several constraints:
> > >
> > > 1) The receive queue needs to know which transmit queue it should take
> > > the buffers from. The application logic decides which transmit port to
> > > use to send out the packets. In many use cases the NIC might have a
> > > single port ([1], [2], [3]), in which case a given transmit queue is
> > > always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> > > is easy to configure.
> > >
> > > If the NIC has 2 ports (there are several references), then we will
> > > have a 1:2 (Rx queue: Tx queue) mapping, which is still easy to
> > > configure. However, if this is generalized to 'N' ports, the
> > > configuration can be long. Moreover, the PMD would have to scan a list
> > > of transmit queues to pull the buffers from.
> > >
> > > 2) The other factor that needs to be considered is 'run-to-completion'
> > > vs 'pipeline' models. In the run-to-completion model, the receive side
> > > and the transmit side are running on the same lcore serially. In the
> > > pipeline model, the receive side and transmit side might be running on
> > > different lcores in parallel. This requires locking, which is not
> > > supported at this point.
> > >
> > > 3) Tx and Rx buffers must be from the same mempool. We must also
> > > ensure the Tx buffer free number is equal to the Rx buffer free number.
> > > Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
> > > is because tx_next_dd is the variable used to compute the Tx sw-ring
> > > free location. Its value is one round ahead of the position where the
> > > next free starts.
> > >
> > > Current status in this patch:
> > > 1) Two APIs are added for users to enable direct-rearm mode:
> > > In control plane, users can call 'rte_eth_rx_queue_rearm_data_get'
> > > to get Rx sw_ring pointer and its rxq_info.
> > > (This avoids Tx loading Rx data directly);
> > >
> > > In data plane, users can call 'rte_eth_dev_direct_rearm' to rearm Rx
> > > buffers and free Tx buffers at the same time. Specifically, this
> > > API consists of two separate APIs, one for Rx and one for Tx.
> > > For Tx, 'rte_eth_tx_fill_sw_ring' can fill a given sw_ring with Tx
> > > freed buffers.
> > > For Rx, 'rte_eth_rx_flush_descriptor' can flush its descriptors based
> > > on the rearm buffers.
> > > Thus, this can separate Rx and Tx operation, and the user can even
> > > re-arm an Rx queue not from the same driver's Tx queue, but from
> > > different sources too.
> > > -----------------------------------------------------------------------
> > > control plane:
> > >     rte_eth_rx_queue_rearm_data_get(*rxq_rearm_data);
> > > data plane:
> > >     loop {
> > >         rte_eth_dev_direct_rearm(*rxq_rearm_data){
> > >
> > >             rte_eth_tx_fill_sw_ring{
> > >                 for (i = 0; i < 32; i++) {
> > >                     sw_ring.mbuf[i] = tx.mbuf[i];
> > >                 }
> > >             }
> > >
> > >             rte_eth_rx_flush_descriptor{
> > >                 for (i = 0; i < 32; i++) {
> > >                     flush descs[i];
> > >                 }
> > >             }
> > >         }
> > >         rte_eth_rx_burst;
> > >         rte_eth_tx_burst;
> > >     }
> > > -----------------------------------------------------------------------
> > > 2) The i40e driver is changed to do the direct re-arm of the receive
> > > side.
> > > 3) The ixgbe driver is changed to do the direct re-arm of the receive
> > > side.
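
To make the data-plane step quoted above easier to relate to, here is a minimal, self-contained model of the Tx-to-Rx pointer hand-over in a run-to-completion setup. It is a sketch under stated assumptions and not the patch's implementation: 'struct mbuf', 'struct txq_model', 'struct rxq_model', 'direct_rearm_model' and RING_SIZE are hypothetical stand-ins, a NULL slot stands in for "Tx not yet completed", and descriptor flushing is reduced to advancing an index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BURST 32      /* buffers moved per call, as in the cover letter */
#define RING_SIZE 128 /* hypothetical ring size; a multiple of BURST */

/* Hypothetical stand-in for struct rte_mbuf. */
struct mbuf { uint32_t id; };

/* Hypothetical Tx queue model: software ring and the next slot to free. */
struct txq_model {
    struct mbuf *sw_ring[RING_SIZE];
    uint16_t next_free;
};

/* Hypothetical Rx queue model: software ring and the next slot to rearm. */
struct rxq_model {
    struct mbuf *sw_ring[RING_SIZE];
    uint16_t rearm_start;
};

/*
 * Move BURST transmitted buffers straight from the Tx software ring into
 * the Rx software ring, without going through a per-lcore cache.
 * Returns BURST on success, or 0 if the Tx burst is not fully available.
 */
static inline uint16_t
direct_rearm_model(struct txq_model *txq, struct rxq_model *rxq)
{
    uint16_t i;

    /* All-or-nothing, so the Tx free count always equals the Rx rearm count. */
    for (i = 0; i < BURST; i++)
        if (txq->sw_ring[txq->next_free + i] == NULL)
            return 0;

    for (i = 0; i < BURST; i++) {
        rxq->sw_ring[rxq->rearm_start + i] = txq->sw_ring[txq->next_free + i];
        txq->sw_ring[txq->next_free + i] = NULL;
    }

    /* Both indices advance by the same amount and wrap at the ring size. */
    txq->next_free = (uint16_t)((txq->next_free + BURST) % RING_SIZE);
    rxq->rearm_start = (uint16_t)((rxq->rearm_start + BURST) % RING_SIZE);
    return BURST;
}
```

The all-or-nothing check mirrors constraint 3 in the cover letter (equal Tx free and Rx rearm counts). The model has no locking, so it only covers the case where Rx and Tx run serially on the same lcore.
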
> > >
> > > Testing status:
> > > (1) dpdk l3fwd test with multiple drivers:
> > >     port 0: 82599 NIC   port 1: XL710 NIC
> > > -------------------------------------------------------------
> > >               Without fast free     With fast free
> > > Thunderx2:    +9.44%                +7.14%
> > > -------------------------------------------------------------
> > >
> > > (2) dpdk l3fwd test with same driver:
> > >     port 0 && 1: XL710 NIC
> > > -------------------------------------------------------------
> > > *Direct rearm with exposing rx_sw_ring:
> > >               Without fast free     With fast free
> > > Ampere altra: +14.98%               +15.77%
> > > n1sdp:        +6.47%                +0.52%
> > > -------------------------------------------------------------
> > >
> > > (3) VPP test with same driver:
> > >     port 0 && 1: XL710 NIC
> > > -------------------------------------------------------------
> > > *Direct rearm with exposing rx_sw_ring:
> > > Ampere altra: +4.59%
> > > n1sdp:        +5.4%
> > > -------------------------------------------------------------
> > >
> > > Reference:
> > > [1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
> > > [2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
> > > [3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g
> > >
> > > V2:
> > > 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
> > > 2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
> > > 3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
> > > 4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
> > >
> > > V3:
> > > 1. Separate Rx and Tx operation with two APIs in direct-rearm
> > > (Konstantin)
> > > 2. Delete L3fwd change for direct rearm (Jerin)
> > > 3.
enable direct rearm in ixgbe driver in Arm
> > >
> > > Feifei Wang (3):
> > >   ethdev: enable direct rearm with separate API
> > >   net/i40e: enable direct rearm with separate API
> > >   net/ixgbe: enable direct rearm with separate API
> > >
> > >  drivers/net/i40e/i40e_ethdev.c            |   1 +
> > >  drivers/net/i40e/i40e_ethdev.h            |   2 +
> > >  drivers/net/i40e/i40e_rxtx.c              |  19 +++
> > >  drivers/net/i40e/i40e_rxtx.h              |   4 +
> > >  drivers/net/i40e/i40e_rxtx_vec_common.h   |  54 +++++++
> > >  drivers/net/i40e/i40e_rxtx_vec_neon.c     |  42 ++++++
> > >  drivers/net/ixgbe/ixgbe_ethdev.c          |   1 +
> > >  drivers/net/ixgbe/ixgbe_ethdev.h          |   3 +
> > >  drivers/net/ixgbe/ixgbe_rxtx.c            |  19 +++
> > >  drivers/net/ixgbe/ixgbe_rxtx.h            |   4 +
> > >  drivers/net/ixgbe/ixgbe_rxtx_vec_common.h |  48 ++++++
> > >  drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c   |  52 +++++++
> > >  lib/ethdev/ethdev_driver.h                |  10 ++
> > >  lib/ethdev/ethdev_private.c               |   2 +
> > >  lib/ethdev/rte_ethdev.c                   |  52 +++++++
> > >  lib/ethdev/rte_ethdev.h                   | 174 ++++++++++++++++++++++
> > >  lib/ethdev/rte_ethdev_core.h              |  11 ++
> > >  lib/ethdev/version.map                    |   6 +
> > >  18 files changed, 504 insertions(+)
> > >
> > > --
> > > 2.25.1
> > >
>