From: Feifei Wang
To:
Cc: dev@dpdk.org, nd@arm.com, Feifei Wang
Subject: [PATCH v6 0/4] Recycle mbufs from Tx queue to Rx queue
Date: Thu, 25 May 2023 17:45:37 +0800
Message-Id: <20230525094541.331338-1-feifei.wang2@arm.com>
In-Reply-To: <20211224164613.32569-1-feifei.wang2@arm.com>
References: <20211224164613.32569-1-feifei.wang2@arm.com>

Currently, the transmit side frees buffers into the lcore cache and the
receive side allocates buffers from the lcore cache. The transmit side
typically frees 32 buffers, resulting in 32*8=256B of stores to the lcore
cache. The receive side allocates 32 buffers and stores them in the receive
side software ring, resulting in 32*8=256B of stores and 256B of loads from
the lcore cache.

This patch set proposes a mechanism to avoid freeing to / allocating from
the lcore cache: the receive side frees the buffers from the transmit side
directly into its software ring. This avoids the 256B of loads and stores
introduced by the lcore cache and also frees up the cache lines used by the
lcore cache. We call this mode mbufs recycle mode.

In the latest version, mbufs recycle mode is packaged as a separate API.
This allows users to change the rxq/txq pairing at run time in the data
plane, according to the application's analysis of the packet flow, for
example:
-----------------------------------------------------------------------
Step 1: upper application analyses the flow direction
Step 2: recycle_rxq_info = rte_eth_recycle_rx_queue_info_get(rx_portid,
        rx_queueid)
Step 3: rte_eth_recycle_mbufs(rx_portid, rx_queueid, tx_portid, tx_queueid,
        recycle_rxq_info);
Step 4: rte_eth_rx_burst(rx_portid, rx_queueid);
Step 5: rte_eth_tx_burst(tx_portid, tx_queueid);
-----------------------------------------------------------------------
The above lets the user change the rxq/txq pairing at run time without
knowing the direction of a flow in advance, which effectively expands the
use scenarios of mbufs recycle mode.

Furthermore, mbufs recycle mode is no longer limited to a single PMD: it can
move mbufs between PMDs from different vendors, and can even place the mbufs
anywhere in your Rx mbuf ring, as long as the address of the mbuf ring can
be provided.
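
For illustration, below is a minimal sketch of the data-plane loop outlined
in the steps above. It uses the API names introduced by this series
(rte_eth_recycle_rx_queue_info_get(), rte_eth_recycle_mbufs()) together with
the standard burst calls; the struct name rte_eth_recycle_rxq_info and the
exact prototypes are assumptions based on patch 1/4 and may differ in
detail:
-----------------------------------------------------------------------
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Forwarding loop: Rx on (rx_port, rx_queue), Tx on (tx_port, tx_queue).
 * The recycle call moves mbufs freed by the Tx queue straight into the
 * Rx software ring before the next rte_eth_rx_burst().
 */
static void
recycle_fwd_loop(uint16_t rx_port, uint16_t rx_queue,
		 uint16_t tx_port, uint16_t tx_queue)
{
	struct rte_eth_recycle_rxq_info recycle_rxq_info;
	struct rte_mbuf *pkts[BURST_SIZE];

	/* Step 2: query the Rx queue's ring info once, or again whenever
	 * the application re-pairs rxq/txq at run time.
	 */
	if (rte_eth_recycle_rx_queue_info_get(rx_port, rx_queue,
					      &recycle_rxq_info) != 0)
		return;

	for (;;) {
		/* Step 3: refill the Rx ring with mbufs freed by the Tx queue. */
		rte_eth_recycle_mbufs(rx_port, rx_queue, tx_port, tx_queue,
				      &recycle_rxq_info);

		/* Steps 4-5: the usual receive/transmit burst pair. */
		uint16_t nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts,
						  BURST_SIZE);
		if (nb_rx == 0)
			continue;

		uint16_t nb_tx = rte_eth_tx_burst(tx_port, tx_queue, pkts,
						  nb_rx);
		/* Drop whatever the Tx queue could not take. */
		while (nb_tx < nb_rx)
			rte_pktmbuf_free(pkts[nb_tx++]);
	}
}
-----------------------------------------------------------------------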
In the latest version, we enable mbufs recycle mode in the i40e and ixgbe
pmds, and also try using the i40e driver on Rx with the ixgbe driver on Tx,
achieving a 7-9% performance improvement with mbufs recycle mode.

Difference between mbufs recycle, the ZC API used in mempool, and the
general path:

For the general path:
        Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
        Tx: 32 pkts memcpy from tx_sw_ring to temporary variable +
            32 pkts memcpy from temporary variable to mempool cache
For the ZC API used in mempool:
        Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
        Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
        Reference: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For mbufs recycle:
        Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring

Thus, in one loop, mbufs recycle mode saves 32+32=64 pkts of memcpy compared
to the general path, and 32 pkts of memcpy per loop compared to the ZC API
used in mempool. So mbufs recycle mode has its own benefits.
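
To make the comparison concrete, here is a conceptual sketch (not the actual
driver code) of what the recycle step amounts to: one pointer copy per
packet from the Tx software ring to the Rx software ring. The ring size and
index handling are hypothetical, descriptor refill is omitted, and the
ring-wrapping check added in v4 is reduced to a power-of-two mask:
-----------------------------------------------------------------------
#include <stdint.h>
#include <rte_mbuf.h>

#define RING_SIZE 512	/* hypothetical sw ring size, power of two */

/* Conceptual view: mbufs recycle copies the freed Tx entries directly
 * into the Rx software ring. The general path instead copies
 * tx_sw_ring -> mempool cache on Tx and mempool cache -> rx_sw_ring on
 * Rx, i.e. two pointer copies per packet instead of one.
 */
static inline void
recycle_burst(struct rte_mbuf **tx_sw_ring, uint16_t tx_head,
	      struct rte_mbuf **rx_sw_ring, uint16_t rx_tail, uint16_t n)
{
	for (uint16_t i = 0; i < n; i++)
		rx_sw_ring[(rx_tail + i) & (RING_SIZE - 1)] =
			tx_sw_ring[(tx_head + i) & (RING_SIZE - 1)];
}
-----------------------------------------------------------------------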
Testing status:
(1) dpdk l3fwd test with multiple drivers:
    port 0: 82599 NIC   port 1: XL710 NIC
-------------------------------------------------------------
		Without fast free	With fast free
Thunderx2:	+7.53%			+13.54%
-------------------------------------------------------------

(2) dpdk l3fwd test with the same driver:
    port 0 && 1: XL710 NIC
-------------------------------------------------------------
		Without fast free	With fast free
Ampere altra:	+12.61%			+11.42%
n1sdp:		+8.30%			+3.85%
x86-sse:	+8.43%			+3.72%
-------------------------------------------------------------

(3) Performance comparison with ZC_mempool used:
    port 0 && 1: XL710 NIC  with fast free
-------------------------------------------------------------
		With recycle buffer	With zc_mempool
Ampere altra:	11.42%			3.54%
-------------------------------------------------------------

Furthermore, we add a recycle_mbufs engine in testpmd. Because the XL710 NIC
has an I/O bottleneck in testpmd on Ampere Altra, we cannot see a throughput
change compared with the I/O fwd engine. However, using the record command
in testpmd:
'set record-burst-stats on'
we can see that the ratio of 'Rx/Tx burst size of 32' is reduced. This
indicates that mbufs recycle mode can save CPU cycles.

V2:
1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)

V3:
1. Separate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
2. Delete l3fwd change for direct rearm (Jerin)
3. Enable direct rearm in the ixgbe driver on Arm

v4:
1. Rename direct-rearm as buffer recycle. Based on this, function and
   variable names are changed to make this mode more general for all
   drivers. (Konstantin, Morten)
2. Add ring wrapping check (Konstantin)

v5:
1. Some changes to the ethdev API (Morten)
2. Add support for the avx2, sse and altivec paths

v6:
1. Fix ixgbe build issue on ppc
2. Remove 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill' API
   wrappers (Tech Board meeting)
3. Add recycle_mbufs engine in testpmd (Tech Board meeting)
4. Add namespace in the functions related to mbufs recycle (Ferruh)

Feifei Wang (4):
  ethdev: add API for mbufs recycle mode
  net/i40e: implement mbufs recycle mode
  net/ixgbe: implement mbufs recycle mode
  app/testpmd: add recycle mbufs engine

 app/test-pmd/meson.build                          |   1 +
 app/test-pmd/recycle_mbufs.c                      |  79 ++++++++
 app/test-pmd/testpmd.c                            |   1 +
 app/test-pmd/testpmd.h                            |   3 +
 doc/guides/rel_notes/release_23_07.rst            |   7 +
 doc/guides/testpmd_app_ug/run_app.rst             |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst       |   5 +-
 drivers/net/i40e/i40e_ethdev.c                    |   1 +
 drivers/net/i40e/i40e_ethdev.h                    |   2 +
 .../net/i40e/i40e_recycle_mbufs_vec_common.c      | 140 ++++++++++++++
 drivers/net/i40e/i40e_rxtx.c                      |  32 +++
 drivers/net/i40e/i40e_rxtx.h                      |   4 +
 drivers/net/i40e/meson.build                      |   2 +
 drivers/net/ixgbe/ixgbe_ethdev.c                  |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h                  |   3 +
 .../ixgbe/ixgbe_recycle_mbufs_vec_common.c        | 136 +++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.c                    |  29 +++
 drivers/net/ixgbe/ixgbe_rxtx.h                    |   4 +
 drivers/net/ixgbe/meson.build                     |   2 +
 lib/ethdev/ethdev_driver.h                        |  10 +
 lib/ethdev/ethdev_private.c                       |   2 +
 lib/ethdev/rte_ethdev.c                           |  31 +++
 lib/ethdev/rte_ethdev.h                           | 182 ++++++++++++++++++
 lib/ethdev/rte_ethdev_core.h                      |  15 +-
 lib/ethdev/version.map                            |   4 +
 25 files changed, 694 insertions(+), 3 deletions(-)
 create mode 100644 app/test-pmd/recycle_mbufs.c
 create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
 create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c

-- 
2.25.1