Subject: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
Date: Sun, 26 Dec 2021 11:25:26 +0100
From: Morten Brørup
To: Feifei Wang
List-Id: DPDK patches and discussions (dev@dpdk.org)
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D86DAF@smartserver.smartshare.dk>
In-Reply-To: <20211224164613.32569-1-feifei.wang2@arm.com>

> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Friday, 24 December 2021 17.46
>
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers, resulting in 32*8=256B of stores to the
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of loads from the lcore cache.
>
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache, i.e. the receive side will free the buffers from the
> transmit side directly into its software ring. This avoids the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache.
>
> However, this solution poses several constraints:
>
> 1) The receive queue needs to know which transmit queue it should take
> the buffers from. The application logic decides which transmit port to
> use to send out the packets. In many use cases the NIC might have a
> single port ([1], [2], [3]), in which case a given transmit queue is
> always mapped to a single receive queue (1:1 Rx queue : Tx queue). This
> is easy to configure.
>
> If the NIC has 2 ports (there are several references), then we will
> have a 1:2 (Rx queue : Tx queue) mapping, which is still easy to
> configure. However, if this is generalized to 'N' ports, the
> configuration can be long. Moreover, the PMD would have to scan a list
> of transmit queues to pull the buffers from.

I disagree with the description of this constraint.

As I understand it, it doesn't matter how many ports or queues are in a
NIC or system.

The constraint is narrower: this patch requires that all packets
ingressing on a given port/queue must egress on the specific port/queue
from which that receive queue has been configured to re-arm its buffers.
I.e. an application cannot route packets between multiple ports with
this patch.
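To illustrate what that means for the fast path (my own sketch, not code
from the patch; the port/queue parameters are just examples): a
run-to-completion forwarding loop is pinned to one egress queue per
ingress queue, roughly like this:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Sketch only: with direct re-arm, the egress queue is fixed by the
 * Rx->Tx mapping, so the loop cannot make a per-packet routing decision. */
static void
forward_one_burst(uint16_t rx_port, uint16_t rx_queue,
                  uint16_t paired_tx_port, uint16_t paired_tx_queue)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	uint16_t nb_rx, nb_tx, i;

	nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts, BURST_SIZE);

	/* Every received packet must go to the one Tx queue that re-arms
	 * this Rx queue; any other egress would break the buffer recycling. */
	nb_tx = rte_eth_tx_burst(paired_tx_port, paired_tx_queue, pkts, nb_rx);

	/* Free whatever the Tx queue could not accept. */
	for (i = nb_tx; i < nb_rx; i++)
		rte_pktmbuf_free(pkts[i]);
}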
>
> 2) The other factor that needs to be considered is the
> 'run-to-completion' vs. 'pipeline' model. In the run-to-completion
> model, the receive side and the transmit side run on the same lcore
> serially. In the pipeline model, the receive side and the transmit side
> might run on different lcores in parallel. This requires locking, which
> is not supported at this point.
>
> 3) Tx and Rx buffers must be from the same mempool. And we also must
> ensure that the Tx buffer free number is equal to the Rx buffer free
> number:
> (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
> Thus, 'tx_next_dd' can be updated correctly in direct re-arm mode. This
> is because tx_next_dd is a variable used to compute the Tx sw-ring free
> location. Its value will be one more round than the position where the
> next free starts.
>

You are missing a fourth constraint:

4) The application must transmit all received packets immediately, i.e.
QoS queueing and similar is prohibited.

> Current status in this RFC:
> 1) An API is added to allow for mapping a Tx queue to an Rx queue.
> Currently it supports 1:1 mapping.
> 2) The i40e driver is changed to do the direct re-arm of the receive
> side.
> 3) The l3fwd application is hacked to do the mapping for the following
> command (one core, two flows case):
> $ ./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1 \
>   -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
> where:
> Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
> Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
>
> Testing status:
> 1) Tested l3fwd with the above command. The results are as follows:
> -------------------------------------------------------------------
>                Base (with this patch)   Direct re-arm mode enabled
> N1SDP:                  0%                       +14.1%
> Ampere Altra:           0%                       +17.1%
> -------------------------------------------------------------------
> This patch does not affect the performance of normal mode, and with
> direct re-arm mode enabled, performance is improved by 14% - 17% on
> N1SDP and Ampere Altra.
>
> Feedback requested:
> 1) Has anyone done any similar experiments, any lessons learnt?
> 2) Feedback on API
>
> Next steps:
> 1) Update the code to support 1:N (Rx : Tx) mapping
> 2) Automate the configuration in the l3fwd sample application
>
> References:
> [1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
> [2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
> [3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g
>
> Feifei Wang (4):
>   net/i40e: enable direct re-arm mode
>   ethdev: add API for direct re-arm mode
>   net/i40e: add direct re-arm mode internal API
>   examples/l3fwd: give an example for direct rearm mode
>
>  drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
>  drivers/net/i40e/i40e_rxtx.h          |   4 +
>  drivers/net/i40e/i40e_rxtx_vec_neon.c | 149 +++++++++++++++++++++++++-
>  examples/l3fwd/main.c                 |   3 +
>  lib/ethdev/ethdev_driver.h            |  15 +++
>  lib/ethdev/rte_ethdev.c               |  14 +++
>  lib/ethdev/rte_ethdev.h               |  31 ++++++
>  lib/ethdev/version.map                |   3 +
>  8 files changed, 251 insertions(+), 2 deletions(-)
>
> --
> 2.25.1
>
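Side note on the l3fwd example above: if I read the mapping correctly,
the application-side configuration boils down to something like the
sketch below. The function name and signature are my own guesses; I have
not checked the exact API added in patch 2/4, so treat it as pseudo-code.

#include <stdint.h>

/* Hypothetical prototype, only to illustrate the 1:1 mapping; the real
 * API name/signature in the patch may differ. */
int rte_eth_direct_rxrearm_map(uint16_t rx_port, uint16_t rx_queue,
                               uint16_t tx_port, uint16_t tx_queue);

static void
setup_l3fwd_direct_rearm_mapping(void)
{
	/* Port 0 Rx queue 0 re-arms from Port 1 Tx queue 0 ... */
	rte_eth_direct_rxrearm_map(0, 0, 1, 0);
	/* ... and Port 1 Rx queue 0 re-arms from Port 0 Tx queue 0. */
	rte_eth_direct_rxrearm_map(1, 0, 0, 0);
}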
The patch provides a significant performance improvement, but I am
wondering if any real-world applications exist that would use this. Only
a "router on a stick" (i.e. a single-port router) comes to mind, and that
is probably sufficient to call it useful in the real world. Do you have
any other examples to support the usefulness of this patch?

Anyway, the patch doesn't do any harm if unused, and the only performance
cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev
driver. So I don't oppose it.
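For clarity, the branch I am referring to would be of this shape; the
struct and helper names below are placeholders of my own, not the actual
i40e code from the patch:

#include <stdbool.h>

/* Placeholder type and helpers, just to show where the extra branch
 * sits in the Rx re-arm path. */
struct example_rxq {
	bool direct_rxrearm_enable;
};

static void rearm_from_tx_sw_ring(struct example_rxq *rxq) { (void)rxq; }
static void rearm_from_mempool(struct example_rxq *rxq) { (void)rxq; }

static void
example_rxq_rearm(struct example_rxq *rxq)
{
	if (rxq->direct_rxrearm_enable)
		/* Recycle buffers directly from the mapped Tx queue's sw ring. */
		rearm_from_tx_sw_ring(rxq);
	else
		/* Normal path: allocate fresh buffers from the mempool. */
		rearm_from_mempool(rxq);
}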