From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 41EF943D8E; Wed, 10 Apr 2024 12:35:11 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id C4E34402C7; Wed, 10 Apr 2024 12:35:10 +0200 (CEST) Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) by mails.dpdk.org (Postfix) with ESMTP id 37AE2402C5; Wed, 10 Apr 2024 12:35:09 +0200 (CEST) Received: from mail.maildlp.com (unknown [172.18.186.231]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4VDzcD47BRz6K93W; Wed, 10 Apr 2024 18:30:20 +0800 (CST) Received: from frapeml500005.china.huawei.com (unknown [7.182.85.13]) by mail.maildlp.com (Postfix) with ESMTPS id CBFE01400CA; Wed, 10 Apr 2024 18:35:07 +0800 (CST) Received: from frapeml500007.china.huawei.com (7.182.85.172) by frapeml500005.china.huawei.com (7.182.85.13) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.35; Wed, 10 Apr 2024 12:35:07 +0200 Received: from frapeml500007.china.huawei.com ([7.182.85.172]) by frapeml500007.china.huawei.com ([7.182.85.172]) with mapi id 15.01.2507.035; Wed, 10 Apr 2024 12:35:07 +0200 From: Konstantin Ananyev To: =?iso-8859-1?Q?Morten_Br=F8rup?= , "David Marchand" , "dev@dpdk.org" CC: "thomas@monjalon.net" , "ferruh.yigit@amd.com" , "stable@dpdk.org" , Olivier Matz , Jijiang Liu , "Andrew Rybchenko" , Ferruh Yigit , Kaiwen Deng , "qiming.yang@intel.com" , "yidingx.zhou@intel.com" , Aman Singh , "Yuying Zhang" , Thomas Monjalon , "Jerin Jacob" Subject: RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples Thread-Topic: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples Thread-Index: AQHah2gcwh+zxBWPrEOmSW2nVSxrXbFZuegAgAX4BtCAAE6MMIABTCyw Date: Wed, 10 Apr 2024 10:35:07 +0000 Message-ID: <409157f5da3e4c628ca678dd9e2c7957@huawei.com> References: <20240405125039.897933-1-david.marchand@redhat.com> <20240405144604.906695-1-david.marchand@redhat.com> <20240405144604.906695-4-david.marchand@redhat.com> <98CBD80474FA8B44BF855DF32C47DC35E9F36C@smartserver.smartshare.dk> <10b564b42f8d4db387f6302701f24ce3@huawei.com> <98CBD80474FA8B44BF855DF32C47DC35E9F381@smartserver.smartshare.dk> In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F381@smartserver.smartshare.dk> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.48.149.235] Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com] > > Sent: Tuesday, 9 April 2024 15.39 > > > > > > From: David Marchand [mailto:david.marchand@redhat.com] > > > > Sent: Friday, 5 April 2024 16.46 > > > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload > > > > examples. > > > > > > I strongly disagree with this change! > > > > > > It will cause a huge performance degradation for shaping applications= : > > > > > > A packet will be processed and finalized at an output or forwarding > > pipeline stage, where some other fields might also be written, so > > > zeroing e.g. the out_ip checksum at this stage has low cost (no new > > cache misses). > > > > > > Then, the packet might be queued for QoS or similar. > > > > > > If rte_eth_tx_prepare() must be called at the egress pipeline stage, > > it has to write to the packet and cause a cache miss per packet, > > > instead of simply passing on the packet to the NIC hardware. > > > > > > It must be possible to finalize the packet at the output/forwarding > > pipeline stage! > > > > If you can finalize your packet on output/forwarding, then why you > > can't invoke tx_prepare() on the same stage? > > There seems to be some misunderstanding about what tx_prepare() does - > > in fact it doesn't communicate with HW queue (doesn't update TXD ring, > > etc.), what it does - just make changes in mbuf itself. > > Yes, it reads some fields in SW TX queue struct (max number of TXDs per > > packet, etc.), but AFAIK it is safe > > to call tx_prepare() and tx_burst() from different threads. > > At least on implementations I am aware about. > > Just checked the docs - it seems not stated explicitly anywhere, might > > be that's why it causing such misunderstanding. > > > > > > > > Also, how is rte_eth_tx_prepare() supposed to work for cloned packets > > egressing on different NIC hardware? > > > > If you create a clone of full packet (including L2/L3) headers then > > obviously such construction might not > > work properly with tx_prepare() over two different NICs. > > Though In majority of cases you do clone segments with data, while at > > least L2 headers are put into different segments. > > One simple approach would be to keep L3 header in that separate segment= . > > But yes, there is a problem when you'll need to send exactly the same > > packet over different NICs. > > As I remember, for bonding PMD things don't work quite well here - you > > might have a bond over 2 NICs with > > different tx_prepare() and which one to call might be not clear till > > actual PMD tx_burst() is invoked. > > > > > > > > In theory, it might get even worse if we make this opaque instead of > > transparent and standardized: > > > One PMD might reset out_ip checksum to 0x0000, and another PMD might > > reset it to 0xFFFF. > > > > > > > > I can only see one solution: > > > We need to standardize on common minimum requirements for how to > > prepare packets for each TX offload. > > > > If we can make each and every vendor to agree here - that definitely > > will help to simplify things quite a bit. >=20 > An API is more than a function name and parameters. > It also has preconditions and postconditions. >=20 > All major NIC vendors are contributing to DPDK. > It should be possible to reach consensus for reasonable minimum requireme= nts for offloads. > Hardware- and driver-specific exceptions can be documented with the offlo= ad flag, or with rte_eth_rx/tx_burst(), like the note to > rte_eth_rx_burst(): > "Some drivers using vector instructions require that nb_pkts is divisible= by 4 or 8, depending on the driver implementation." If we introduce a rule that everyone supposed to follow and then straightwa= y allow people to have a 'documented exceptions', for me it means like 'no rule' in practice. A 'documented exceptions' approach might work if you have 5 different PMDs = to support, but not when you have 50+. No-one would write an app with possible 10 different exception cases in his= head. Again, with such approach we can forget about backward compatibility. I think we already had this discussion before, my opinion remains the same = here - 'documented exceptions' approach is a way to trouble. =20 > You mention the bonding driver, which is a good example. > The rte_eth_tx_burst() documentation has a note about the API postconditi= on exception for the bonding driver: > "This function must not modify mbufs (including packets data) unless the = refcnt is 1. An exception is the bonding PMD, [...], mbufs > may be modified." For me, what we've done for bonding tx_prepare/tx_burst() is a really bad e= xample. Initial agreement and design choice was that tx_burst() should not modify c= ontents of the packets (that actually was one of the reasons why tx_prepare() was introduced). The only reason I agreed on that exception - because I couldn't come-up wit= h something less uglier. Actually, these problems with bonding PMD made me to start thinking that cu= rrent tx_prepare/tx_burst approach might need to be reconsidered somehow.=20 > > Then we can probably have one common tx_prepare() for all vendors ;) >=20 > Yes, that would be the goal. > More realistically, the ethdev layer could perform the common checks, and= only the non-conforming drivers would have to implement > their specific tweaks. =20 Hmm, but that's what we have right now:=20 - fields in mbuf and packet data that user has to fill correctly and dev sp= ecific tx_prepare().=20 How what you suggest will differ then?=20 And how it will help let say with bonding PMD situation, or with TX-ing of = the same packet over 2 different NICs? > If we don't standardize the meaning of the offload flags, the application= developers cannot trust them! > I'm afraid this is the current situation - application developers either = test with specific NIC hardware, or don't use the offload features. Well, I have used TX offloads through several projects, it worked quite wel= l. Though have to admit, never have to use TX offloads together with our bondi= ng PMD.=20