From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id B8C2043E4C for ; Fri, 12 Apr 2024 17:54:47 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id A9F434068E; Fri, 12 Apr 2024 17:54:47 +0200 (CEST) Received: from dkmailrelay1.smartsharesystems.com (smartserver.smartsharesystems.com [77.243.40.215]) by mails.dpdk.org (Postfix) with ESMTP id E7C7C402A8; Fri, 12 Apr 2024 17:54:44 +0200 (CEST) Received: from smartserver.smartsharesystems.com (smartserver.smartsharesys.local [192.168.4.10]) by dkmailrelay1.smartsharesystems.com (Postfix) with ESMTP id B1A712073C; Fri, 12 Apr 2024 17:54:44 +0200 (CEST) Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Subject: RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples Date: Fri, 12 Apr 2024 17:54:41 +0200 X-MimeOLE: Produced By Microsoft Exchange V6.5 Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F3A0@smartserver.smartshare.dk> In-Reply-To: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples Thread-Index: AQHah2gcwh+zxBWPrEOmSW2nVSxrXbFZuegAgAX4BtCAAE6MMIABTCywgAAbf0CAAzI3UIAAC0aQgAAit/CAAAXFUA== References: <20240405125039.897933-1-david.marchand@redhat.com> <20240405144604.906695-1-david.marchand@redhat.com> <20240405144604.906695-4-david.marchand@redhat.com> <98CBD80474FA8B44BF855DF32C47DC35E9F36C@smartserver.smartshare.dk> <10b564b42f8d4db387f6302701f24ce3@huawei.com> <98CBD80474FA8B44BF855DF32C47DC35E9F381@smartserver.smartshare.dk> <409157f5da3e4c628ca678dd9e2c7957@huawei.com> <98CBD80474FA8B44BF855DF32C47DC35E9F38F@smartserver.smartshare.dk> <52850a78c83445548a0b78bfd04e6f91@huawei.com> <98CBD80474FA8B44BF855DF32C47DC35E9F39F@smartserver.smartshare.dk> From: =?iso-8859-1?Q?Morten_Br=F8rup?= To: "Konstantin Ananyev" , "David Marchand" , Cc: , , , "Olivier Matz" , "Jijiang Liu" , "Andrew Rybchenko" , "Ferruh Yigit" , "Kaiwen Deng" , , , "Aman Singh" , "Yuying Zhang" , "Thomas Monjalon" , "Jerin Jacob" X-BeenThere: stable@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: patches for DPDK stable branches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: stable-bounces@dpdk.org > > > > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx > checksum > > > offload > > > > > > > > > examples. > > > > > > > > > > > > > > > > I strongly disagree with this change! > > > > > > > > > > > > > > > > It will cause a huge performance degradation for shaping > > > applications: > > > > > > > > > > > > > > > > A packet will be processed and finalized at an output or > > > forwarding > > > > > > > pipeline stage, where some other fields might also be > written, > > > so > > > > > > > > zeroing e.g. the out_ip checksum at this stage has low > cost > > > (no new > > > > > > > cache misses). > > > > > > > > > > > > > > > > Then, the packet might be queued for QoS or similar. > > > > > > > > > > > > > > > > If rte_eth_tx_prepare() must be called at the egress > pipeline > > > stage, > > > > > > > it has to write to the packet and cause a cache miss per > packet, > > > > > > > > instead of simply passing on the packet to the NIC > hardware. > > > > > > > > > > > > > > > > It must be possible to finalize the packet at the > > > output/forwarding > > > > > > > pipeline stage! > > > > > > > > > > > > > > If you can finalize your packet on output/forwarding, = then > why > > > you > > > > > > > can't invoke tx_prepare() on the same stage? > > > > > > > There seems to be some misunderstanding about what > tx_prepare() > > > does - > > > > > > > in fact it doesn't communicate with HW queue (doesn't = update > TXD > > > ring, > > > > > > > etc.), what it does - just make changes in mbuf itself. > > > > > > > Yes, it reads some fields in SW TX queue struct (max = number > of > > > TXDs per > > > > > > > packet, etc.), but AFAIK it is safe > > > > > > > to call tx_prepare() and tx_burst() from different = threads. > > > > > > > At least on implementations I am aware about. > > > > > > > Just checked the docs - it seems not stated explicitly > anywhere, > > > might > > > > > > > be that's why it causing such misunderstanding. > > > > > > > > > > > > > > > > > > > > > > > Also, how is rte_eth_tx_prepare() supposed to work for > cloned > > > packets > > > > > > > egressing on different NIC hardware? > > > > > > > > > > > > > > If you create a clone of full packet (including L2/L3) > headers > > > then > > > > > > > obviously such construction might not > > > > > > > work properly with tx_prepare() over two different NICs. > > > > > > > Though In majority of cases you do clone segments with = data, > > > while at > > > > > > > least L2 headers are put into different segments. > > > > > > > One simple approach would be to keep L3 header in that > separate > > > segment. > > > > > > > But yes, there is a problem when you'll need to send = exactly > the > > > same > > > > > > > packet over different NICs. > > > > > > > As I remember, for bonding PMD things don't work quite = well > here > > > - you > > > > > > > might have a bond over 2 NICs with > > > > > > > different tx_prepare() and which one to call might be not > clear > > > till > > > > > > > actual PMD tx_burst() is invoked. > > > > > > > > > > > > > > > > > > > > > > > In theory, it might get even worse if we make this = opaque > > > instead of > > > > > > > transparent and standardized: > > > > > > > > One PMD might reset out_ip checksum to 0x0000, and = another > PMD > > > might > > > > > > > reset it to 0xFFFF. > > > > > > > > > > > > > > > > > > > > > > > I can only see one solution: > > > > > > > > We need to standardize on common minimum requirements = for > how > > > to > > > > > > > prepare packets for each TX offload. > > > > > > > > > > > > > > If we can make each and every vendor to agree here - that > > > definitely > > > > > > > will help to simplify things quite a bit. > > > > > > > > > > > > An API is more than a function name and parameters. > > > > > > It also has preconditions and postconditions. > > > > > > > > > > > > All major NIC vendors are contributing to DPDK. > > > > > > It should be possible to reach consensus for reasonable > minimum > > > requirements > > > > > for offloads. > > > > > > Hardware- and driver-specific exceptions can be documented > with > > > the offload > > > > > flag, or with rte_eth_rx/tx_burst(), like the note to > > > > > > rte_eth_rx_burst(): > > > > > > "Some drivers using vector instructions require that nb_pkts > is > > > divisible by > > > > > 4 or 8, depending on the driver implementation." > > > > > > > > > > If we introduce a rule that everyone supposed to follow and = then > > > straightway > > > > > allow people to have a 'documented exceptions', > > > > > for me it means like 'no rule' in practice. > > > > > A 'documented exceptions' approach might work if you have 5 > > > different PMDs to > > > > > support, but not when you have 50+. > > > > > No-one would write an app with possible 10 different exception > cases > > > in his > > > > > head. > > > > > Again, with such approach we can forget about backward > > > compatibility. > > > > > I think we already had this discussion before, my opinion > remains > > > the same > > > > > here - > > > > > 'documented exceptions' approach is a way to trouble. > > > > > > > > The "minimum requirements" should be the lowest common = denominator > of > > > all NICs. > > > > Exceptions should be extremely few, for outlier NICs that still > want > > > to provide an offload and its driver is unable to live up to the > > > > minimum requirements. > > > > Any exception should require techboard approval. If a NIC/driver > does > > > not support the "minimum requirements" for an offload > > > > feature, it is not allowed to claim support for that offload > feature, > > > or needs to seek approval for an exception. > > > > > > > > As another option for NICs not supporting the minimum = requirements > of > > > an offload feature, we could introduce offload flags with > > > > finer granularity. E.g. one offload flag for "gold standard" TX > > > checksum update (where the packet's checksum field can have any > > > > value), and another offload flag for "silver standard" TX = checksum > > > update (where the packet's checksum field must have a > > > > precomputed value). > > > > > > Actually yes, I was thinking in the same direction - we need some > extra > > > API to allow user to distinguish. > > > Probably we can do something like that: a new API for the ethdev > call > > > that would take as a parameter > > > TX offloads bitmap and in return specify would it need to modify > > > contents of packet to support these > > > offloads or not. > > > Something like: > > > int rte_ethdev_tx_offload_pkt_mod_required(unt64_t tx_offloads) > > > > > > For the majority of the drivers that satisfy these "minimum > > > requirements" corresponding devops > > > entry will be empty and we'll always return 0, otherwise PMD has = to > > > provide a proper devop. > > > Then again, it would be up to the user, to determine can he pass > same > > > packet to 2 different NICs or not. > > > > > > I suppose it is similar to what you were talking about? > > > > I was thinking something more simple: > > > > The NIC exposes its RX and TX offload capabilities to the = application > through the rx/tx_offload_capa and other fields in the > > rte_eth_dev_info structure returned by rte_eth_dev_info_get(). > > > > E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM = flag > set. > > These capability flags (or enums) are mostly undocumented in the = code, > but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM > > capability means that the NIC is able to update the IPv4 header > checksum at egress (on the wire, i.e. without modifying the mbuf or > > packet data), and that the application must set = RTE_MBUF_F_TX_IP_CKSUM > in the mbufs to utilize this offload. > > I would define and document what each capability flag/enum exactly > means, the minimum requirements (as defined by the DPDK > > community) for the driver to claim support for it, and the > requirements for an application to use it. > > For the sake of discussion, let's say that > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update > capability > > (i.e. no requirements to the checksum field in the packet contents). > > If some NIC requires the checksum field in the packet contents to = have > a precomputed value, the NIC would not be allowed to claim > > the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability. > > Such a NIC would need to define and document a new capability, e.g. > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver > > standard" TX checksum update capability. > > In other words: I would encode variations of offload capabilities > directly in the capabilities flags. > > Then we don't need additional APIs to help interpret those > capabilities. >=20 > I understood your intention with different flags, yes it should work = too > I think. > The reason I am not very fond of it - it will require to double > TX_OFFLOAD flags. An additional feature flag is only required if a NIC is not conforming = to the "minimum requirements" of an offload feature, and the techboard = permits introducing a variant of an existing feature. There should be very few additional feature flags for variants - = exceptions only - or the "minimum requirements" are not broad enough to = support the majority of NICs. >=20 > > This way, the application can probe the NIC capabilities to = determine > what can be offloaded, and how to do it. > > > > The application can be designed to: > > 1. use a common packet processing pipeline, utilizing only the = lowest > common capabilities denominator of all detected NICs, or > > 2. use a packet processing pipeline, handling packets differently > according to the capabilities of the involved NICs. > > > > NB: There may be other variations than requiring packet contents to = be > modified, and they might be granular. > > E.g. a NIC might require assistance for TCP/UDP checksum offload, = but > not for IP checksum offload, so a function telling if packet > > contents requires modification would not suffice. >=20 > Why not? > If user plans to use multiple tx offloads provide a bitmask of all of > them as an argument. > Let say for both L3 and L4 cksum offloads it will be something like: > (RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | RTE_ETH_TX_OFFLOAD_UDP_CKSUM | > RTE_ETH_TX_OFFLOAD_TCP_CKSUM) You are partially right; the offload flags can be tested one by one to = determine which header fields in the packet contents need to be updated. I'm assuming the suggested function returns a Boolean; so it doesn't = tell the application how it should modify the packet contents, e.g. if = the header's checksum field must be zeroed or contain the precomputed = checksum of the pseudo header. Alternatively, if it returns a bitfield with flags for different types = of modification, the information could be encoded that way. But then = they might as well be returned as offload capability variation flags in = the rte_eth_dev_info structure's tx_offload_capa field, as I'm = advocating for. >=20 > > E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the > rte_eth_dev_info structure doesn't expose information about the max > > number of segments it can handle. > > > > PS: For backwards compatibility, we might define > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to > support the > > current "minimum requirements", and add > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload. > > > > > > > > > > > For reference, consider RSS, where the feature support flags = have > very > > > high granularity. > > > > > > > > > > > > > > > You mention the bonding driver, which is a good example. > > > > > > The rte_eth_tx_burst() documentation has a note about the = API > > > postcondition > > > > > exception for the bonding driver: > > > > > > "This function must not modify mbufs (including packets = data) > > > unless the > > > > > refcnt is 1. An exception is the bonding PMD, [...], mbufs > > > > > > may be modified." > > > > > > > > > > For me, what we've done for bonding tx_prepare/tx_burst() is a > > > really bad > > > > > example. > > > > > Initial agreement and design choice was that tx_burst() should > not > > > modify > > > > > contents of the packets > > > > > (that actually was one of the reasons why tx_prepare() was > > > introduced). > > > > > The only reason I agreed on that exception - because I = couldn't > > > come-up with > > > > > something less uglier. > > > > > > > > > > Actually, these problems with bonding PMD made me to start > thinking > > > that > > > > > current > > > > > tx_prepare/tx_burst approach might need to be reconsidered > somehow. > > > > > > > > In cases where a preceding call to tx_prepare() is required, how > is it > > > worse modifying the packet in tx_burst() than modifying the > > > > packet in tx_prepare()? > > > > > > > > Both cases violate the postcondition that packets are not = modified > at > > > egress. > > > > > > > > > > > > > > > > Then we can probably have one common tx_prepare() for all > > > vendors ;) > > > > > > > > > > > > Yes, that would be the goal. > > > > > > More realistically, the ethdev layer could perform the = common > > > checks, and > > > > > only the non-conforming drivers would have to implement > > > > > > their specific tweaks. > > > > > > > > > > Hmm, but that's what we have right now: > > > > > - fields in mbuf and packet data that user has to fill = correctly > and > > > dev > > > > > specific tx_prepare(). > > > > > How what you suggest will differ then? > > > > > > > > You're 100 % right here. We could move more checks into the = ethdev > > > layer, specifically checks related to the "minimum > > > > requirements". > > > > > > > > > And how it will help let say with bonding PMD situation, or = with > TX- > > > ing of the > > > > > same packet over 2 different NICs? > > > > > > > > The bonding driver is broken. > > > > It can only be fixed by not violating the egress postcondition = in > > > either tx_burst() or tx_prepare(). > > > > "Minimum requirements" might help doing that. > > > > > > > > > > > > > > > If we don't standardize the meaning of the offload flags, = the > > > application > > > > > developers cannot trust them! > > > > > > I'm afraid this is the current situation - application > developers > > > either > > > > > test with specific NIC hardware, or don't use the offload > features. > > > > > > > > > > Well, I have used TX offloads through several projects, it > worked > > > quite well. > > > > > > > > That is good to hear. > > > > And I don't oppose to that. > > > > > > > > In this discussion, I am worried about the roadmap direction for > DPDK. > > > > I oppose to the concept of requiring calling tx_prepare() before > > > calling tx_burst() when using offload. I think it is conceptually > wrong, > > > > and breaks the egress postcondition. > > > > I propose "minimum requirements" as a better solution. > > > > > > > > > Though have to admit, never have to use TX offloads together > with > > > our bonding > > > > > PMD. > > > > >