From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <stable-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id F304043E70
	for <public@inbox.dpdk.org>; Tue, 16 Apr 2024 13:36:31 +0200 (CEST)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id E6F974029E;
	Tue, 16 Apr 2024 13:36:31 +0200 (CEST)
Received: from frasgout.his.huawei.com (frasgout.his.huawei.com
 [185.176.79.56])
 by mails.dpdk.org (Postfix) with ESMTP id 5810540262;
 Tue, 16 Apr 2024 13:36:30 +0200 (CEST)
Received: from mail.maildlp.com (unknown [172.18.186.31])
 by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4VJhh418qTz6J80b;
 Tue, 16 Apr 2024 19:31:32 +0800 (CST)
Received: from frapeml500007.china.huawei.com (unknown [7.182.85.172])
 by mail.maildlp.com (Postfix) with ESMTPS id 451F31400F4;
 Tue, 16 Apr 2024 19:36:28 +0800 (CST)
Received: from frapeml500007.china.huawei.com (7.182.85.172) by
 frapeml500007.china.huawei.com (7.182.85.172) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 15.1.2507.35; Tue, 16 Apr 2024 13:36:28 +0200
Received: from frapeml500007.china.huawei.com ([7.182.85.172]) by
 frapeml500007.china.huawei.com ([7.182.85.172]) with mapi id 15.01.2507.035;
 Tue, 16 Apr 2024 13:36:28 +0200
From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: Konstantin Ananyev <konstantin.ananyev@huawei.com>,
 =?iso-8859-1?Q?Morten_Br=F8rup?= <mb@smartsharesystems.com>, David Marchand
 <david.marchand@redhat.com>, "dev@dpdk.org" <dev@dpdk.org>
CC: "thomas@monjalon.net" <thomas@monjalon.net>, "ferruh.yigit@amd.com"
 <ferruh.yigit@amd.com>, "stable@dpdk.org" <stable@dpdk.org>, Olivier Matz
 <olivier.matz@6wind.com>, Jijiang Liu <jijiang.liu@intel.com>, "Andrew
 Rybchenko" <andrew.rybchenko@oktetlabs.ru>, Ferruh Yigit
 <ferruh.yigit@amd.com>, Kaiwen Deng <kaiwenx.deng@intel.com>,
 "qiming.yang@intel.com" <qiming.yang@intel.com>, "yidingx.zhou@intel.com"
 <yidingx.zhou@intel.com>, Aman Singh <aman.deep.singh@intel.com>, "Yuying
 Zhang" <yuying.zhang@intel.com>, Thomas Monjalon <thomas@monjalon.net>,
 "Jerin Jacob" <jerinj@marvell.com>
Subject: RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
Thread-Topic: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
Thread-Index: AQHah2gcwh+zxBWPrEOmSW2nVSxrXbFZuegAgAX4BtCAAE6MMIABTCywgAAbf0CAAzI3UIAAC0aQgAAit/CAAAXFUIAF1aYQgAAgzWA=
Date: Tue, 16 Apr 2024 11:36:27 +0000
Message-ID: <b4b19af1b13c436ebb13cda5ced25470@huawei.com>
References: <20240405125039.897933-1-david.marchand@redhat.com>
 <20240405144604.906695-1-david.marchand@redhat.com>
 <20240405144604.906695-4-david.marchand@redhat.com>
 <98CBD80474FA8B44BF855DF32C47DC35E9F36C@smartserver.smartshare.dk>
 <10b564b42f8d4db387f6302701f24ce3@huawei.com>
 <98CBD80474FA8B44BF855DF32C47DC35E9F381@smartserver.smartshare.dk>
 <409157f5da3e4c628ca678dd9e2c7957@huawei.com>
 <98CBD80474FA8B44BF855DF32C47DC35E9F38F@smartserver.smartshare.dk>
 <52850a78c83445548a0b78bfd04e6f91@huawei.com>
 <98CBD80474FA8B44BF855DF32C47DC35E9F39F@smartserver.smartshare.dk>
 <be266bb11e524098a33fc2834c4e6993@huawei.com>
 <98CBD80474FA8B44BF855DF32C47DC35E9F3A0@smartserver.smartshare.dk>
 <3f8214e0bcb448338cf2679f753a983d@huawei.com>
In-Reply-To: <3f8214e0bcb448338cf2679f753a983d@huawei.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.206.138.42]
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-BeenThere: stable@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: patches for DPDK stable branches <stable.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/stable>,
 <mailto:stable-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/stable/>
List-Post: <mailto:stable@dpdk.org>
List-Help: <mailto:stable-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/stable>,
 <mailto:stable-request@dpdk.org?subject=subscribe>
Errors-To: stable-bounces@dpdk.org


> > > > > > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
> > > > > > > > > > > examples.
> > > > > > > > > >
> > > > > > > > > > I strongly disagree with this change!
> > > > > > > > > >
> > > > > > > > > > > It will cause a huge performance degradation for shaping applications:
> > > > > > > > > >
> > > > > > > > > > > A packet will be processed and finalized at an output or forwarding
> > > > > > > > > > > pipeline stage, where some other fields might also be written, so
> > > > > > > > > > > zeroing e.g. the out_ip checksum at this stage has low cost (no new
> > > > > > > > > > > cache misses).
> > > > > > > > > >
> > > > > > > > > > Then, the packet might be queued for QoS or similar.
> > > > > > > > > >
> > > > > > > > > > > If rte_eth_tx_prepare() must be called at the egress pipeline stage,
> > > > > > > > > > > it has to write to the packet and cause a cache miss per packet,
> > > > > > > > > > > instead of simply passing on the packet to the NIC hardware.
> > > > > > > > > >
> > > > > > > > > > > It must be possible to finalize the packet at the output/forwarding
> > > > > > > > > > > pipeline stage!
> > > > > > > > >
> > > > > > > > > > If you can finalize your packet on output/forwarding, then why you
> > > > > > > > > > can't invoke tx_prepare() on the same stage?
> > > > > > > > > > There seems to be some misunderstanding about what tx_prepare() does -
> > > > > > > > > > in fact it doesn't communicate with HW queue (doesn't update TXD ring,
> > > > > > > > > > etc.), what it does - just make changes in mbuf itself.
> > > > > > > > > > Yes, it reads some fields in SW TX queue struct (max number of TXDs per
> > > > > > > > > > packet, etc.), but AFAIK it is safe
> > > > > > > > > > to call tx_prepare() and tx_burst() from different threads.
> > > > > > > > > > At least on implementations I am aware about.
> > > > > > > > > > Just checked the docs - it seems not stated explicitly anywhere, might
> > > > > > > > > > be that's why it causing such misunderstanding.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Also, how is rte_eth_tx_prepare() supposed to work for cloned packets
> > > > > > > > > > > egressing on different NIC hardware?
> > > > > > > > >
> > > > > > > > > > If you create a clone of full packet (including L2/L3) headers then
> > > > > > > > > > obviously such construction might not
> > > > > > > > > > work properly with tx_prepare() over two different NICs.
> > > > > > > > > > Though In majority of cases you do clone segments with data, while at
> > > > > > > > > > least L2 headers are put into different segments.
> > > > > > > > > > One simple approach would be to keep L3 header in that separate segment.
> > > > > > > > > > But yes, there is a problem when you'll need to send exactly the same
> > > > > > > > > > packet over different NICs.
> > > > > > > > > > As I remember, for bonding PMD things don't work quite well here - you
> > > > > > > > > > might have a bond over 2 NICs with
> > > > > > > > > > different tx_prepare() and which one to call might be not clear till
> > > > > > > > > > actual PMD tx_burst() is invoked.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > In theory, it might get even worse if we make this opaque instead of
> > > > > > > > > > > transparent and standardized:
> > > > > > > > > > > One PMD might reset out_ip checksum to 0x0000, and another PMD might
> > > > > > > > > > > reset it to 0xFFFF.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I can only see one solution:
> > > > > > > > > > > We need to standardize on common minimum requirements for how to
> > > > > > > > > > > prepare packets for each TX offload.
> > > > > > > > >
> > > > > > > > > > If we can make each and every vendor to agree here - that definitely
> > > > > > > > > > will help to simplify things quite a bit.
> > > > > > > >
> > > > > > > > An API is more than a function name and parameters.
> > > > > > > > It also has preconditions and postconditions.
> > > > > > > >
> > > > > > > > All major NIC vendors are contributing to DPDK.
> > > > > > > > > It should be possible to reach consensus for reasonable minimum
> > > > > > > > > requirements for offloads.
> > > > > > > > > Hardware- and driver-specific exceptions can be documented with the
> > > > > > > > > offload flag, or with rte_eth_rx/tx_burst(), like the note to
> > > > > > > > > rte_eth_rx_burst():
> > > > > > > > > "Some drivers using vector instructions require that nb_pkts is
> > > > > > > > > divisible by 4 or 8, depending on the driver implementation."
> > > > > > >
> > > > > > > > If we introduce a rule that everyone supposed to follow and then
> > > > > > > > straightway allow people to have a 'documented exceptions',
> > > > > > > > for me it means like 'no rule' in practice.
> > > > > > > > A 'documented exceptions' approach might work if you have 5 different
> > > > > > > > PMDs to support, but not when you have 50+.
> > > > > > > > No-one would write an app with possible 10 different exception cases in
> > > > > > > > his head.
> > > > > > > > Again, with such approach we can forget about backward compatibility.
> > > > > > > > I think we already had this discussion before, my opinion remains the
> > > > > > > > same here -
> > > > > > > > 'documented exceptions' approach is a way to trouble.
> > > > > >
> > > > > > > The "minimum requirements" should be the lowest common denominator of
> > > > > > > all NICs.
> > > > > > > Exceptions should be extremely few, for outlier NICs that still want to
> > > > > > > provide an offload and its driver is unable to live up to the
> > > > > > > minimum requirements.
> > > > > > > Any exception should require techboard approval. If a NIC/driver does
> > > > > > > not support the "minimum requirements" for an offload
> > > > > > > feature, it is not allowed to claim support for that offload feature,
> > > > > > > or needs to seek approval for an exception.
> > > > > >
> > > > > > > As another option for NICs not supporting the minimum requirements of
> > > > > > > an offload feature, we could introduce offload flags with
> > > > > > > finer granularity. E.g. one offload flag for "gold standard" TX
> > > > > > > checksum update (where the packet's checksum field can have any
> > > > > > > value), and another offload flag for "silver standard" TX checksum
> > > > > > > update (where the packet's checksum field must have a
> > > > > > > precomputed value).
> > > > >
> > > > > > Actually yes, I was thinking in the same direction - we need some extra
> > > > > > API to allow user to distinguish.
> > > > > > Probably we can do something like that: a new API for the ethdev call
> > > > > > that would take as a parameter
> > > > > > TX offloads bitmap and in return specify would it need to modify
> > > > > > contents of packet to support these
> > > > > > offloads or not.
> > > > > > Something like:
> > > > > > int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)
> > > > >
> > > > > > For the majority of the drivers that satisfy these "minimum
> > > > > > requirements" corresponding devops
> > > > > > entry will be empty and we'll always return 0, otherwise PMD has to
> > > > > > provide a proper devop.
> > > > > > Then again, it would be up to the user, to determine can he pass same
> > > > > > packet to 2 different NICs or not.
> > > > >
> > > > > I suppose it is similar to what you were talking about?
> > > >
> > > > I was thinking something more simple:
> > > >
> > > > The NIC exposes its RX and TX offload capabilities to the applicati=
on
> > > through the rx/tx_offload_capa and other fields in the
> > > > rte_eth_dev_info structure returned by rte_eth_dev_info_get().
> > > >
> > > > E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM f=
lag
> > > set.
> > > > These capability flags (or enums) are mostly undocumented in the co=
de,
> > > but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM
> > > > capability means that the NIC is able to update the IPv4 header
> > > checksum at egress (on the wire, i.e. without modifying the mbuf or
> > > > packet data), and that the application must set RTE_MBUF_F_TX_IP_CK=
SUM
> > > in the mbufs to utilize this offload.
> > > > I would define and document what each capability flag/enum exactly
> > > means, the minimum requirements (as defined by the DPDK
> > > > community) for the driver to claim support for it, and the
> > > requirements for an application to use it.
> > > > For the sake of discussion, let's say that
> > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum updat=
e
> > > capability
> > > > (i.e. no requirements to the checksum field in the packet contents)=
.
> > > > If some NIC requires the checksum field in the packet contents to h=
ave
> > > a precomputed value, the NIC would not be allowed to claim
> > > > the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
> > > > Such a NIC would need to define and document a new capability, e.g.
> > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver
> > > > standard" TX checksum update capability.
> > > > In other words: I would encode variations of offload capabilities
> > > directly in the capabilities flags.
> > > > Then we don't need additional APIs to help interpret those
> > > capabilities.
> > >
> > > > I understood your intention with different flags, yes it should work too
> > > > I think.
> > > > The reason I am not very fond of it - it will require to double
> > > > TX_OFFLOAD flags.
> >
> > An additional feature flag is only required if a NIC is not conforming
> > to the "minimum requirements" of an offload feature, and the
> > techboard permits introducing a variant of an existing feature.
> > There should be very few additional feature flags for variants -
> > exceptions only - or the "minimum requirements" are not broad
> > enough to support the majority of NICs.
>
> Ok, so you suggest to group all existing reqs plus what all current
> tx_prepare() do into "minimum requirements"?
> So with current drivers in place we wouldn't need these new flags, but
> we'll reserve such opportunity.
> That might work, if there are no contradictory requirements in current
> PMDs, and PMDs maintainers with
> less reqs will agree with these 'extra' stuff.

Just to check how easy or hard it would be to reach a consensus, I compiled
a list of the mbuf changes done by different PMDs in tx_prepare(). See below.
It might not be fully correct or complete. PMD maintainers, feel free to
update it if I missed something.
As it looks to me:
if we go the way you suggest, then hns3 and virtio will most likely become
'second class citizens' - they will need special offload flags.
Plus, either all PMDs that now set tx_prepare() to NULL will have to agree
to require rte_net_intel_cksum_prepare() to be done, or all Intel PMDs and
a few others will also be downgraded to 'second class'.
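
To make the 'special offload flags' option a bit more concrete, here is a
minimal sketch of how an application might pick a checksum strategy from
tx_offload_capa. It is plain C with stand-in constants, not DPDK code:
TX_OFFLOAD_IPV4_CKSUM_ASSISTED is the hypothetical 'silver standard' flag
from the discussion above, and pick_ipv4_cksum_mode() and can_share_mbuf()
are made-up helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for illustration only; the real flag lives in rte_ethdev.h. */
#define TX_OFFLOAD_IPV4_CKSUM          (UINT64_C(1) << 1)
/* Hypothetical "silver standard" variant; NOT an existing DPDK flag. */
#define TX_OFFLOAD_IPV4_CKSUM_ASSISTED (UINT64_C(1) << 40)

enum cksum_mode {
	CKSUM_SW,            /* no offload: compute the checksum in software */
	CKSUM_HW_NO_PKT_MOD, /* "gold": checksum field may hold any value */
	CKSUM_HW_PKT_MOD     /* "silver": packet must be prepared first */
};

/* Choose how to handle the IPv4 checksum for a port's capabilities. */
enum cksum_mode
pick_ipv4_cksum_mode(uint64_t tx_offload_capa)
{
	if (tx_offload_capa & TX_OFFLOAD_IPV4_CKSUM)
		return CKSUM_HW_NO_PKT_MOD;
	if (tx_offload_capa & TX_OFFLOAD_IPV4_CKSUM_ASSISTED)
		return CKSUM_HW_PKT_MOD;
	return CKSUM_SW;
}

/* True if the same unmodified packet could be sent on both ports. */
bool
can_share_mbuf(uint64_t capa_a, uint64_t capa_b)
{
	return pick_ipv4_cksum_mode(capa_a) == CKSUM_HW_NO_PKT_MOD &&
	       pick_ipv4_cksum_mode(capa_b) == CKSUM_HW_NO_PKT_MOD;
}
```

With such a split, only ports reporting the 'gold' flag on both sides would
be safe targets for one cloned packet; any other combination would need
per-port preparation.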

PMD: atlantic
MOD: rte_net_intel_cksum_prepare()
/* i.e. ipv4_hdr->hdr_checksum = 0;
 *      (tcp|udp)_hdr->cksum = rte_ipv(4|6)_phdr_cksum(...); */

PMD: cpfl/idpf
MOD: none

PMD: em/igb/igc/fm10k/i40e/iavf/ice/ixgbe
MOD: rte_net_intel_cksum_prepare()

PMD: enic
MOD: rte_net_intel_cksum_prepare()

PMD: hns3
MOD: rte_net_intel_cksum_prepare() plus some extra:
        /*
         * A UDP packet with the same dst_port as VXLAN\VXLAN_GPE\GENEVE will
         * be recognized as a tunnel packet in HW. In this case, if UDP CKSUM
         * offload is set and the tunnel mask has not been set, the CKSUM will
         * be wrong since the header length is wrong and driver should complete
         * the CKSUM to avoid CKSUM error.
         */

PMD: ionic
MOD: none

PMD: ngbe
MOD: rte_net_intel_cksum_prepare()

PMD: qede
MOD: none

PMD: txgbe
MOD: rte_net_intel_cksum_prepare()

PMD: virtio
MOD: rte_net_intel_cksum_prepare() plus some extra:
           - for RTE_MBUF_F_TX_TCP_SEG: virtio_tso_fix_cksum()
           - for RTE_MBUF_F_TX_VLAN: rte_vlan_insert()

PMD: vmxnet3
MOD: rte_net_intel_cksum_prepare()

All other PMDs in our main tree set tx_prepare to NULL.
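
For reference, a rough sketch of the kind of mbuf change that the
rte_net_intel_cksum_prepare() entries above stand for, limited to plain
IPv4/TCP without TSO: zero the IPv4 header checksum and seed the TCP
checksum with the pseudo-header sum (un-complemented, which is my
understanding of what rte_ipv4_phdr_cksum() returns). It works on a raw
byte buffer instead of an mbuf and calls no DPDK API;
sketch_cksum_prepare_ipv4_tcp() and its helpers are hypothetical names, and
the real function also covers IPv6, UDP, tunnels and TSO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Read/write a big-endian 16-bit value, independent of host byte order. */
static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

static void wr16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

/*
 * ip points at an IPv4 header immediately followed by a TCP header.
 * Modifies the packet in place and returns the value stored in the
 * TCP checksum field.
 */
uint16_t sketch_cksum_prepare_ipv4_tcp(uint8_t *ip)
{
	size_t ihl = (size_t)(ip[0] & 0x0f) * 4;
	uint16_t tot_len = rd16(ip + 2);
	uint32_t sum = 0;
	uint16_t phdr;

	assert((ip[0] >> 4) == 4 && ihl >= 20 && tot_len >= ihl);
	assert(ip[9] == 6); /* this sketch handles TCP only */

	/* ipv4_hdr->hdr_checksum = 0; the NIC fills it in on the wire. */
	wr16(ip + 10, 0);

	/* Pseudo-header: src addr, dst addr, zero + protocol, L4 length. */
	sum += rd16(ip + 12);
	sum += rd16(ip + 14);
	sum += rd16(ip + 16);
	sum += rd16(ip + 18);
	sum += ip[9];
	sum += (uint16_t)(tot_len - ihl);

	/* tcp_hdr->cksum = pseudo-header sum (offset 16 in the TCP header). */
	phdr = fold16(sum);
	wr16(ip + ihl + 16, phdr);
	return phdr;
}
```

The point for this thread is only that both writes land in packet data,
which is why the same buffer cannot be assumed valid for a PMD with
different expectations.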