From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: "Morten Brørup" <mb@smartsharesystems.com>,
"David Marchand" <david.marchand@redhat.com>,
"dev@dpdk.org" <dev@dpdk.org>
Cc: "thomas@monjalon.net" <thomas@monjalon.net>,
"ferruh.yigit@amd.com" <ferruh.yigit@amd.com>,
"stable@dpdk.org" <stable@dpdk.org>,
Olivier Matz <olivier.matz@6wind.com>,
Jijiang Liu <jijiang.liu@intel.com>,
"Andrew Rybchenko" <andrew.rybchenko@oktetlabs.ru>,
Ferruh Yigit <ferruh.yigit@amd.com>,
Kaiwen Deng <kaiwenx.deng@intel.com>,
"qiming.yang@intel.com" <qiming.yang@intel.com>,
"yidingx.zhou@intel.com" <yidingx.zhou@intel.com>,
Aman Singh <aman.deep.singh@intel.com>,
"Yuying Zhang" <yuying.zhang@intel.com>,
Thomas Monjalon <thomas@monjalon.net>,
"Jerin Jacob" <jerinj@marvell.com>
Subject: RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
Date: Fri, 12 Apr 2024 15:17:23 +0000 [thread overview]
Message-ID: <be266bb11e524098a33fc2834c4e6993@huawei.com> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F39F@smartserver.smartshare.dk>
>
> > > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum
> > offload
> > > > > > > > examples.
> > > > > > >
> > > > > > > I strongly disagree with this change!
> > > > > > >
> > > > > > > It will cause a huge performance degradation for shaping
> > applications:
> > > > > > >
> > > > > > > A packet will be processed and finalized at an output or
> > forwarding
> > > > > > pipeline stage, where some other fields might also be written,
> > so
> > > > > > > zeroing e.g. the out_ip checksum at this stage has low cost
> > (no new
> > > > > > cache misses).
> > > > > > >
> > > > > > > Then, the packet might be queued for QoS or similar.
> > > > > > >
> > > > > > > If rte_eth_tx_prepare() must be called at the egress pipeline
> > stage,
> > > > > > it has to write to the packet and cause a cache miss per packet,
> > > > > > > instead of simply passing on the packet to the NIC hardware.
> > > > > > >
> > > > > > > It must be possible to finalize the packet at the
> > output/forwarding
> > > > > > pipeline stage!
> > > > > >
> > > > > > If you can finalize your packet on output/forwarding, then why
> > you
> > > > > > can't invoke tx_prepare() on the same stage?
> > > > > > There seems to be some misunderstanding about what tx_prepare()
> > does -
> > > > > > in fact it doesn't communicate with HW queue (doesn't update TXD
> > ring,
> > > > > > etc.), what it does - just make changes in mbuf itself.
> > > > > > Yes, it reads some fields in SW TX queue struct (max number of
> > TXDs per
> > > > > > packet, etc.), but AFAIK it is safe
> > > > > > to call tx_prepare() and tx_burst() from different threads.
> > > > > > At least on implementations I am aware about.
> > > > > > Just checked the docs - it seems not stated explicitly anywhere,
> > might
> > > > > > be that's why it causing such misunderstanding.
> > > > > >
> > > > > > >
> > > > > > > Also, how is rte_eth_tx_prepare() supposed to work for cloned
> > packets
> > > > > > egressing on different NIC hardware?
> > > > > >
> > > > > > If you create a clone of full packet (including L2/L3) headers
> > then
> > > > > > obviously such construction might not
> > > > > > work properly with tx_prepare() over two different NICs.
> > > > > > Though In majority of cases you do clone segments with data,
> > while at
> > > > > > least L2 headers are put into different segments.
> > > > > > One simple approach would be to keep L3 header in that separate
> > segment.
> > > > > > But yes, there is a problem when you'll need to send exactly the
> > same
> > > > > > packet over different NICs.
> > > > > > As I remember, for bonding PMD things don't work quite well here
> > - you
> > > > > > might have a bond over 2 NICs with
> > > > > > different tx_prepare() and which one to call might be not clear
> > till
> > > > > > actual PMD tx_burst() is invoked.
> > > > > >
> > > > > > >
> > > > > > > In theory, it might get even worse if we make this opaque
> > instead of
> > > > > > transparent and standardized:
> > > > > > > One PMD might reset out_ip checksum to 0x0000, and another PMD
> > might
> > > > > > reset it to 0xFFFF.
> > > > > >
> > > > > > >
> > > > > > > I can only see one solution:
> > > > > > > We need to standardize on common minimum requirements for how
> > to
> > > > > > prepare packets for each TX offload.
> > > > > >
> > > > > > If we can make each and every vendor to agree here - that
> > definitely
> > > > > > will help to simplify things quite a bit.
> > > > >
> > > > > An API is more than a function name and parameters.
> > > > > It also has preconditions and postconditions.
> > > > >
> > > > > All major NIC vendors are contributing to DPDK.
> > > > > It should be possible to reach consensus for reasonable minimum
> > requirements
> > > > for offloads.
> > > > > Hardware- and driver-specific exceptions can be documented with
> > the offload
> > > > flag, or with rte_eth_rx/tx_burst(), like the note to
> > > > > rte_eth_rx_burst():
> > > > > "Some drivers using vector instructions require that nb_pkts is
> > divisible by
> > > > 4 or 8, depending on the driver implementation."
> > > >
> > > > If we introduce a rule that everyone supposed to follow and then
> > straightway
> > > > allow people to have a 'documented exceptions',
> > > > for me it means like 'no rule' in practice.
> > > > A 'documented exceptions' approach might work if you have 5
> > different PMDs to
> > > > support, but not when you have 50+.
> > > > No-one would write an app with possible 10 different exception cases
> > in his
> > > > head.
> > > > Again, with such approach we can forget about backward
> > compatibility.
> > > > I think we already had this discussion before, my opinion remains
> > the same
> > > > here -
> > > > 'documented exceptions' approach is a way to trouble.
> > >
> > > The "minimum requirements" should be the lowest common denominator of
> > all NICs.
> > > Exceptions should be extremely few, for outlier NICs that still want
> > to provide an offload and its driver is unable to live up to the
> > > minimum requirements.
> > > Any exception should require techboard approval. If a NIC/driver does
> > not support the "minimum requirements" for an offload
> > > feature, it is not allowed to claim support for that offload feature,
> > or needs to seek approval for an exception.
> > >
> > > As another option for NICs not supporting the minimum requirements of
> > an offload feature, we could introduce offload flags with
> > > finer granularity. E.g. one offload flag for "gold standard" TX
> > checksum update (where the packet's checksum field can have any
> > > value), and another offload flag for "silver standard" TX checksum
> > update (where the packet's checksum field must have a
> > > precomputed value).
> >
> > Actually yes, I was thinking in the same direction - we need some extra
> > API to allow user to distinguish.
> > Probably we can do something like that: a new API for the ethdev call
> > that would take as a parameter
> > TX offloads bitmap and in return specify would it need to modify
> > contents of packet to support these
> > offloads or not.
> > Something like:
> > int rte_ethdev_tx_offload_pkt_mod_required(unt64_t tx_offloads)
> >
> > For the majority of the drivers that satisfy these "minimum
> > requirements" corresponding devops
> > entry will be empty and we'll always return 0, otherwise PMD has to
> > provide a proper devop.
> > Then again, it would be up to the user, to determine can he pass same
> > packet to 2 different NICs or not.
> >
> > I suppose it is similar to what you were talking about?
>
> I was thinking something more simple:
>
> The NIC exposes its RX and TX offload capabilities to the application through the rx/tx_offload_capa and other fields in the
> rte_eth_dev_info structure returned by rte_eth_dev_info_get().
>
> E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag set.
> These capability flags (or enums) are mostly undocumented in the code, but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM
> capability means that the NIC is able to update the IPv4 header checksum at egress (on the wire, i.e. without modifying the mbuf or
> packet data), and that the application must set RTE_MBUF_F_TX_IP_CKSUM in the mbufs to utilize this offload.
> I would define and document what each capability flag/enum exactly means, the minimum requirements (as defined by the DPDK
> community) for the driver to claim support for it, and the requirements for an application to use it.
> For the sake of discussion, let's say that RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update capability
> (i.e. no requirements to the checksum field in the packet contents).
> If some NIC requires the checksum field in the packet contents to have a precomputed value, the NIC would not be allowed to claim
> the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
> Such a NIC would need to define and document a new capability, e.g. RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver
> standard" TX checksum update capability.
> In other words: I would encode variations of offload capabilities directly in the capabilities flags.
> Then we don't need additional APIs to help interpret those capabilities.
I understood your intention with different flags, yes it should work too I think.
The reason I am not very fond of it - it will require to double TX_OFFLOAD flags.
> This way, the application can probe the NIC capabilities to determine what can be offloaded, and how to do it.
>
> The application can be designed to:
> 1. use a common packet processing pipeline, utilizing only the lowest common capabilities denominator of all detected NICs, or
> 2. use a packet processing pipeline, handling packets differently according to the capabilities of the involved NICs.
>
> NB: There may be other variations than requiring packet contents to be modified, and they might be granular.
> E.g. a NIC might require assistance for TCP/UDP checksum offload, but not for IP checksum offload, so a function telling if packet
> contents requires modification would not suffice.
Why not?
If user plans to use multiple tx offloads provide a bitmask of all of them as an argument.
Let say for both L3 and L4 cksum offloads it will be something like:
(RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | RTE_ETH_TX_OFFLOAD_UDP_CKSUM | RTE_ETH_TX_OFFLOAD_TCP_CKSUM)
> E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the rte_eth_dev_info structure doesn't expose information about the max
> number of segments it can handle.
>
> PS: For backwards compatibility, we might define RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to support the
> current "minimum requirements", and add RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload.
>
>
> >
> > > For reference, consider RSS, where the feature support flags have very
> > high granularity.
> > >
> > > >
> > > > > You mention the bonding driver, which is a good example.
> > > > > The rte_eth_tx_burst() documentation has a note about the API
> > postcondition
> > > > exception for the bonding driver:
> > > > > "This function must not modify mbufs (including packets data)
> > unless the
> > > > refcnt is 1. An exception is the bonding PMD, [...], mbufs
> > > > > may be modified."
> > > >
> > > > For me, what we've done for bonding tx_prepare/tx_burst() is a
> > really bad
> > > > example.
> > > > Initial agreement and design choice was that tx_burst() should not
> > modify
> > > > contents of the packets
> > > > (that actually was one of the reasons why tx_prepare() was
> > introduced).
> > > > The only reason I agreed on that exception - because I couldn't
> > come-up with
> > > > something less uglier.
> > > >
> > > > Actually, these problems with bonding PMD made me to start thinking
> > that
> > > > current
> > > > tx_prepare/tx_burst approach might need to be reconsidered somehow.
> > >
> > > In cases where a preceding call to tx_prepare() is required, how is it
> > worse modifying the packet in tx_burst() than modifying the
> > > packet in tx_prepare()?
> > >
> > > Both cases violate the postcondition that packets are not modified at
> > egress.
> > >
> > > >
> > > > > > Then we can probably have one common tx_prepare() for all
> > vendors ;)
> > > > >
> > > > > Yes, that would be the goal.
> > > > > More realistically, the ethdev layer could perform the common
> > checks, and
> > > > only the non-conforming drivers would have to implement
> > > > > their specific tweaks.
> > > >
> > > > Hmm, but that's what we have right now:
> > > > - fields in mbuf and packet data that user has to fill correctly and
> > dev
> > > > specific tx_prepare().
> > > > How what you suggest will differ then?
> > >
> > > You're 100 % right here. We could move more checks into the ethdev
> > layer, specifically checks related to the "minimum
> > > requirements".
> > >
> > > > And how it will help let say with bonding PMD situation, or with TX-
> > ing of the
> > > > same packet over 2 different NICs?
> > >
> > > The bonding driver is broken.
> > > It can only be fixed by not violating the egress postcondition in
> > either tx_burst() or tx_prepare().
> > > "Minimum requirements" might help doing that.
> > >
> > > >
> > > > > If we don't standardize the meaning of the offload flags, the
> > application
> > > > developers cannot trust them!
> > > > > I'm afraid this is the current situation - application developers
> > either
> > > > test with specific NIC hardware, or don't use the offload features.
> > > >
> > > > Well, I have used TX offloads through several projects, it worked
> > quite well.
> > >
> > > That is good to hear.
> > > And I don't oppose to that.
> > >
> > > In this discussion, I am worried about the roadmap direction for DPDK.
> > > I oppose to the concept of requiring calling tx_prepare() before
> > calling tx_burst() when using offload. I think it is conceptually wrong,
> > > and breaks the egress postcondition.
> > > I propose "minimum requirements" as a better solution.
> > >
> > > > Though have to admit, never have to use TX offloads together with
> > our bonding
> > > > PMD.
> > > >
next prev parent reply other threads:[~2024-04-12 15:17 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-04-05 12:49 [PATCH 0/8] Fix outer UDP checksum for Intel nics David Marchand
2024-04-05 12:49 ` [PATCH 1/8] net/ice: fix check for outer UDP checksum offload David Marchand
2024-04-05 14:19 ` David Marchand
2024-04-05 12:49 ` [PATCH 2/8] net/ice: enhance debug when HW fails to transmit David Marchand
2024-04-05 12:49 ` [PATCH 3/8] mbuf: fix Tx checksum offload examples David Marchand
2024-04-05 12:49 ` [PATCH 4/8] app/testpmd: fix outer IP checksum offload David Marchand
2024-04-05 12:49 ` [PATCH 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-05 12:49 ` [PATCH 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-05 12:49 ` [PATCH 7/8] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
2024-04-05 12:49 ` [PATCH 8/8] net: clear outer UDP checksum in Intel prepare helper David Marchand
2024-04-05 14:45 ` [PATCH v2 0/8] Fix outer UDP checksum for Intel nics David Marchand
2024-04-05 14:45 ` [PATCH v2 1/8] net/ice: fix check for outer UDP checksum offload David Marchand
2024-04-08 15:05 ` Bruce Richardson
2024-04-05 14:45 ` [PATCH v2 2/8] net/ice: enhance debug when HW fails to transmit David Marchand
2024-04-08 15:23 ` Bruce Richardson
2024-04-11 8:30 ` David Marchand
2024-04-11 8:42 ` Bruce Richardson
2024-04-15 8:32 ` David Marchand
2024-04-05 14:45 ` [PATCH v2 3/8] mbuf: fix Tx checksum offload examples David Marchand
2024-04-05 16:20 ` Morten Brørup
2024-04-08 10:12 ` David Marchand
2024-04-09 13:38 ` Konstantin Ananyev
2024-04-09 14:44 ` Morten Brørup
2024-04-10 10:35 ` Konstantin Ananyev
2024-04-10 12:20 ` Morten Brørup
2024-04-12 12:46 ` Konstantin Ananyev
2024-04-12 14:44 ` Morten Brørup
2024-04-12 15:17 ` Konstantin Ananyev [this message]
2024-04-12 15:54 ` Morten Brørup
2024-04-16 9:16 ` Konstantin Ananyev
2024-04-16 11:36 ` Konstantin Ananyev
2024-04-15 15:07 ` Ferruh Yigit
2024-04-16 7:14 ` Morten Brørup
2024-04-16 9:26 ` Konstantin Ananyev
2024-04-05 14:45 ` [PATCH v2 4/8] app/testpmd: fix outer IP checksum offload David Marchand
2024-04-05 14:45 ` [PATCH v2 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-05 14:46 ` [PATCH v2 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-05 14:46 ` [PATCH v2 7/8] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
2024-04-05 14:46 ` [PATCH v2 8/8] net: clear outer UDP checksum in Intel prepare helper David Marchand
2024-04-18 8:20 ` [PATCH v3 0/7] Fix outer UDP checksum for Intel nics David Marchand
2024-04-18 8:20 ` [PATCH v3 1/7] net/ice: fix check for outer UDP checksum offload David Marchand
2024-04-19 11:46 ` Morten Brørup
2024-04-18 8:20 ` [PATCH v3 2/7] net/ice: enhance debug when HW fails to transmit David Marchand
2024-04-18 8:20 ` [PATCH v3 3/7] app/testpmd: fix outer IP checksum offload David Marchand
2024-06-11 18:25 ` Ferruh Yigit
2024-04-18 8:20 ` [PATCH v3 4/7] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-18 8:20 ` [PATCH v3 5/7] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-18 8:20 ` [PATCH v3 6/7] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
2024-04-18 8:20 ` [PATCH v3 7/7] net: clear outer UDP checksum in Intel prepare helper David Marchand
2024-05-03 13:10 ` [PATCH v3 0/7] Fix outer UDP checksum for Intel nics David Marchand
2024-05-27 18:06 ` Ali Alnubani
2024-06-11 21:10 ` Ferruh Yigit
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=be266bb11e524098a33fc2834c4e6993@huawei.com \
--to=konstantin.ananyev@huawei.com \
--cc=aman.deep.singh@intel.com \
--cc=andrew.rybchenko@oktetlabs.ru \
--cc=david.marchand@redhat.com \
--cc=dev@dpdk.org \
--cc=ferruh.yigit@amd.com \
--cc=jerinj@marvell.com \
--cc=jijiang.liu@intel.com \
--cc=kaiwenx.deng@intel.com \
--cc=mb@smartsharesystems.com \
--cc=olivier.matz@6wind.com \
--cc=qiming.yang@intel.com \
--cc=stable@dpdk.org \
--cc=thomas@monjalon.net \
--cc=yidingx.zhou@intel.com \
--cc=yuying.zhang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).