patches for DPDK stable branches
 help / color / mirror / Atom feed
From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: "Morten Brørup" <mb@smartsharesystems.com>,
	"Ferruh Yigit" <ferruh.yigit@amd.com>,
	"David Marchand" <david.marchand@redhat.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Cc: "thomas@monjalon.net" <thomas@monjalon.net>,
	"stable@dpdk.org" <stable@dpdk.org>,
	Olivier Matz <olivier.matz@6wind.com>,
	Jijiang Liu <jijiang.liu@intel.com>,
	Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>,
	Kaiwen Deng <kaiwenx.deng@intel.com>,
	"qiming.yang@intel.com" <qiming.yang@intel.com>,
	"yidingx.zhou@intel.com" <yidingx.zhou@intel.com>,
	Aman Singh <aman.deep.singh@intel.com>,
	Yuying Zhang <yuying.zhang@intel.com>,
	Jerin Jacob <jerinj@marvell.com>
Subject: RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
Date: Tue, 16 Apr 2024 09:26:58 +0000	[thread overview]
Message-ID: <151ec53014f64068b07adb3645ba6da8@huawei.com> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F3A4@smartserver.smartshare.dk>



> > From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> > Sent: Monday, 15 April 2024 17.08
> >
> > On 4/12/2024 3:44 PM, Morten Brørup wrote:
> > >>>>>>>> Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum
> > >> offload
> > >>>>>>>> examples.
> > >>>>>>>
> > >>>>>>> I strongly disagree with this change!
> > >>>>>>>
> > >>>>>>> It will cause a huge performance degradation for shaping
> > >> applications:
> > >>>>>>>
> > >>>>>>> A packet will be processed and finalized at an output or
> > >> forwarding
> > >>>>>> pipeline stage, where some other fields might also be written,
> > >> so
> > >>>>>>> zeroing e.g. the out_ip checksum at this stage has low cost
> > >> (no new
> > >>>>>> cache misses).
> > >>>>>>>
> > >>>>>>> Then, the packet might be queued for QoS or similar.
> > >>>>>>>
> > >>>>>>> If rte_eth_tx_prepare() must be called at the egress pipeline
> > >> stage,
> > >>>>>> it has to write to the packet and cause a cache miss per packet,
> > >>>>>>> instead of simply passing on the packet to the NIC hardware.
> > >>>>>>>
> > >>>>>>> It must be possible to finalize the packet at the
> > >> output/forwarding
> > >>>>>> pipeline stage!
> > >>>>>>
> > >>>>>> If you can finalize your packet on  output/forwarding, then why
> > >> you
> > >>>>>> can't invoke tx_prepare() on the same stage?
> > >>>>>> There seems to be some misunderstanding about what tx_prepare()
> > >> does -
> > >>>>>> in fact it doesn't communicate with HW queue (doesn't update TXD
> > >> ring,
> > >>>>>> etc.), what it does - just make changes in mbuf itself.
> > >>>>>> Yes, it reads some fields in SW TX queue struct (max number of
> > >> TXDs per
> > >>>>>> packet, etc.), but AFAIK it is safe
> > >>>>>> to call tx_prepare() and tx_burst() from different threads.
> > >>>>>> At least on implementations I am aware about.
> > >>>>>> Just checked the docs - it seems not stated explicitly anywhere,
> > >> might
> > >>>>>> be that's why it causing such misunderstanding.
> > >>>>>>
> > >>>>>>>
> > >>>>>>> Also, how is rte_eth_tx_prepare() supposed to work for cloned
> > >> packets
> > >>>>>> egressing on different NIC hardware?
> > >>>>>>
> > >>>>>> If you create a clone of full packet (including L2/L3) headers
> > >> then
> > >>>>>> obviously such construction might not
> > >>>>>> work properly with tx_prepare() over two different NICs.
> > >>>>>> Though In majority of cases you do clone segments with data,
> > >> while at
> > >>>>>> least L2 headers are put into different segments.
> > >>>>>> One simple approach would be to keep L3 header in that separate
> > >> segment.
> > >>>>>> But yes, there is a problem when you'll need to send exactly the
> > >> same
> > >>>>>> packet over different NICs.
> > >>>>>> As I remember, for bonding PMD things don't work quite well here
> > >> - you
> > >>>>>> might have a bond over 2 NICs with
> > >>>>>> different tx_prepare() and which one to call might be not clear
> > >> till
> > >>>>>> actual PMD tx_burst() is invoked.
> > >>>>>>
> > >>>>>>>
> > >>>>>>> In theory, it might get even worse if we make this opaque
> > >> instead of
> > >>>>>> transparent and standardized:
> > >>>>>>> One PMD might reset out_ip checksum to 0x0000, and another PMD
> > >> might
> > >>>>>> reset it to 0xFFFF.
> > >>>>>>
> > >>>>>>>
> > >>>>>>> I can only see one solution:
> > >>>>>>> We need to standardize on common minimum requirements for how
> > >> to
> > >>>>>> prepare packets for each TX offload.
> > >>>>>>
> > >>>>>> If we can make each and every vendor to agree here - that
> > >> definitely
> > >>>>>> will help to simplify things quite a bit.
> > >>>>>
> > >>>>> An API is more than a function name and parameters.
> > >>>>> It also has preconditions and postconditions.
> > >>>>>
> > >>>>> All major NIC vendors are contributing to DPDK.
> > >>>>> It should be possible to reach consensus for reasonable minimum
> > >> requirements
> > >>>> for offloads.
> > >>>>> Hardware- and driver-specific exceptions can be documented with
> > >> the offload
> > >>>> flag, or with rte_eth_rx/tx_burst(), like the note to
> > >>>>> rte_eth_rx_burst():
> > >>>>> "Some drivers using vector instructions require that nb_pkts is
> > >> divisible by
> > >>>> 4 or 8, depending on the driver implementation."
> > >>>>
> > >>>> If we introduce a rule that everyone supposed to follow and then
> > >> straightway
> > >>>> allow people to have a 'documented exceptions',
> > >>>> for me it means like 'no rule' in practice.
> > >>>> A 'documented exceptions' approach might work if you have 5
> > >> different PMDs to
> > >>>> support, but not when you have 50+.
> > >>>> No-one would write an app with possible 10 different exception
> > cases
> > >> in his
> > >>>> head.
> > >>>> Again, with such approach we can forget about backward
> > >> compatibility.
> > >>>> I think we already had this discussion before, my opinion remains
> > >> the same
> > >>>> here -
> > >>>> 'documented exceptions' approach is a way to trouble.
> > >>>
> > >>> The "minimum requirements" should be the lowest common denominator
> > of
> > >> all NICs.
> > >>> Exceptions should be extremely few, for outlier NICs that still want
> > >> to provide an offload and its driver is unable to live up to the
> > >>> minimum requirements.
> > >>> Any exception should require techboard approval. If a NIC/driver
> > does
> > >> not support the "minimum requirements" for an offload
> > >>> feature, it is not allowed to claim support for that offload
> > feature,
> > >> or needs to seek approval for an exception.
> > >>>
> > >>> As another option for NICs not supporting the minimum requirements
> > of
> > >> an offload feature, we could introduce offload flags with
> > >>> finer granularity. E.g. one offload flag for "gold standard" TX
> > >> checksum update (where the packet's checksum field can have any
> > >>> value), and another offload flag for "silver standard" TX checksum
> > >> update (where the packet's checksum field must have a
> > >>> precomputed value).
> > >>
> > >> Actually yes, I was thinking in the same direction - we need some
> > extra
> > >> API to allow user to distinguish.
> > >> Probably we can do something like that: a new API for the ethdev call
> > >> that would take as a parameter
> > >> TX offloads bitmap and in return specify would it need to modify
> > >> contents of packet to support these
> > >> offloads or not.
> > >> Something like:
> > >> int rte_ethdev_tx_offload_pkt_mod_required(unt64_t tx_offloads)
> > >>
> > >> For the majority of the drivers that satisfy these "minimum
> > >> requirements" corresponding devops
> > >> entry will be empty and we'll always return 0, otherwise PMD has to
> > >> provide a proper devop.
> > >> Then again, it would be up to the user, to determine can he pass same
> > >> packet to 2 different NICs or not.
> > >>
> > >> I suppose it is similar to what you were talking about?
> > >
> > > I was thinking something more simple:
> > >
> > > The NIC exposes its RX and TX offload capabilities to the application
> > through the rx/tx_offload_capa and other fields in the rte_eth_dev_info
> > structure returned by rte_eth_dev_info_get().
> > >
> > > E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag
> > set.
> > > These capability flags (or enums) are mostly undocumented in the code,
> > but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability means that
> > the NIC is able to update the IPv4 header checksum at egress (on the
> > wire, i.e. without modifying the mbuf or packet data), and that the
> > application must set RTE_MBUF_F_TX_IP_CKSUM in the mbufs to utilize this
> > offload.
> > > I would define and document what each capability flag/enum exactly
> > means, the minimum requirements (as defined by the DPDK community) for
> > the driver to claim support for it, and the requirements for an
> > application to use it.
> > >
> >
> > +1 to improve documentation, and clear offload where it is needed.
> >
> > Another gap is in testing, whenever a device/driver claims an offload
> > capability, we don't have a test suit to confirm and verify this claim.
> 
> Yep, conformance testing is lacking big time in our CI.
> Adding the ts-factory tests to the CI was a big step in this direction.
> But for now, we are mostly relying on vendor's internal testing.
> 
> >
> >
> > > For the sake of discussion, let's say that
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update
> > capability (i.e. no requirements to the checksum field in the packet
> > contents).
> > > If some NIC requires the checksum field in the packet contents to have
> > a precomputed value, the NIC would not be allowed to claim the
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
> > > Such a NIC would need to define and document a new capability, e.g.
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver standard" TX
> > checksum update capability.
> > >
> > > In other words: I would encode variations of offload capabilities
> > directly in the capabilities flags.
> > > Then we don't need additional APIs to help interpret those
> > capabilities.
> > >
> > > This way, the application can probe the NIC capabilities to determine
> > what can be offloaded, and how to do it.
> > >
> > > The application can be designed to:
> > > 1. use a common packet processing pipeline, utilizing only the lowest
> > common capabilities denominator of all detected NICs, or
> > > 2. use a packet processing pipeline, handling packets differently
> > according to the capabilities of the involved NICs.
> > >
> >
> > Offload capabilities are already provided to enable applications as you
> > mentioned above.
> >
> > Agree that '_ASSISTED' capability flags gives more details to
> > application to manage it but my concern is it complicates offloading
> > more.
> >
> > The number of "assisted" offloads is not much, mainly it is for cksum.
> > Current approach is simpler, devices requires precondition implements it
> > in 'tx_prepare' and application using these offloads calls before
> > 'tx_burst'. Device/drivers doesn't require it don't have 'tx_prepare' so
> > no impact to the application.
> >
> > Will it work to have something in between, another capability to hold if
> > 'tx_prepare' needs to be called with this device or not?
> > It can be possible to make this variable 32bits that holds offloads
> > require assistance in a bit-wise way, but I prefer simple if
> > 'tx_prepare' required flag, this may help application to decide to use
> > offloads in that device and to call 'tx_prepare' or not.
> 
> Consider an IP Multicast packet to be transmitted on many ports.
> 
> With my suggestion, the packet can be prepared once - i.e. the output/forwarding stage sets the IP header checksum as required by
> the (lowest common denominator) offload capabilities flag - before being cloned for tx() on each port.
> 
> With tx_prepare(), the packet needs to be copied (deep copy!) for each port, and then tx_prepare() needs to be called for each port
> before tx().

Not really. At least not always.
Let say for multicast, you'll most likely will need to have a different L2 header for each ethdev port anyway.
Usual way to overcome it - have a separate segment for L2 header attached to the rest of the packet data segment.
Nothing stops you to have inner/outer L3/L4 headers in that 'header' segment too.
Then you do need to have a multiple copies of data segment.

Actually after another thought - that might be a simple way to fix problem with bonding PMD:
allow it to create a 'header' segments on need.
Not sure how it will affect performance, but should help to avoid current problem when packet contents
are modified in tx_burst().

> The opaque tx_prepare() concept violates the concept of not modifying the packet at egress.
 
That actually depends where you draw a line for egress: before or after tx_prepare(). 

> DPDK follows a principle of not using opaque types, so application developers can optimize accordingly. The opaque behavior of
> tx_prepare() violates this principle, and I consider it a horrible hack!
> >
> > > NB: There may be other variations than requiring packet contents to be
> > modified, and they might be granular.
> > > E.g. a NIC might require assistance for TCP/UDP checksum offload, but
> > not for IP checksum offload, so a function telling if packet contents
> > requires modification would not suffice.
> > > E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the
> > rte_eth_dev_info structure doesn't expose information about the max
> > number of segments it can handle.
> > >
> >
> > Another good point to report max number of segment can be handled, +1
> >
> >
> > > PS: For backwards compatibility, we might define
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to
> > support the current "minimum requirements", and add
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload.
> > >
> > >
> > >>
> > >>> For reference, consider RSS, where the feature support flags have
> > very
> > >> high granularity.
> > >>>
> > >>>>
> > >>>>> You mention the bonding driver, which is a good example.
> > >>>>> The rte_eth_tx_burst() documentation has a note about the API
> > >> postcondition
> > >>>> exception for the bonding driver:
> > >>>>> "This function must not modify mbufs (including packets data)
> > >> unless the
> > >>>> refcnt is 1. An exception is the bonding PMD, [...], mbufs
> > >>>>> may be modified."
> > >>>>
> > >>>> For me, what we've done for bonding tx_prepare/tx_burst() is a
> > >> really bad
> > >>>> example.
> > >>>> Initial agreement and design choice was that tx_burst() should not
> > >> modify
> > >>>> contents of the packets
> > >>>> (that actually was one of the reasons why tx_prepare() was
> > >> introduced).
> > >>>> The only reason I agreed on that exception - because I couldn't
> > >> come-up with
> > >>>> something less uglier.
> > >>>>
> > >>>> Actually, these problems with bonding PMD made me to start thinking
> > >> that
> > >>>> current
> > >>>> tx_prepare/tx_burst approach might need to be reconsidered somehow.
> > >>>
> > >>> In cases where a preceding call to tx_prepare() is required, how is
> > it
> > >> worse modifying the packet in tx_burst() than modifying the
> > >>> packet in tx_prepare()?
> > >>>
> > >>> Both cases violate the postcondition that packets are not modified
> > at
> > >> egress.
> > >>>
> > >>>>
> > >>>>>> Then we can probably have one common tx_prepare() for all
> > >> vendors ;)
> > >>>>>
> > >>>>> Yes, that would be the goal.
> > >>>>> More realistically, the ethdev layer could perform the common
> > >> checks, and
> > >>>> only the non-conforming drivers would have to implement
> > >>>>> their specific tweaks.
> > >>>>
> > >>>> Hmm,  but that's what we have right now:
> > >>>> - fields in mbuf and packet data that user has to fill correctly
> > and
> > >> dev
> > >>>> specific tx_prepare().
> > >>>> How what you suggest will differ then?
> > >>>
> > >>> You're 100 % right here. We could move more checks into the ethdev
> > >> layer, specifically checks related to the "minimum
> > >>> requirements".
> > >>>
> > >>>> And how it will help let say with bonding PMD situation, or with
> > TX-
> > >> ing of the
> > >>>> same packet over 2 different NICs?
> > >>>
> > >>> The bonding driver is broken.
> > >>> It can only be fixed by not violating the egress postcondition in
> > >> either tx_burst() or tx_prepare().
> > >>> "Minimum requirements" might help doing that.
> > >>>
> > >>>>
> > >>>>> If we don't standardize the meaning of the offload flags, the
> > >> application
> > >>>> developers cannot trust them!
> > >>>>> I'm afraid this is the current situation - application developers
> > >> either
> > >>>> test with specific NIC hardware, or don't use the offload features.
> > >>>>
> > >>>> Well, I have used TX offloads through several projects, it worked
> > >> quite well.
> > >>>
> > >>> That is good to hear.
> > >>> And I don't oppose to that.
> > >>>
> > >>> In this discussion, I am worried about the roadmap direction for
> > DPDK.
> > >>> I oppose to the concept of requiring calling tx_prepare() before
> > >> calling tx_burst() when using offload. I think it is conceptually
> > wrong,
> > >>> and breaks the egress postcondition.
> > >>> I propose "minimum requirements" as a better solution.
> > >>>
> > >>>> Though have to admit, never have to use TX offloads together with
> > >> our bonding
> > >>>> PMD.
> > >>>>
> > >


  reply	other threads:[~2024-04-16  9:27 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20240405125039.897933-1-david.marchand@redhat.com>
2024-04-05 12:49 ` [PATCH " David Marchand
2024-04-05 12:49 ` [PATCH 4/8] app/testpmd: fix outer IP checksum offload David Marchand
2024-04-05 12:49 ` [PATCH 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-05 12:49 ` [PATCH 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-05 12:49 ` [PATCH 7/8] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
     [not found] ` <20240405144604.906695-1-david.marchand@redhat.com>
2024-04-05 14:45   ` [PATCH v2 3/8] mbuf: fix Tx checksum offload examples David Marchand
2024-04-05 16:20     ` Morten Brørup
2024-04-08 10:12       ` David Marchand
2024-04-09 13:38       ` Konstantin Ananyev
2024-04-09 14:44         ` Morten Brørup
2024-04-10 10:35           ` Konstantin Ananyev
2024-04-10 12:20             ` Morten Brørup
2024-04-12 12:46               ` Konstantin Ananyev
2024-04-12 14:44                 ` Morten Brørup
2024-04-12 15:17                   ` Konstantin Ananyev
2024-04-12 15:54                     ` Morten Brørup
2024-04-16  9:16                       ` Konstantin Ananyev
2024-04-16 11:36                         ` Konstantin Ananyev
2024-04-15 15:07                   ` Ferruh Yigit
2024-04-16  7:14                     ` Morten Brørup
2024-04-16  9:26                       ` Konstantin Ananyev [this message]
2024-04-05 14:45   ` [PATCH v2 4/8] app/testpmd: fix outer IP checksum offload David Marchand
2024-04-05 14:45   ` [PATCH v2 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-05 14:46   ` [PATCH v2 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-05 14:46   ` [PATCH v2 7/8] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
     [not found] ` <20240418082023.1767998-1-david.marchand@redhat.com>
2024-04-18  8:20   ` [PATCH v3 3/7] app/testpmd: fix outer IP checksum offload David Marchand
2024-04-18  8:20   ` [PATCH v3 4/7] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-18  8:20   ` [PATCH v3 5/7] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-18  8:20   ` [PATCH v3 6/7] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=151ec53014f64068b07adb3645ba6da8@huawei.com \
    --to=konstantin.ananyev@huawei.com \
    --cc=aman.deep.singh@intel.com \
    --cc=andrew.rybchenko@oktetlabs.ru \
    --cc=david.marchand@redhat.com \
    --cc=dev@dpdk.org \
    --cc=ferruh.yigit@amd.com \
    --cc=jerinj@marvell.com \
    --cc=jijiang.liu@intel.com \
    --cc=kaiwenx.deng@intel.com \
    --cc=mb@smartsharesystems.com \
    --cc=olivier.matz@6wind.com \
    --cc=qiming.yang@intel.com \
    --cc=stable@dpdk.org \
    --cc=thomas@monjalon.net \
    --cc=yidingx.zhou@intel.com \
    --cc=yuying.zhang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).