Something just struck me… The buffer address field in the RX descriptor of some NICs may have alignment requirements, i.e. the lowest bits of the buffer address field in the NIC's RX descriptor may be used for other purposes (and assumed zero for buffer address purposes). A headroom of 40 is divisible by 8, but a headroom of 20 requires that the NIC hardware support 4-byte aligned addresses (so only the 2 lowest bits may be used for other purposes). Here's an example of what I mean:
https://docs.amd.com/r/en-US/am011-versal-acap-trm/RX-Descriptor-Words

If any of your supported NICs have that restriction, i.e. require an 8-byte aligned buffer address, your concept of having the UDP payload at the same fixed offset for both IPv4 and IPv6 is not going to be possible. (And you were lucky that the offset happens to be sufficiently aligned to work for IPv4 to begin with.)

It seems you need to read a bunch of datasheets before proceeding.

Med venlig hilsen / Kind regards,
-Morten Brørup

From: Garrett D'Amore [mailto:garrett@damore.org]
Sent: Tuesday, 26 March 2024 15.19

This could work. Note that we would then have the exceptional case of IPv6 using less headroom: 40 would remain our compiled-in default, and we would reduce it by 20 on IPv6, which doesn't have to support all the same devices that IPv4 does. This would give the lowest disruption to the existing IPv4 stack and allow PMDs to be updated incrementally.

On Mar 26, 2024 at 1:05 AM -0700, Morten Brørup wrote:

Interesting requirement. I can easily imagine how a (non-forwarding, i.e. traffic-terminating) application, which doesn't really care about the preceding headers, can benefit from having its actual data at a specific offset for alignment purposes. I don't consider this very exotic. (Even the Linux kernel uses this trick to achieve improved IP header alignment on RX.)

I think the proper solution would be to add a new offload parameter to rte_eth_rxconf to specify how many bytes the driver should subtract from RTE_PKTMBUF_HEADROOM when writing the RX descriptor to the NIC hardware. Depending on driver support, this would make it configurable per device and per RX queue. If this parameter is set, the driver should adjust m->data_off accordingly on RX, so rte_pktmbuf_mtod[_offset]() and rte_pktmbuf_iova[_offset]() still point to the Ethernet header.

Med venlig hilsen / Kind regards,
-Morten Brørup

From: Garrett D'Amore [mailto:garrett@damore.org]
Sent: Monday, 25 March 2024 23.56

So we need (for reasons that I don't want to get into in too much detail) our UDP payload headers to be at a specific offset in the packet. This was not a problem as long as we only used IPv4. (We have configured 40 bytes of headroom, which is more than any of our PMDs need by a hefty margin.) Now that we're extending to support IPv6, we need to reduce that headroom by 20 bytes (the difference between the IPv6 and IPv4 header lengths) to preserve our UDP payload offset.

This has big ramifications for how we fragment our own upper-layer messages, and it has been determined that updating the PMDs to allow us to change the headroom for this use case (on a per-port basis, as we will have some ports on IPv4 and others on IPv6) is the least effort, by a large margin. (Well, copying the frames via memcpy would be less development effort, but would be a performance catastrophe.)

For the transmit side we don't need this, as we can simply adjust the packet as needed. But for the receive side we are kind of stuck, as the PMDs rely on the hard-coded RTE_PKTMBUF_HEADROOM to program receive locations.
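To make the numbers in this thread concrete, here is a minimal, compilable sketch of the offset arithmetic and of Morten's alignment point. It is plain C with no DPDK dependency; the assumption that the receive buffer itself starts at least 8-byte aligned is mine, not something stated in the thread.

    /* Sketch: the arithmetic behind the thread.  Header sizes are the
     * standard ones; the 8-byte starting alignment of the buffer is an
     * assumption for illustration. */
    #include <assert.h>
    #include <stdio.h>

    #define ETH_HDR 14 /* Ethernet, no VLAN tag */
    #define IP4_HDR 20 /* IPv4, no options */
    #define IP6_HDR 40 /* IPv6 fixed header */
    #define UDP_HDR  8

    /* Largest power-of-two alignment (up to 8) that offset 'off' keeps,
     * assuming the buffer itself starts at least 8-byte aligned. */
    static unsigned guaranteed_align(unsigned off)
    {
        unsigned a = 8;
        while (a > 1 && off % a != 0)
            a /= 2;
        return a;
    }

    int main(void)
    {
        unsigned headroom_v4 = 40; /* the compiled-in default in the thread */
        unsigned payload_off = headroom_v4 + ETH_HDR + IP4_HDR + UDP_HDR; /* 82 */

        /* Keeping the payload at the same absolute offset despite IPv6's
         * 20 extra header bytes forces the headroom down by exactly 20. */
        unsigned headroom_v6 = payload_off - (ETH_HDR + IP6_HDR + UDP_HDR); /* 20 */
        assert(headroom_v6 == headroom_v4 - (IP6_HDR - IP4_HDR));

        /* Morten's point: the RX descriptor is programmed with
         * buffer_start + headroom, so the headroom decides how many low
         * address bits are guaranteed zero. */
        printf("IPv4: headroom %2u -> descriptor address %u-byte aligned\n",
               headroom_v4, guaranteed_align(headroom_v4));
        printf("IPv6: headroom %2u -> descriptor address %u-byte aligned\n",
               headroom_v6, guaranteed_align(headroom_v6));
        return 0;
    }

Running it shows an 8-byte-aligned descriptor address in the IPv4 case but only 4-byte alignment in the IPv6 case, which is exactly the restriction the datasheets need to be checked for.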
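The rte_eth_rxconf parameter described above is only a proposal; no released DPDK has it. Purely as an illustration of what a PMD's RX refill path might do with such a knob, here is a hypothetical sketch. Everything in it (fake_rx_queue, headroom_reduction, the one-word descriptor ring) is invented for this example; only rte_mbuf_data_iova_default(), data_off, and RTE_PKTMBUF_HEADROOM are real DPDK names.

    /* HYPOTHETICAL sketch of the proposed rte_eth_rxconf knob; no released
     * DPDK has a headroom-reduction field.  The queue structure and the
     * descriptor layout below are invented for illustration only. */
    #include <rte_mbuf.h>

    struct fake_rx_queue {
        uint16_t headroom_reduction;  /* would be copied from rte_eth_rxconf */
        volatile uint64_t *desc_ring; /* imaginary ring: one address per slot */
    };

    static void
    refill_one(struct fake_rx_queue *q, uint16_t idx, struct rte_mbuf *m)
    {
        /* Program the NIC with a buffer address that starts 'reduction'
         * bytes before the usual RTE_PKTMBUF_HEADROOM location... */
        q->desc_ring[idx] = rte_mbuf_data_iova_default(m) - q->headroom_reduction;

        /* ...and fix up data_off so rte_pktmbuf_mtod() still points at
         * the Ethernet header once the NIC has DMA'd the frame. */
        m->data_off = RTE_PKTMBUF_HEADROOM - q->headroom_reduction;
    }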
As far as header splitting goes, that would indeed be a much, much nicer solution. I haven't looked at the latest code to see if header splitting is even an option; the version of DPDK I'm working with is a little older (20.11). We have to update, but we have other local changes, so updating is one of the things that we still have to do. At any rate, the version I did look at doesn't seem to support header splits on any device other than FM10K. That's not terrifically interesting for us. We use Mellanox, E810 (ICE), bnxt, and cloud NICs (all of them really: ENA, virtio-net, etc.). We also have a fair amount of ixgbe and i40e on client systems in the field.

We also, unfortunately, have an older DPDK 18 with Mellanox contributions for IP over IB... though I'm not sure we will try to support IPv6 there. (We are working towards replacing that part of the stack with UCX.)

Unless header splitting will work on all of this (excepting the IPoIB piece), it's not something we can really use.

On Mar 25, 2024 at 10:20 AM -0700, Stephen Hemminger wrote:

On Mon, 25 Mar 2024 10:01:52 +0000
Bruce Richardson wrote:

On Sat, Mar 23, 2024 at 01:51:25PM -0700, Garrett D'Amore wrote:
> So we right now (at WEKA) have a somewhat older version of DPDK that we
> have customized heavily, and I am going to need to make the headroom
> *dynamic* (passed in at run time, and per port).
> We have this requirement because we need the payload to be at a specific
> offset, but have to deal with different header lengths for IPv4 and now
> IPv6.
> My reason for pointing this out is that I would dearly like it if we
> could collaborate on this -- this change is going to touch pretty much
> every PMD (we don't need it on all of them, as we only support a subset
> of PMDs, but it's still a significant set).
> I'm not sure if anyone else has considered such a need -- this
> particular message caught my eye as I'm looking specifically at this
> area right now.

Hi, thanks for reaching out. Can you clarify a little more as to the need for this requirement? Can you not just set the headroom value to the max needed value for any port and use that? Is there an issue with having blank space at the start of a buffer?

Thanks, /Bruce

If you have to make such a deep change across all PMDs then maybe it is not the best solution. What about being able to do some form of buffer chaining or pullup?
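For when the upgrade from 20.11 lands: recent DPDK releases have a generic RX buffer-split API that does this per queue. Below is a minimal sketch using the 21.11+ names as I remember them (rte_eth_rxseg_split, RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT), so it is worth verifying against the actual release; support is still per-PMD (mlx5 at least advertises it, the others would need checking), and the function and pool names are my own.

    /* Sketch: a two-segment RX queue where segment 0 gets the L2..L4
     * headers and segment 1 gets the payload, so the payload lands at a
     * fixed offset in its own mbuf for IPv4 and IPv6 alike. */
    #include <errno.h>
    #include <string.h>
    #include <rte_ethdev.h>

    static int
    setup_split_queue(uint16_t port, uint16_t queue, uint16_t nb_desc,
                      struct rte_mempool *hdr_pool,
                      struct rte_mempool *pay_pool, uint16_t hdr_len)
    {
        struct rte_eth_dev_info info;
        struct rte_eth_rxconf rxconf;
        union rte_eth_rxseg segs[2];
        int ret;

        ret = rte_eth_dev_info_get(port, &info);
        if (ret != 0)
            return ret;
        if (!(info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT))
            return -ENOTSUP; /* this PMD cannot split; fall back */

        memset(&rxconf, 0, sizeof(rxconf));
        memset(segs, 0, sizeof(segs));

        /* Segment 0: the headers.  Since IPv4 vs. IPv6 is per port here,
         * hdr_len is known at queue setup time (e.g. 14+20+8 or 14+40+8
         * for Ethernet/IP/UDP). */
        segs[0].split.mp = hdr_pool;
        segs[0].split.length = hdr_len;
        segs[0].split.offset = 0;

        /* Segment 1: the payload; length 0 means "the rest of the
         * packet", always starting at the same offset in its own mbuf. */
        segs[1].split.mp = pay_pool;
        segs[1].split.length = 0;
        segs[1].split.offset = 0;

        rxconf.rx_seg = segs;
        rxconf.rx_nseg = 2;
        rxconf.offloads = RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT;

        /* With rx_seg/rx_nseg set, the mempool argument must be NULL. */
        return rte_eth_rx_queue_setup(port, queue, nb_desc,
                                      rte_eth_dev_socket_id(port),
                                      &rxconf, NULL);
    }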
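Stephen's chaining suggestion can also be approximated in software on an unmodified PMD: copy only the few dozen header bytes into a small leading mbuf and chain the received buffer behind it, so the payload always begins at offset 0 of the second segment. That copies roughly 42/62 bytes rather than the whole frame, but it does produce a two-segment packet, which may or may not be acceptable to the application. A sketch, with invented function and pool names, assuming single-segment input:

    /* Software header split: peel hdr_len bytes into a fresh mbuf and
     * chain the original behind it.  Only the headers are copied. */
    #include <string.h>
    #include <rte_mbuf.h>

    static struct rte_mbuf *
    soft_header_split(struct rte_mbuf *pkt, struct rte_mempool *hdr_pool,
                      uint16_t hdr_len)
    {
        struct rte_mbuf *hdr = rte_pktmbuf_alloc(hdr_pool);
        char *dst;

        if (hdr == NULL)
            return NULL;

        /* Copy just the L2..L4 headers into the new first segment. */
        dst = rte_pktmbuf_append(hdr, hdr_len);
        if (dst == NULL) {
            rte_pktmbuf_free(hdr);
            return NULL;
        }
        memcpy(dst, rte_pktmbuf_mtod(pkt, const char *), hdr_len);

        /* Trim the headers off the original mbuf; its data pointer now
         * sits on the UDP payload, at a fixed offset (0) in the segment. */
        rte_pktmbuf_adj(pkt, hdr_len);

        /* Chain: hdr becomes the head, pkt carries the payload. */
        if (rte_pktmbuf_chain(hdr, pkt) != 0) {
            rte_pktmbuf_free(hdr); /* caller still owns pkt on failure */
            return NULL;
        }
        return hdr;
    }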