DPDK patches and discussions
From: Stephen Hemminger <stephen@networkplumber.org>
To: "Morten Brørup" <mb@smartsharesystems.com>
Cc: "Ferruh Yigit" <ferruh.yigit@intel.com>, <dev@dpdk.org>,
	"Thomas Monjalon" <thomas@monjalon.net>,
	"Andrew Rybchenko" <andrew.rybchenko@oktetlabs.ru>,
	<matan@nvidia.com>, "Qi Zhang" <qi.z.zhang@intel.com>,
	"Ajit Khaparde" <ajit.khaparde@broadcom.com>,
	"Ray Kinsella" <mdr@ashroe.eu>,
	"Bruce Richardson" <bruce.richardson@intel.com>,
	"Damjan Marion (damarion)" <damarion@cisco.com>,
	"Roy Fan Zhang" <roy.fan.zhang@intel.com>,
	"Min Hu (Connor)" <humin29@huawei.com>,
	"Konstantin Ananyev" <konstantin.ananyev@intel.com>,
	"Stokes, Ian" <ian.stokes@intel.com>,
	"David Marchand" <david.marchand@redhat.com>
Subject: Re: MTU and frame size filtering inaccuracy
Date: Wed, 2 Mar 2022 08:21:59 -0800
Message-ID: <20220302082159.06d3d872@hermes.local>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D86F14@smartserver.smartshare.dk>

On Wed, 2 Mar 2022 09:53:42 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
> > Sent: Tuesday, 1 March 2022 18.50
> > 
> > Hi all,
> > 
> > There is a problem in MTU setting in DPDK.  
> 
> Yes, and the root cause is the unclear definition of what "MTU" means in DPDK! This is causing the confusion about L3 packet size, L2 raw packet size, and L2 encapsulated packet size.
> 
> Traditional Ethernet links are expected to provide a 1500 byte L3 MTU. This means that an untagged packet can be 1518 bytes (incl. 14 byte Ethernet header and 4 byte Ethernet CRC), a VLAN tagged packet can be 1522 bytes, a QinQ tagged packet can be 1526 bytes, and MPLS tagged packets can be other sizes, depending on the number of MPLS labels.
> 
> Optimally, the NIC hardware would understand these additional headers and determine if the packet is oversized or not, e.g. on a hybrid link (i.e. mixed untagged and VLAN tagged traffic), it should consider a 1522 byte packet oversize if untagged, but correctly sized if VLAN tagged. However, the NIC hardware doesn't do this.
> 
> The above only describes the problem of converting between the L3 and L2 packet size - i.e. the logical packet sizes. There is also a physical limitation:
> 
> The NIC hardware might support a certain maximum raw L2 packet size, such as 1522 bytes or 2048 bytes. In this case, you don't want to allow larger packets regardless of the number of VLAN tags or MPLS labels preceding the actual packet. You could even risk allocating MBUFs that are too small.
> 
> In summary, I think the whole MTU handling API is utterly defective.
> 
> Optimally, the API should discriminate between maximum encapsulated L2 packet size (i.e. not counting the bytes used for VLAN tags and similar) and maximum raw L2 packet size (i.e. also counting bytes used for VLAN tags and similar).
> 
> When this was discussed on the DPDK mailing list a couple of years ago [1], there was no support for improving on this situation, and the decision was to blindly adopt Linux's way of handling it: consider the MTU as if packets are untagged, and allow 4 more bytes for single VLAN tagged packets. I don't recall exactly how QinQ tagged packets are supposed to be considered regarding the MTU, and I also don't know where any of this is documented.
> 
> [1] http://inbox.dpdk.org/dev/MN2PR18MB2432526A39C6ECEB2CEB8865AFE00@MN2PR18MB2432.namprd18.prod.outlook.com/
> 
> > 
> > In 'rte_eth_dev_configure()' and 'rte_eth_dev_set_mtu()', the MTU is
> > converted to a frame size.
> > 
> > Since the L2 protocol header size changes based on what the HW supports,
> > the L2 overhead information is obtained from the PMD, but this still
> > doesn't solve the issue.
> > 
> > The PMD reports the max overhead based on what it supports, but there is
> > no way to know what the received packets will have. Sample:
> > 
> > i40e has 26 bytes overhead: HDR_LEN + CRC_LEN + VLAN_LEN * 2.
> > When the MTU is set to 1500, the configured frame size becomes 1526.
> > When a packet is received with no VLAN tag and a 1504 byte payload,
> > the packet frame size is 1522 bytes and it is accepted.
> > So although the MTU is set to 1500 bytes, a packet with 1504 bytes is accepted.
> > 
> > There is an inaccuracy in frame size filtering of up to 8 bytes.
> > 
> > 
> > Damjan reported the same, and he has a good point on what the
> > application needs (I hope it is OK to quote from his email):
> > 
> > 1) information about the biggest l2 frame interface it can receive and
> > send (1518,1522, 2000 or jumbo)  
> 
> Yes, I think the API should report the "maximum raw L2 packet size" (i.e. also counting the bytes used for any preceding tags, regardless if they are stripped or not).
> 
> > 2) ability to ask hardware to help him with filtering oversized frames
> > 
> > 
> > We need to fix (2), I am not quite sure how, any comment is welcome.  
> 
> This would require NIC hardware support and optionally the addition of a NIC configuration flag to control whether it should count the bytes used by any preceding VLAN tags and/or MPLS labels when evaluating the packet size or not.
> 
> The short term solution is a workaround in the application: Configure the NICs with an oversize MTU (e.g. +8 bytes to support QinQ packets) and check the packets for oversize in the application. Unfortunately, this also means that the NIC hardware counters are no longer correct, and the reported counters must be adjusted for the number of oversize packets detected by the application.
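
As a rough illustration of the application-side check described in the quoted
workaround, here is a minimal sketch in plain C. It assumes the Ethernet CRC has
already been stripped, that only 802.1Q/802.1ad tags precede the payload (no
MPLS), and that every VLAN tag is excluded from the MTU. The names l3_len() and
is_oversize() are hypothetical helpers, not DPDK API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VLAN_TAG_LEN  4u       /* TPID + TCI */
#define TPID_8021Q    0x8100u  /* C-VLAN tag */
#define TPID_8021AD   0x88A8u  /* S-VLAN (QinQ outer) tag */

/* Return the L3 payload length of a frame whose CRC has been stripped,
 * skipping any leading VLAN tags. Returns 0 if the frame is truncated. */
static size_t l3_len(const uint8_t *frame, size_t frame_len)
{
    size_t off = 12; /* offset of the first EtherType/TPID field */

    while (off + 2 <= frame_len) {
        uint16_t type = (uint16_t)((frame[off] << 8) | frame[off + 1]);

        if (type != TPID_8021Q && type != TPID_8021AD)
            return frame_len - (off + 2);
        off += VLAN_TAG_LEN; /* skip this tag, look at the next type field */
    }
    return 0;
}

/* Assumption: the MTU limits the L3 payload only, whatever the tag count. */
static bool is_oversize(const uint8_t *frame, size_t frame_len, size_t mtu)
{
    return l3_len(frame, frame_len) > mtu;
}
```

An application using something like this would also have to count the frames it
drops, so that it can correct the NIC counters as mentioned in the quote.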

MTU is often a confusing term. Ideally there would be separate Max Receive Unit and
Max Transmit Unit values. I can tell you what the Linux (and BSD) kernels do. On transmit,
the MTU is used as a filter on packet size before packets are passed to the device driver.
It is also used to tell TSO what size units to use.

But on receive, in the kernel any size packet is allowed!  The MTU is used to program the
hardware receive buffers.  Many devices round MTU + VLAN up to whatever hardware increment
they can handle. Some devices only handle powers of 2, which is why E1000 allows 2K packets
to come in when there is a 1500 byte MTU.

The other source of confusion is around MTU and VLANs and encaps. DPDK should be doing what
other OSes and most network vendors do.
The convention is that the outer VLAN tag is not part of the MTU, but any other tags and
encaps subtract from the usable MTU.  I.e., with MTU = 1500 and QinQ, the usable MTU is
1500 - 4 = 1496.
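
The same arithmetic as a minimal sketch, considering VLAN tags only (other
encaps would subtract their own header sizes). usable_mtu() is a hypothetical
helper name, not an existing API.

```c
#include <stdint.h>

#define VLAN_TAG_LEN 4u

/* Usable L3 MTU under the convention above: the outer VLAN tag is free,
 * every additional tag costs VLAN_TAG_LEN bytes of the configured MTU. */
static uint32_t usable_mtu(uint32_t mtu, uint32_t n_tags)
{
    uint32_t charged = n_tags > 1 ? n_tags - 1 : 0;
    uint32_t cost = charged * VLAN_TAG_LEN;

    return cost < mtu ? mtu - cost : 0;
}
```

So usable_mtu(1500, 0) and usable_mtu(1500, 1) both give 1500, while QinQ,
usable_mtu(1500, 2), gives 1496.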

For receive and MTU, DPDK should allow any size coming in that the HW can receive.
Postel's Law: be conservative in what you do, be liberal in what you accept from others.



Thread overview: 5+ messages
2022-03-01 17:50 Ferruh Yigit
2022-03-02  8:53 ` Morten Brørup
2022-03-02 16:21   ` Stephen Hemminger [this message]
2022-03-02 16:50     ` Morten Brørup
2022-03-02 17:40       ` Stephen Hemminger
