From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 0A931A04A4;
	Wed,  2 Mar 2022 17:22:06 +0100 (CET)
Received: from [217.70.189.124] (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id 9B63D42715;
	Wed,  2 Mar 2022 17:22:05 +0100 (CET)
Received: from mail-pg1-f182.google.com (mail-pg1-f182.google.com
 [209.85.215.182])
 by mails.dpdk.org (Postfix) with ESMTP id 0BA2440141
 for <dev@dpdk.org>; Wed,  2 Mar 2022 17:22:04 +0100 (CET)
Received: by mail-pg1-f182.google.com with SMTP id 27so2043884pgk.10
 for <dev@dpdk.org>; Wed, 02 Mar 2022 08:22:03 -0800 (PST)
Received: from hermes.local (204-195-112-199.wavecable.com. [204.195.112.199])
 by smtp.gmail.com with ESMTPSA id
 v66-20020a622f45000000b004f129e7767fsm20508764pfv.130.2022.03.02.08.22.01
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Wed, 02 Mar 2022 08:22:02 -0800 (PST)
Date: Wed, 2 Mar 2022 08:21:59 -0800
From: Stephen Hemminger <stephen@networkplumber.org>
To: Morten Brørup <mb@smartsharesystems.com>
Cc: "Ferruh Yigit" <ferruh.yigit@intel.com>, <dev@dpdk.org>, "Thomas
 Monjalon" <thomas@monjalon.net>, "Andrew Rybchenko"
 <andrew.rybchenko@oktetlabs.ru>, <matan@nvidia.com>, "Qi Zhang"
 <qi.z.zhang@intel.com>, "Ajit Khaparde" <ajit.khaparde@broadcom.com>, "Ray
 Kinsella" <mdr@ashroe.eu>, "Bruce Richardson" <bruce.richardson@intel.com>,
 "Damjan Marion (damarion)" <damarion@cisco.com>, "Roy Fan Zhang"
 <roy.fan.zhang@intel.com>, "Min Hu (Connor)" <humin29@huawei.com>,
 "Konstantin Ananyev" <konstantin.ananyev@intel.com>, "Stokes, Ian"
 <ian.stokes@intel.com>, "David Marchand" <david.marchand@redhat.com>
Subject: Re: MTU and frame size filtering inaccuracy
Message-ID: <20220302082159.06d3d872@hermes.local>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D86F14@smartserver.smartshare.dk>
References: <e2554b78-cdda-aa33-ac6d-59a543a10640@intel.com>
 <98CBD80474FA8B44BF855DF32C47DC35D86F14@smartserver.smartshare.dk>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

On Wed, 2 Mar 2022 09:53:42 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
> > Sent: Tuesday, 1 March 2022 18.50
> >
> > Hi all,
> >
> > There is a problem in MTU setting in DPDK.
>
> Yes, and the root cause is the unclear definition of what "MTU" means in
> DPDK! This is causing the confusion about L3 packet size, L2 raw packet
> size, and L2 encapsulated packet size.
>
> Traditional Ethernet links are expected to provide a 1500 byte L3 MTU.
> This means that an untagged packet can be 1518 byte (incl. 14 byte
> Ethernet header and 4 byte Ethernet CRC), a VLAN tagged packet can be
> 1522 byte, a QinQ tagged packet can be 1526 byte, and MPLS tagged packets
> can be other sizes, depending on the number of MPLS labels.
>
> Optimally, the NIC hardware would understand these additional headers and
> determine if the packet is oversized or not, e.g. on a hybrid link (i.e.
> mixed untagged and VLAN tagged traffic), it should consider a 1522 byte
> packet oversize if untagged, but correctly sized if VLAN tagged. However,
> the NIC hardware doesn't do this.
>
> The above only describes the problem of converting between the L3 and L2
> packet size - i.e. the logical packet sizes. There is also a physical
> limitation:
>
> The NIC hardware might support a certain maximum raw L2 packet size, such
> as 1522 byte or 2048 byte. In this case, you don't want to allow larger
> packets regardless of the number of VLAN tags or MPLS labels preceding the
> actual packet. You could even risk allocating too small MBUFs.
>
> In summary, I think the whole MTU handling API is utterly defective.
>
> Optimally, the API should discriminate between maximum encapsulated L2
> packet size (i.e. not counting the bytes used for VLAN tags and similar)
> and maximum raw L2 packet size (i.e. also counting bytes used for VLAN
> tags and similar).
>
> When this was discussed on the DPDK mailing list a couple of years ago
> [1], there was no support for improving on this situation, and the
> decision was to blindly adopt Linux' way of handling it: Consider the MTU
> as if packets are untagged, and allow 4 more byte for single VLAN tagged
> packets. I don't recall exactly how QinQ tagged packets are supposed to be
> considered regarding the MTU, and I also don't know where any of this is
> documented.
>
> [1] http://inbox.dpdk.org/dev/MN2PR18MB2432526A39C6ECEB2CEB8865AFE00@MN2PR18MB2432.namprd18.prod.outlook.com/
>
> >
> > In 'rte_eth_dev_configure()'and 'rte_eth_dev_set_mtu()', MTU is
> > converted to frame size.
> >
> > Since L2 protocol header size changes based on what HW supports,
> > L2 overhead information get from PMD, but this still doesn't solve
> > the issue.
> >
> > PMD reports max overhead based on what it supports, but there is
> > no way to know what will received packets have. Sample:
> >
> > i40e has 26 bytes overhead: HRD_LEN + CRC_LEN + VLAN_LEN *2
> > when MTU set to 1500, configured frame size become 1526
> > When a packet received with no VLAN tag and 1504 bytes payload,
> > packet frame size is 1522 bytes and it is accepted.
> > So although MTU is set 1500 bytes, packet with 1504 bytes is accepted.
> >
> > There is an inaccuracy in frame size filtering up to 8 bytes.
> >
> >
> > Damjan reported the same, and he has good point on the application
> > need (I hope it is OK to quote from his email):
> >
> > 1) information about the biggest l2 frame interface it can receive and
> > send (1518,1522, 2000 or jumbo)
>
> Yes, I think the API should report the "maximum raw L2 packet size" (i.e.
> also counting the bytes used for any preceding tags, regardless if they
> are stripped or not).
>
> > 2) ability to ask hardware to help him with filtering oversized frames
> >
> >
> > We need to fix (2), I am not quite sure how, any comment is welcome.
>
> This would require NIC hardware support and optionally the addition of a
> NIC configuration flag to control whether it should count the bytes used
> by any preceding VLAN tags and/or MPLS labels when evaluating the packet
> size or not.
>
> The short term solution is a workaround in the application: Configure the
> NICs with an oversize MTU (e.g. +8 byte to support QinQ packets) and check
> the packets for oversize in the application. Unfortunately, this also
> means that the NIC hardware counters are no longer correct, and the
> reported counters must be adjusted for the number of oversize packets
> detected by the application.
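
For reference, the application-side check described in the quoted text
could look roughly like the sketch below. It is plain C with no DPDK
dependency; the function name and constants are made up for illustration.
It assumes the frame length excludes the CRC and that only 0x8100 and
0x88A8 TPIDs mark VLAN tags. Note that it counts no VLAN tag against the
MTU, i.e. it follows the "encapsulated L2 size" view from the quote, not
the Linux convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants, not taken from any DPDK header. */
#define ETH_HDR_LEN   14u      /* dst MAC + src MAC + EtherType */
#define VLAN_TAG_LEN   4u      /* one 802.1Q / 802.1ad tag */
#define TPID_8021Q    0x8100u
#define TPID_8021AD   0x88A8u

/*
 * Return 1 if the L3 payload of the frame is larger than mtu, else 0.
 * frame points at the destination MAC; len is the frame length without CRC.
 * Every stacked VLAN tag is skipped before the payload is measured.
 */
static int frame_is_oversize(const uint8_t *frame, size_t len, uint32_t mtu)
{
    size_t off = 12; /* offset of the first EtherType/TPID field */

    assert(frame != NULL);
    if (len < ETH_HDR_LEN)
        return 0; /* runt frame: too short, but not oversize */

    /* Skip a tag only if the EtherType that follows it is still in bounds. */
    while (off + 2 + VLAN_TAG_LEN <= len) {
        uint16_t type = (uint16_t)((frame[off] << 8) | frame[off + 1]);

        if (type != TPID_8021Q && type != TPID_8021AD)
            break;
        off += VLAN_TAG_LEN;
    }

    /* off + 2 is where the L2 header ends and the L3 payload starts. */
    return (len - (off + 2)) > mtu;
}
```

With an MTU of 1500, this flags the untagged frame with a 1504 byte
payload from the i40e sample, while a VLAN tagged frame with a 1500 byte
payload passes.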

MTU is often a confusing term. Ideally there would be separate Max Receive
Unit and Max Transmit Unit values.
I can tell you what the Linux (and BSD) kernels do. On transmit, the MTU is
used as a filter to size packets before they are passed to the device
driver. It is also used to tell TSO what size units to use.
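
As a rough illustration of how the MTU sets the TSO unit size, here is a
small sketch. The helper names are made up, and the header lengths are
passed in by the caller (20 + 20 bytes for IPv4 and TCP without options is
an assumption of the usage example, not something any particular kernel
hardcodes).

```c
#include <assert.h>
#include <stdint.h>

/* Largest TCP payload per segment for a given MTU and L3/L4 header sizes. */
static uint32_t tso_mss(uint32_t mtu, uint32_t l3_hdr_len, uint32_t l4_hdr_len)
{
    assert(mtu > l3_hdr_len + l4_hdr_len);
    return mtu - l3_hdr_len - l4_hdr_len;
}

/* Number of segments needed to carry payload_len bytes (rounded up). */
static uint32_t tso_segment_count(uint32_t payload_len, uint32_t mss)
{
    assert(mss != 0);
    return (payload_len + mss - 1) / mss;
}
```

For example, tso_mss(1500, 20, 20) gives 1460, and a 65000 byte payload
then needs tso_segment_count(65000, 1460) = 45 segments.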

But on receive, the kernel allows packets of any size! The MTU is only used
to program the hardware receive buffers. Many devices round MTU + VLAN up
to whatever hardware increment they can handle. Some devices only handle
powers of 2, which is why E1000 allows 2K packets to come in when there is
a 1500 byte MTU.
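
The rounding can be sketched like this. It is only an illustration of the
arithmetic, not code from the E1000 or any other driver; the names are
hypothetical, and it assumes the buffer must hold the MTU plus a 14 byte
Ethernet header and one VLAN tag.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants, not taken from any driver. */
#define ETH_HDR_LEN   14u  /* dst MAC + src MAC + EtherType */
#define VLAN_TAG_LEN   4u  /* one 802.1Q tag */

/* Round v up to the next power of two; v must be in 1..2^31. */
static uint32_t round_up_pow2(uint32_t v)
{
    uint32_t p = 1;

    assert(v != 0 && v <= 0x80000000u);
    while (p < v)
        p <<= 1;
    return p;
}

/* Receive buffer size for a device that only supports power-of-2 sizes. */
static uint32_t rx_buf_size_pow2(uint32_t mtu)
{
    return round_up_pow2(mtu + ETH_HDR_LEN + VLAN_TAG_LEN);
}
```

With a 1500 byte MTU this is 1518 rounded up to 2048, so anything up to 2K
fits in the buffer and gets received.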

The other source of confusion is around MTU, VLANs and encapsulations.
DPDK should be doing what other OSes and most network vendors do.
The convention is that the outer VLAN tag is not part of the MTU, but any
other tags and encapsulations subtract from the usable MTU. I.e. with
MTU = 1500 and QinQ, the usable MTU is 1500 - 4 = 1496.
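
In code, that convention amounts to something like the sketch below. The
function name and parameters are made up for illustration; encap_len
stands for whatever extra encapsulation bytes (e.g. MPLS labels) the
caller wants to account for.

```c
#include <assert.h>
#include <stdint.h>

#define VLAN_TAG_LEN 4u /* one 802.1Q / 802.1ad tag */

/*
 * Usable MTU under the convention that the outer VLAN tag is free, while
 * every further tag and any other encapsulation is taken out of the MTU.
 */
static uint32_t usable_mtu(uint32_t mtu, uint32_t n_vlan_tags,
                           uint32_t encap_len)
{
    uint32_t inner_tags = n_vlan_tags > 0 ? n_vlan_tags - 1 : 0;
    uint32_t overhead = inner_tags * VLAN_TAG_LEN + encap_len;

    assert(overhead <= mtu);
    return mtu - overhead;
}
```

So usable_mtu(1500, 1, 0) is still 1500, usable_mtu(1500, 2, 0) is 1496 for
QinQ, and adding two 4 byte MPLS labels on top of QinQ leaves 1488.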

On receive, DPDK should accept any packet size that the HW can receive,
regardless of the MTU.
Postel's Law - Be conservative in what you do, be liberal in what you
accept from others.