* MTU and frame size filtering inaccuracy
@ 2022-03-01 17:50 Ferruh Yigit
2022-03-02 8:53 ` Morten Brørup
0 siblings, 1 reply; 5+ messages in thread
From: Ferruh Yigit @ 2022-03-01 17:50 UTC (permalink / raw)
To: dev
Cc: Thomas Monjalon, Andrew Rybchenko, matan, Qi Zhang,
Ajit Khaparde, Stephen Hemminger, Ray Kinsella, Bruce Richardson,
Damjan Marion (damarion),
Roy Fan Zhang, Morten Brørup, Min Hu (Connor),
Konstantin Ananyev, Stokes, Ian, David Marchand
Hi all,
There is a problem with MTU setting in DPDK.
In 'rte_eth_dev_configure()' and 'rte_eth_dev_set_mtu()', the MTU is
converted to a frame size.
Since the L2 protocol header size depends on what the HW supports,
the L2 overhead information is obtained from the PMD, but this still
doesn't solve the issue.
The PMD reports the maximum overhead based on what it supports, but there is
no way to know which headers received packets will actually have. Example:
i40e has 26 bytes of overhead: HDR_LEN + CRC_LEN + VLAN_LEN * 2.
When the MTU is set to 1500, the configured frame size becomes 1526.
When a packet is received with no VLAN tag and a 1504-byte payload,
its frame size is 1522 bytes and it is accepted.
So although the MTU is set to 1500 bytes, a packet with a 1504-byte payload is accepted.
The frame size filtering can be inaccurate by up to 8 bytes.
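To make the arithmetic concrete, here is a minimal sketch of a length-only filter. The constant and function names are hypothetical and local to this sketch; they are not DPDK or i40e identifiers, and the filter only models the behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants; names are local to this sketch, not DPDK macros. */
#define SKETCH_ETHER_HDR_LEN 14u /* dst MAC + src MAC + EtherType */
#define SKETCH_ETHER_CRC_LEN 4u
#define SKETCH_VLAN_TAG_LEN  4u

/* Worst-case L2 overhead as described for i40e: header + CRC + two VLAN tags. */
static inline uint32_t sketch_max_l2_overhead(void)
{
    return SKETCH_ETHER_HDR_LEN + SKETCH_ETHER_CRC_LEN + 2u * SKETCH_VLAN_TAG_LEN;
}

/* Frame size limit obtained by adding the worst-case overhead to the MTU. */
static inline uint32_t sketch_frame_limit(uint32_t mtu)
{
    return mtu + sketch_max_l2_overhead();
}

/* On-wire frame size of a packet with `payload` bytes and `vlan_tags` tags. */
static inline uint32_t sketch_frame_size(uint32_t payload, unsigned int vlan_tags)
{
    assert(vlan_tags <= 2);
    return payload + SKETCH_ETHER_HDR_LEN + SKETCH_ETHER_CRC_LEN
           + vlan_tags * SKETCH_VLAN_TAG_LEN;
}

/* Length-only filter: it cannot tell how many tags the frame really carries. */
static inline bool sketch_hw_accepts(uint32_t mtu, uint32_t payload,
                                     unsigned int vlan_tags)
{
    return sketch_frame_size(payload, vlan_tags) <= sketch_frame_limit(mtu);
}
```

With an MTU of 1500, sketch_hw_accepts(1500, 1504, 0) is true, and an untagged payload of up to 1508 bytes still passes, which is the 8-byte inaccuracy.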
Damjan reported the same, and he has a good point on what the application
needs (I hope it is OK to quote from his email):
1) information about the biggest L2 frame the interface can receive and send (1518, 1522, 2000 or jumbo)
2) the ability to ask the hardware to help with filtering oversized frames
We need to fix (2). I am not quite sure how; any comment is welcome.
--
Thanks,
ferruh
* RE: MTU and frame size filtering inaccuracy
2022-03-01 17:50 MTU and frame size filtering inaccuracy Ferruh Yigit
@ 2022-03-02 8:53 ` Morten Brørup
2022-03-02 16:21 ` Stephen Hemminger
0 siblings, 1 reply; 5+ messages in thread
From: Morten Brørup @ 2022-03-02 8:53 UTC (permalink / raw)
To: Ferruh Yigit, dev
Cc: Thomas Monjalon, Andrew Rybchenko, matan, Qi Zhang,
Ajit Khaparde, Stephen Hemminger, Ray Kinsella, Bruce Richardson,
Damjan Marion (damarion), Roy Fan Zhang, Min Hu (Connor),
Konstantin Ananyev, Stokes, Ian, David Marchand
> From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
> Sent: Tuesday, 1 March 2022 18.50
>
> Hi all,
>
> There is a problem in MTU setting in DPDK.
Yes, and the root cause is the unclear definition of what "MTU" means in DPDK! This is causing the confusion about L3 packet size, L2 raw packet size, and L2 encapsulated packet size.
Traditional Ethernet links are expected to provide a 1500 byte L3 MTU. This means that an untagged packet can be 1518 bytes (incl. 14 byte Ethernet header and 4 byte Ethernet CRC), a VLAN tagged packet can be 1522 bytes, a QinQ tagged packet can be 1526 bytes, and MPLS tagged packets can be other sizes, depending on the number of MPLS labels.
Optimally, the NIC hardware would understand these additional headers and determine if the packet is oversized or not, e.g. on a hybrid link (i.e. mixed untagged and VLAN tagged traffic), it should consider a 1522 byte packet oversize if untagged, but correctly sized if VLAN tagged. However, the NIC hardware doesn't do this.
The above only describes the problem of converting between the L3 and L2 packet size - i.e. the logical packet sizes. There is also a physical limitation:
The NIC hardware might support a certain maximum raw L2 packet size, such as 1522 bytes or 2048 bytes. In this case, you don't want to allow larger packets, regardless of the number of VLAN tags or MPLS labels preceding the actual packet. You could even risk allocating MBUFs that are too small.
In summary, I think the whole MTU handling API is utterly defective.
Optimally, the API should discriminate between maximum encapsulated L2 packet size (i.e. not counting the bytes used for VLAN tags and similar) and maximum raw L2 packet size (i.e. also counting bytes used for VLAN tags and similar).
When this was discussed on the DPDK mailing list a couple of years ago [1], there was no support for improving on this situation, and the decision was to blindly adopt Linux's way of handling it: Consider the MTU as if packets are untagged, and allow 4 more bytes for single VLAN tagged packets. I don't recall exactly how QinQ tagged packets are supposed to be considered regarding the MTU, and I also don't know where any of this is documented.
[1] http://inbox.dpdk.org/dev/MN2PR18MB2432526A39C6ECEB2CEB8865AFE00@MN2PR18MB2432.namprd18.prod.outlook.com/
> Damjan reported the same, and he has good point on the application
> need (I hope it is OK to quote from his email):
>
> 1) information about the biggest l2 frame interface it can receive and
> send (1518,1522, 2000 or jumbo)
Yes, I think the API should report the "maximum raw L2 packet size" (i.e. also counting the bytes used for any preceding tags, regardless of whether they are stripped or not).
> 2) ability to ask hardware to help him with filtering oversized frames
>
>
> We need to fix (2), I am not quite sure how, any comment is welcome.
This would require NIC hardware support and optionally the addition of a NIC configuration flag to control whether it should count the bytes used by any preceding VLAN tags and/or MPLS labels when evaluating the packet size or not.
The short term solution is a workaround in the application: Configure the NICs with an oversize MTU (e.g. +8 bytes to support QinQ packets) and check the packets for oversize in the application. Unfortunately, this also means that the NIC hardware counters are no longer correct, and the reported counters must be adjusted by the number of oversize packets detected by the application.
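A rough sketch of such an application-side check follows. It assumes the frame is available as a raw byte buffer without the CRC, and it only recognises up to two VLAN tags (TPID 0x8100 or 0x88A8); MPLS is not handled. All names are hypothetical and not part of any DPDK API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETHER_HDR_LEN 14u
#define SKETCH_VLAN_TAG_LEN  4u
#define SKETCH_TPID_VLAN     0x8100u /* IEEE 802.1Q */
#define SKETCH_TPID_QINQ     0x88A8u /* IEEE 802.1ad */

/* Read a big-endian 16-bit value from a byte buffer. */
static uint16_t sketch_read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Return true when the payload behind the Ethernet header and up to two
 * VLAN tags exceeds `mtu`. `len` is the frame length without the CRC.
 */
static bool sketch_is_oversize(const uint8_t *frame, size_t len, uint32_t mtu)
{
    size_t l2_len = SKETCH_ETHER_HDR_LEN;
    size_t type_off = 12; /* offset of the first EtherType/TPID field */

    assert(frame != NULL);
    if (len < SKETCH_ETHER_HDR_LEN)
        return false; /* runt frame; not an oversize problem */

    for (int i = 0; i < 2; i++) {
        uint16_t type = sketch_read_be16(frame + type_off);

        if (type != SKETCH_TPID_VLAN && type != SKETCH_TPID_QINQ)
            break;
        if (len < l2_len + SKETCH_VLAN_TAG_LEN)
            break; /* truncated tag; stop parsing */
        l2_len += SKETCH_VLAN_TAG_LEN;
        type_off += SKETCH_VLAN_TAG_LEN;
    }
    return len - l2_len > mtu;
}
```

The application would call this per received packet, drop the packet when it returns true, and keep its own drop count so that the reported statistics can be corrected.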
* Re: MTU and frame size filtering inaccuracy
2022-03-02 8:53 ` Morten Brørup
@ 2022-03-02 16:21 ` Stephen Hemminger
2022-03-02 16:50 ` Morten Brørup
0 siblings, 1 reply; 5+ messages in thread
From: Stephen Hemminger @ 2022-03-02 16:21 UTC (permalink / raw)
To: Morten Brørup
Cc: Ferruh Yigit, dev, Thomas Monjalon, Andrew Rybchenko, matan,
Qi Zhang, Ajit Khaparde, Ray Kinsella, Bruce Richardson,
Damjan Marion (damarion), Roy Fan Zhang, Min Hu (Connor),
Konstantin Ananyev, Stokes, Ian, David Marchand
MTU is often a confusing term. Ideally there would be a Max Receive Unit and a Max Transmit Unit.
I can tell you what the Linux (and BSD) kernels do. On transmit, the MTU is used as a filter on the
size of packets before they are passed to the device driver. It is also used to tell TSO what size
units to use.
But on receive, the kernel allows packets of any size! The MTU is used by the hardware to program
receive buffers. Many devices round MTU + VLAN up to whatever hardware increment they can
handle. Some devices only handle powers of 2, which is why E1000 allows 2K packets to come in when
there is a 1500 byte MTU.
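As a simplified model of that rounding (not actual E1000 driver logic, and the function names are hypothetical), a device that only supports power-of-2 receive sizes would behave roughly like this:

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next power of two (n itself if already a power of two). */
static uint32_t sketch_round_up_pow2(uint32_t n)
{
    uint32_t p = 1;

    assert(n > 0 && n <= 0x80000000u);
    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Assumed model: receive limit = MTU + Ethernet header (14) + CRC (4)
 * + one VLAN tag (4), rounded up to the next power of two.
 */
static uint32_t sketch_rx_limit_pow2(uint32_t mtu)
{
    return sketch_round_up_pow2(mtu + 14u + 4u + 4u);
}
```

Under these assumptions a 1500 byte MTU gives a 1522 byte frame, which rounds up to a 2048 byte receive limit.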
The other source of confusion is around MTU, VLANs and encapsulations. DPDK should be doing what
other OSes and most network vendors do.
The convention is that the outer VLAN tag is not part of the MTU, but any other tags and encapsulations subtract from the usable MTU. I.e., with MTU = 1500 and QinQ, the usable MTU is 1500 - 4 = 1496.
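Expressed as a small helper (a sketch only; the function and its parameters are hypothetical, and 4 bytes per VLAN tag is assumed):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_VLAN_TAG_LEN 4u

/*
 * Usable MTU under the convention above: the outer VLAN tag is free,
 * every additional VLAN tag and any other encapsulation bytes are
 * subtracted from the configured MTU.
 */
static uint32_t sketch_usable_mtu(uint32_t mtu, unsigned int vlan_tags,
                                  uint32_t other_encap_bytes)
{
    uint32_t extra_tags = vlan_tags > 1 ? vlan_tags - 1 : 0;
    uint32_t overhead = extra_tags * SKETCH_VLAN_TAG_LEN + other_encap_bytes;

    assert(overhead <= mtu);
    return mtu - overhead;
}
```

For example, sketch_usable_mtu(1500, 2, 0) yields 1496 for QinQ, while a single VLAN tag leaves the full 1500 bytes.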
For receive and MTU, DPDK should allow any size coming in that HW can receive.
Postel's Law - Be conservative in what you do, be liberal in what you accept from others.
* RE: MTU and frame size filtering inaccuracy
2022-03-02 16:21 ` Stephen Hemminger
@ 2022-03-02 16:50 ` Morten Brørup
2022-03-02 17:40 ` Stephen Hemminger
0 siblings, 1 reply; 5+ messages in thread
From: Morten Brørup @ 2022-03-02 16:50 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Ferruh Yigit, dev, Thomas Monjalon, Andrew Rybchenko, matan,
Qi Zhang, Ajit Khaparde, Ray Kinsella, Bruce Richardson,
Damjan Marion (damarion), Roy Fan Zhang, Min Hu (Connor),
Konstantin Ananyev, Stokes, Ian, David Marchand
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 2 March 2022 17.22
>
> MTU is often a confusing term. Ideally there would be Max Receive Unit
> and Max Transmit Unit.
> I can tell you what Linux (and BSD) kernel do. On transmit MTU is used
> as filter to size packets before they are passed to the device driver.
> Also it is used to tell TSO what size units to use.
>
> But on receive, in kernel any size packet is allowed!
This makes perfect sense on a server, where traffic is terminated or originated.
However, I'm not so sure that it makes the same level of sense on a router or other network appliance.
> The MTU is used by the hardware to program receive buffers. Many
> devices round up to MTU + VLAN to what ever hardware increment they
> can handle. Some devices only handle power of 2 which is why E1000
> allows 2K packets to come in when there is a 1500 byte MTU.
>
> The other source of confusion is around MTU and VLAN's and encaps.
> DPDK should be doing what other OS's and most network vendors do.
> The convention is that the outer VLAN tag is not part of the MTU but
> any other tags and encaps subtract from the usable MTU. I.e with MTU =
> 1500 and QinQ the usable MTU is 1500 - 4 = 1496.
>
> For receive and MTU, DPDK should allow any size coming in that HW can
> receive.
> Postel's Law - Be conservative in what you do, be liberal in what you
> accept from others.
>
Again, Postel's Law makes perfect sense on a server.
On a network appliance, I would prefer discarding oversize packets at ingress, rather than process them and discard them at egress.
I do admit that my viewpoint is quite academic, and mostly relevant for L2 bridging appliances, where fragmentation is not available. It is probably the best approach for DPDK to align the API with what common NIC hardware supports. In other words, I agree with Stephen. :-)
* Re: MTU and frame size filtering inaccuracy
2022-03-02 16:50 ` Morten Brørup
@ 2022-03-02 17:40 ` Stephen Hemminger
0 siblings, 0 replies; 5+ messages in thread
From: Stephen Hemminger @ 2022-03-02 17:40 UTC (permalink / raw)
To: Morten Brørup
Cc: Ferruh Yigit, dev, Thomas Monjalon, Andrew Rybchenko, matan,
Qi Zhang, Ajit Khaparde, Ray Kinsella, Bruce Richardson,
Damjan Marion (damarion), Roy Fan Zhang, Min Hu (Connor),
Konstantin Ananyev, Stokes, Ian, David Marchand
On Wed, 2 Mar 2022 17:50:13 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:
> Again, Postel's Law makes perfect sense on a server.
>
> On a network appliance, I would prefer discarding oversize packets at ingress, rather than process them and discard them at egress.
>
> I do admit that my viewpoint is quite academic, and mostly relevant for L2 bridging appliances, where fragmentation is not available. It is probably the best approach for DPDK to align the API with what common NIC hardware supports. In other words, I agree with Stephen. :-)
The HW should try (if it can) to filter overly large frames, but it is still the responsibility
of router and bridge software to look at the MTU of the egress port and take action.
For an L2 bridge, that means silently dropping, and for L3 it means responding with ICMP.
The DPDK examples don't do this, and they should.
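A minimal sketch of that egress decision, assuming the forwarding code already knows the packet's L3 length and the egress port MTU. The enum and function names are hypothetical and do not come from the DPDK examples.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_fwd_mode { SKETCH_L2_BRIDGE, SKETCH_L3_ROUTER };

enum sketch_egress_action {
    SKETCH_FORWARD,     /* packet fits the egress MTU */
    SKETCH_DROP_SILENT, /* L2 bridge: drop without notification */
    SKETCH_SEND_ICMP    /* L3 router: drop and respond with ICMP */
};

/* Decide what to do with a packet of `l3_len` bytes on a port with `egress_mtu`. */
static enum sketch_egress_action
sketch_egress_check(enum sketch_fwd_mode mode, uint32_t l3_len, uint32_t egress_mtu)
{
    assert(mode == SKETCH_L2_BRIDGE || mode == SKETCH_L3_ROUTER);

    if (l3_len <= egress_mtu)
        return SKETCH_FORWARD;
    return mode == SKETCH_L2_BRIDGE ? SKETCH_DROP_SILENT : SKETCH_SEND_ICMP;
}
```

Building and sending the ICMP message itself is left out; the sketch only covers the decision that Stephen says the examples are missing.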