* [PATCH 3/8] mbuf: fix Tx checksum offload examples
[not found] <20240405125039.897933-1-david.marchand@redhat.com>
@ 2024-04-05 12:49 ` David Marchand
2024-04-05 12:49 ` [PATCH 4/8] app/testpmd: fix outer IP checksum offload David Marchand
` (5 subsequent siblings)
6 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-05 12:49 UTC (permalink / raw)
To: dev; +Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu
Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
examples.
Remove unneeded resetting of checksums to align with the mbuf
API doxygen.
Clarify the case where "inner" checksum offload is requested without
outer L4 checksum offload.
Fixes: f00dcb7b0a74 ("mbuf: fix Tx checksum offload API doc")
Fixes: 609dd68ef14f ("mbuf: enhance the API documentation of offload flags")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
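Note (illustration only, not part of the patch): a rough sketch of how the
flags from the examples below combine with rte_eth_tx_prepare() and
rte_eth_tx_burst(). The port/queue ids, the function name and the mbuf
setup are placeholders for the sketch.

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    static uint16_t
    xmit_with_cksum_offload(uint16_t port_id, uint16_t queue_id,
                            struct rte_mbuf **pkts, uint16_t nb_pkts)
    {
        uint16_t nb_prep, nb_tx;

        /* ol_flags, l2_len, l3_len, ... are assumed to be already set
         * on each mbuf, e.g. RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM. */
        nb_prep = rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);
        /* If nb_prep != nb_pkts, pkts[nb_prep] violates a driver
         * requirement and rte_errno tells why; only the packets that
         * passed the check are handed to the hardware, the caller keeps
         * ownership of the rest (as with any untransmitted mbuf). */
        nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);
        return nb_tx;
    }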
doc/guides/prog_guide/mbuf_lib.rst | 16 ++++------------
1 file changed, 4 insertions(+), 12 deletions(-)
diff --git a/doc/guides/prog_guide/mbuf_lib.rst b/doc/guides/prog_guide/mbuf_lib.rst
index 049357c755..4e285c0aab 100644
--- a/doc/guides/prog_guide/mbuf_lib.rst
+++ b/doc/guides/prog_guide/mbuf_lib.rst
@@ -126,6 +126,9 @@ processing to the hardware if it supports it. For instance, the
RTE_MBUF_F_TX_IP_CKSUM flag allows to offload the computation of the IPv4
checksum.
+Support for such processing by the hardware is advertised through RTE_ETH_TX_OFFLOAD_* capabilities.
+Please note that a call to ``rte_eth_tx_prepare`` is needed to handle driver specific requirements such as resetting some checksum fields.
+
The following examples explain how to configure different TX offloads on
a vxlan-encapsulated tcp packet:
``out_eth/out_ip/out_udp/vxlan/in_eth/in_ip/in_tcp/payload``
@@ -135,7 +138,6 @@ a vxlan-encapsulated tcp packet:
mb->l2_len = len(out_eth)
mb->l3_len = len(out_ip)
mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CSUM
- set out_ip checksum to 0 in the packet
This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM.
@@ -144,8 +146,6 @@ a vxlan-encapsulated tcp packet:
mb->l2_len = len(out_eth)
mb->l3_len = len(out_ip)
mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CSUM | RTE_MBUF_F_TX_UDP_CKSUM
- set out_ip checksum to 0 in the packet
- set out_udp checksum to pseudo header using rte_ipv4_phdr_cksum()
This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM
and RTE_ETH_TX_OFFLOAD_UDP_CKSUM.
@@ -155,7 +155,6 @@ a vxlan-encapsulated tcp packet:
mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
mb->l3_len = len(in_ip)
mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CSUM
- set in_ip checksum to 0 in the packet
This is similar to case 1), but l2_len is different. It is supported
on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM.
@@ -166,8 +165,6 @@ a vxlan-encapsulated tcp packet:
mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
mb->l3_len = len(in_ip)
mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CSUM | RTE_MBUF_F_TX_TCP_CKSUM
- set in_ip checksum to 0 in the packet
- set in_tcp checksum to pseudo header using rte_ipv4_phdr_cksum()
This is similar to case 2), but l2_len is different. It is supported
on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM and
@@ -181,9 +178,6 @@ a vxlan-encapsulated tcp packet:
mb->l4_len = len(in_tcp)
mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_TCP_CKSUM |
RTE_MBUF_F_TX_TCP_SEG;
- set in_ip checksum to 0 in the packet
- set in_tcp checksum to pseudo header without including the IP
- payload length using rte_ipv4_phdr_cksum()
This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_TCP_TSO.
Note that it can only work if outer L4 checksum is 0.
@@ -196,12 +190,10 @@ a vxlan-encapsulated tcp packet:
mb->l3_len = len(in_ip)
mb->ol_flags |= RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IP_CKSUM | \
RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_TCP_CKSUM;
- set out_ip checksum to 0 in the packet
- set in_ip checksum to 0 in the packet
- set in_tcp checksum to pseudo header using rte_ipv4_phdr_cksum()
This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM,
RTE_ETH_TX_OFFLOAD_UDP_CKSUM and RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM.
+ Note that it can only work if outer L4 checksum is 0.
The list of flags and their precise meaning is described in the mbuf API
documentation (rte_mbuf.h). Also refer to the testpmd source code
--
2.44.0
* [PATCH 4/8] app/testpmd: fix outer IP checksum offload
[not found] <20240405125039.897933-1-david.marchand@redhat.com>
2024-04-05 12:49 ` [PATCH 3/8] mbuf: fix Tx checksum offload examples David Marchand
@ 2024-04-05 12:49 ` David Marchand
2024-04-05 12:49 ` [PATCH 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
` (4 subsequent siblings)
6 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-05 12:49 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Aman Singh, Yuying Zhang,
Olivier Matz, Tomasz Kulasek, Konstantin Ananyev
Resetting the outer IP checksum to 0 is not something mandated by the
mbuf API and is done by rte_eth_tx_prepare(), or per driver if needed.
Fixes: 4fb7e803eb1a ("ethdev: add Tx preparation")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
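Note (illustration only, not part of the patch): the contract this change
aligns testpmd with, as a hedged sketch. The m, tx_offloads and
outer_ipv4_hdr names are placeholders.

    if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM) {
        /* hardware offload: any checksum-field reset the device needs
         * is left to rte_eth_tx_prepare() / the driver */
        m->ol_flags |= RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IP_CKSUM;
    } else {
        /* software fallback: the only place that touches the field */
        outer_ipv4_hdr->hdr_checksum = 0;
        outer_ipv4_hdr->hdr_checksum = rte_ipv4_cksum(outer_ipv4_hdr);
    }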
app/test-pmd/csumonly.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 6711dda42e..f5125c2788 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -583,15 +583,17 @@ process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
uint64_t ol_flags = 0;
if (info->outer_ethertype == _htons(RTE_ETHER_TYPE_IPV4)) {
- ipv4_hdr->hdr_checksum = 0;
ol_flags |= RTE_MBUF_F_TX_OUTER_IPV4;
- if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM)
+ if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM) {
ol_flags |= RTE_MBUF_F_TX_OUTER_IP_CKSUM;
- else
+ } else {
+ ipv4_hdr->hdr_checksum = 0;
ipv4_hdr->hdr_checksum = rte_ipv4_cksum(ipv4_hdr);
- } else
+ }
+ } else {
ol_flags |= RTE_MBUF_F_TX_OUTER_IPV6;
+ }
if (info->outer_l4_proto != IPPROTO_UDP)
return ol_flags;
--
2.44.0
* [PATCH 5/8] net: fix outer UDP checksum in Intel prepare helper
[not found] <20240405125039.897933-1-david.marchand@redhat.com>
2024-04-05 12:49 ` [PATCH 3/8] mbuf: fix Tx checksum offload examples David Marchand
2024-04-05 12:49 ` [PATCH 4/8] app/testpmd: fix outer IP checksum offload David Marchand
@ 2024-04-05 12:49 ` David Marchand
2024-04-05 12:49 ` [PATCH 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
` (3 subsequent siblings)
6 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-05 12:49 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Aman Singh, Yuying Zhang, Jie Hai,
Yisen Zhuang, Ferruh Yigit, Ting Xu
Setting a pseudo header checksum in the outer UDP checksum is an Intel
(and some other vendors') requirement.
Applications (like OVS) requesting outer UDP checksum without doing this
extra setup have broken outer UDP checksums.
Move this specific setup from testpmd to the "common" helper
rte_net_intel_cksum_flags_prepare().
net/hns3 can then be adjusted.
Bugzilla ID: 1406
Fixes: d8e5e69f3a9b ("app/testpmd: add GTP parsing and Tx checksum offload")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
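Note (illustration only, not part of the patch): a hedged sketch of what an
application such as OVS is expected to do after this change, for outer UDP
checksum offload on a vxlan-encapsulated tcp packet. The len() notation
follows the mbuf library documentation and m is an assumed mbuf pointer.

    m->outer_l2_len = len(out_eth);
    m->outer_l3_len = len(out_ip);
    m->l2_len = len(out_udp + vxlan + in_eth);
    m->l3_len = len(in_ip);
    m->ol_flags |= RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IP_CKSUM |
            RTE_MBUF_F_TX_OUTER_UDP_CKSUM | RTE_MBUF_F_TX_IPV4 |
            RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_TCP_CKSUM;
    /* rte_eth_tx_prepare() then reaches rte_net_intel_cksum_flags_prepare()
     * on drivers using this helper, which writes the outer UDP pseudo-header
     * checksum expected by the hardware. */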
app/test-pmd/csumonly.c | 11 +----
drivers/net/hns3/hns3_rxtx.c | 93 ++++++++++--------------------------
lib/net/rte_net.h | 18 ++++++-
3 files changed, 44 insertions(+), 78 deletions(-)
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index f5125c2788..71add6ca47 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -577,8 +577,6 @@ static uint64_t
process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
uint64_t tx_offloads, int tso_enabled, struct rte_mbuf *m)
{
- struct rte_ipv4_hdr *ipv4_hdr = outer_l3_hdr;
- struct rte_ipv6_hdr *ipv6_hdr = outer_l3_hdr;
struct rte_udp_hdr *udp_hdr;
uint64_t ol_flags = 0;
@@ -588,6 +586,8 @@ process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM) {
ol_flags |= RTE_MBUF_F_TX_OUTER_IP_CKSUM;
} else {
+ struct rte_ipv4_hdr *ipv4_hdr = outer_l3_hdr;
+
ipv4_hdr->hdr_checksum = 0;
ipv4_hdr->hdr_checksum = rte_ipv4_cksum(ipv4_hdr);
}
@@ -608,13 +608,6 @@ process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
/* Skip SW outer UDP checksum generation if HW supports it */
if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM) {
- if (info->outer_ethertype == _htons(RTE_ETHER_TYPE_IPV4))
- udp_hdr->dgram_cksum
- = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
- else
- udp_hdr->dgram_cksum
- = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
-
ol_flags |= RTE_MBUF_F_TX_OUTER_UDP_CKSUM;
return ol_flags;
}
diff --git a/drivers/net/hns3/hns3_rxtx.c b/drivers/net/hns3/hns3_rxtx.c
index 7e636a0a2e..03fc919fd7 100644
--- a/drivers/net/hns3/hns3_rxtx.c
+++ b/drivers/net/hns3/hns3_rxtx.c
@@ -3616,58 +3616,6 @@ hns3_pkt_need_linearized(struct rte_mbuf *tx_pkts, uint32_t bd_num,
return false;
}
-static bool
-hns3_outer_ipv4_cksum_prepared(struct rte_mbuf *m, uint64_t ol_flags,
- uint32_t *l4_proto)
-{
- struct rte_ipv4_hdr *ipv4_hdr;
- ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
- m->outer_l2_len);
- if (ol_flags & RTE_MBUF_F_TX_OUTER_IP_CKSUM)
- ipv4_hdr->hdr_checksum = 0;
- if (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) {
- struct rte_udp_hdr *udp_hdr;
- /*
- * If OUTER_UDP_CKSUM is support, HW can calculate the pseudo
- * header for TSO packets
- */
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG)
- return true;
- udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
- m->outer_l2_len + m->outer_l3_len);
- udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
-
- return true;
- }
- *l4_proto = ipv4_hdr->next_proto_id;
- return false;
-}
-
-static bool
-hns3_outer_ipv6_cksum_prepared(struct rte_mbuf *m, uint64_t ol_flags,
- uint32_t *l4_proto)
-{
- struct rte_ipv6_hdr *ipv6_hdr;
- ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *,
- m->outer_l2_len);
- if (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) {
- struct rte_udp_hdr *udp_hdr;
- /*
- * If OUTER_UDP_CKSUM is support, HW can calculate the pseudo
- * header for TSO packets
- */
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG)
- return true;
- udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
- m->outer_l2_len + m->outer_l3_len);
- udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
-
- return true;
- }
- *l4_proto = ipv6_hdr->proto;
- return false;
-}
-
static void
hns3_outer_header_cksum_prepare(struct rte_mbuf *m)
{
@@ -3675,29 +3623,38 @@ hns3_outer_header_cksum_prepare(struct rte_mbuf *m)
uint32_t paylen, hdr_len, l4_proto;
struct rte_udp_hdr *udp_hdr;
- if (!(ol_flags & (RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IPV6)))
+ if (!(ol_flags & (RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IPV6)) &&
+ ((ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) ||
+ !(ol_flags & RTE_MBUF_F_TX_TCP_SEG)))
return;
if (ol_flags & RTE_MBUF_F_TX_OUTER_IPV4) {
- if (hns3_outer_ipv4_cksum_prepared(m, ol_flags, &l4_proto))
- return;
+ struct rte_ipv4_hdr *ipv4_hdr;
+
+ ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
+ m->outer_l2_len);
+ l4_proto = ipv4_hdr->next_proto_id;
} else {
- if (hns3_outer_ipv6_cksum_prepared(m, ol_flags, &l4_proto))
- return;
+ struct rte_ipv6_hdr *ipv6_hdr;
+
+ ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *,
+ m->outer_l2_len);
+ l4_proto = ipv6_hdr->proto;
}
+ if (l4_proto != IPPROTO_UDP)
+ return;
+
/* driver should ensure the outer udp cksum is 0 for TUNNEL TSO */
- if (l4_proto == IPPROTO_UDP && (ol_flags & RTE_MBUF_F_TX_TCP_SEG)) {
- hdr_len = m->l2_len + m->l3_len + m->l4_len;
- hdr_len += m->outer_l2_len + m->outer_l3_len;
- paylen = m->pkt_len - hdr_len;
- if (paylen <= m->tso_segsz)
- return;
- udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
- m->outer_l2_len +
- m->outer_l3_len);
- udp_hdr->dgram_cksum = 0;
- }
+ hdr_len = m->l2_len + m->l3_len + m->l4_len;
+ hdr_len += m->outer_l2_len + m->outer_l3_len;
+ paylen = m->pkt_len - hdr_len;
+ if (paylen <= m->tso_segsz)
+ return;
+ udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
+ m->outer_l2_len +
+ m->outer_l3_len);
+ udp_hdr->dgram_cksum = 0;
}
static int
diff --git a/lib/net/rte_net.h b/lib/net/rte_net.h
index ef3ff4c6fd..efd9d5f5ee 100644
--- a/lib/net/rte_net.h
+++ b/lib/net/rte_net.h
@@ -121,7 +121,8 @@ rte_net_intel_cksum_flags_prepare(struct rte_mbuf *m, uint64_t ol_flags)
* no offloads are requested.
*/
if (!(ol_flags & (RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_L4_MASK | RTE_MBUF_F_TX_TCP_SEG |
- RTE_MBUF_F_TX_UDP_SEG | RTE_MBUF_F_TX_OUTER_IP_CKSUM)))
+ RTE_MBUF_F_TX_UDP_SEG | RTE_MBUF_F_TX_OUTER_IP_CKSUM |
+ RTE_MBUF_F_TX_OUTER_UDP_CKSUM)))
return 0;
if (ol_flags & (RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IPV6)) {
@@ -135,6 +136,21 @@ rte_net_intel_cksum_flags_prepare(struct rte_mbuf *m, uint64_t ol_flags)
struct rte_ipv4_hdr *, m->outer_l2_len);
ipv4_hdr->hdr_checksum = 0;
}
+ if (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) {
+ if (ol_flags & RTE_MBUF_F_TX_OUTER_IPV4) {
+ ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
+ m->outer_l2_len);
+ udp_hdr = (struct rte_udp_hdr *)((char *)ipv4_hdr +
+ m->outer_l3_len);
+ udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, m->ol_flags);
+ } else {
+ ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *,
+ m->outer_l2_len);
+ udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
+ m->outer_l2_len + m->outer_l3_len);
+ udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, m->ol_flags);
+ }
+ }
}
/*
--
2.44.0
* [PATCH 6/8] net/i40e: fix outer UDP checksum offload for X710
[not found] <20240405125039.897933-1-david.marchand@redhat.com>
` (2 preceding siblings ...)
2024-04-05 12:49 ` [PATCH 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
@ 2024-04-05 12:49 ` David Marchand
2024-04-05 12:49 ` [PATCH 7/8] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
` (2 subsequent siblings)
6 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-05 12:49 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Jun Wang, Yuying Zhang,
Beilei Xing, Jie Wang
According to the X710 datasheet (and confirmed in the field), X710
devices do not support outer checksum offload.
"""
8.4.4.2 Transmit L3 and L4 Integrity Offload
Tunneling UDP headers and GRE header are not offloaded while the
X710/XXV710/XL710 leaves their checksum field as is.
If a checksum is required, software should provide it as well as the inner
checksum value(s) that are required for the outer checksum.
"""
Fix Tx offload capabilities according to the hardware.
X722 may support such offload by setting I40E_TXD_CTX_QW0_L4T_CS_MASK.
Bugzilla ID: 1406
Fixes: 8cc79a1636cd ("net/i40e: fix forward outer IPv6 VXLAN")
Cc: stable@dpdk.org
Reported-by: Jun Wang <junwang01@cestc.cn>
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
Note: I do not have an X722 NIC. Intel devs, please check both the X710
and X722 series.
---
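Note (illustration only, not part of the patch): the application-side
consequence is that outer UDP checksum offload should only be enabled when
the port advertises it, which after this fix is no longer the case for
X710. port_id and port_conf are assumed to come from the application's
port setup.

    struct rte_eth_dev_info dev_info;

    if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
        rte_exit(EXIT_FAILURE, "cannot get device info\n");
    if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM)
        port_conf.txmode.offloads |= RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM;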
.mailmap | 1 +
drivers/net/i40e/i40e_ethdev.c | 6 +++++-
drivers/net/i40e/i40e_rxtx.c | 9 +++++++++
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/.mailmap b/.mailmap
index 3843868716..091766eca7 100644
--- a/.mailmap
+++ b/.mailmap
@@ -719,6 +719,7 @@ Junjie Wan <wanjunjie@bytedance.com>
Jun Qiu <jun.qiu@jaguarmicro.com>
Jun W Zhou <junx.w.zhou@intel.com>
Junxiao Shi <git@mail1.yoursunny.com>
+Jun Wang <junwang01@cestc.cn>
Jun Yang <jun.yang@nxp.com>
Junyu Jiang <junyux.jiang@intel.com>
Juraj Linkeš <juraj.linkes@pantheon.tech>
diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 380ce1a720..6535c7c178 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -3862,8 +3862,12 @@ i40e_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
RTE_ETH_TX_OFFLOAD_IPIP_TNL_TSO |
RTE_ETH_TX_OFFLOAD_GENEVE_TNL_TSO |
RTE_ETH_TX_OFFLOAD_MULTI_SEGS |
- RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM |
dev_info->tx_queue_offload_capa;
+ if (hw->mac.type == I40E_MAC_X722) {
+ dev_info->tx_offload_capa |=
+ RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM;
+ }
+
dev_info->dev_capa =
RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP |
RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP;
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5d25ab4d3a..a649911494 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -295,6 +295,15 @@ i40e_parse_tunneling_params(uint64_t ol_flags,
*/
*cd_tunneling |= (tx_offload.l2_len >> 1) <<
I40E_TXD_CTX_QW0_NATLEN_SHIFT;
+
+ /**
+ * Calculate the tunneling UDP checksum (only supported with X722).
+ * Shall be set only if L4TUNT = 01b and EIPT is not zero
+ */
+ if (!(*cd_tunneling & I40E_TXD_CTX_QW0_EXT_IP_MASK) &&
+ (*cd_tunneling & I40E_TXD_CTX_UDP_TUNNELING) &&
+ (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM))
+ *cd_tunneling |= I40E_TXD_CTX_QW0_L4T_CS_MASK;
}
static inline void
--
2.44.0
* [PATCH 7/8] net/iavf: remove outer UDP checksum offload for X710 VF
[not found] <20240405125039.897933-1-david.marchand@redhat.com>
` (3 preceding siblings ...)
2024-04-05 12:49 ` [PATCH 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
@ 2024-04-05 12:49 ` David Marchand
[not found] ` <20240405144604.906695-1-david.marchand@redhat.com>
[not found] ` <20240418082023.1767998-1-david.marchand@redhat.com>
6 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-05 12:49 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Jingjing Wu, Zhichao Zeng,
Peng Zhang, Qi Zhang
According to the X710 datasheet, X710 devices do not support outer
checksum offload.
"""
8.4.4.2 Transmit L3 and L4 Integrity Offload
Tunneling UDP headers and GRE header are not offloaded while the
X710/XXV710/XL710 leaves their checksum field as is.
If a checksum is required, software should provide it as well as the inner
checksum value(s) that are required for the outer checksum.
"""
Fix Tx offload capabilities depending on the VF type.
Bugzilla ID: 1406
Fixes: f7c8c36fdeb7 ("net/iavf: enable inner and outer Tx checksum offload")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
doc/guides/nics/features/iavf.ini | 2 +-
drivers/net/iavf/iavf_ethdev.c | 5 ++++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/doc/guides/nics/features/iavf.ini b/doc/guides/nics/features/iavf.ini
index c59115ae15..ce9860e963 100644
--- a/doc/guides/nics/features/iavf.ini
+++ b/doc/guides/nics/features/iavf.ini
@@ -33,7 +33,7 @@ L3 checksum offload = Y
L4 checksum offload = Y
Timestamp offload = Y
Inner L3 checksum = Y
-Inner L4 checksum = Y
+Inner L4 checksum = P
Packet type parsing = Y
Rx descriptor status = Y
Tx descriptor status = Y
diff --git a/drivers/net/iavf/iavf_ethdev.c b/drivers/net/iavf/iavf_ethdev.c
index 245b3cd854..bbf915097e 100644
--- a/drivers/net/iavf/iavf_ethdev.c
+++ b/drivers/net/iavf/iavf_ethdev.c
@@ -1174,7 +1174,6 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
RTE_ETH_TX_OFFLOAD_TCP_CKSUM |
RTE_ETH_TX_OFFLOAD_SCTP_CKSUM |
RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM |
- RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM |
RTE_ETH_TX_OFFLOAD_TCP_TSO |
RTE_ETH_TX_OFFLOAD_VXLAN_TNL_TSO |
RTE_ETH_TX_OFFLOAD_GRE_TNL_TSO |
@@ -1183,6 +1182,10 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
RTE_ETH_TX_OFFLOAD_MULTI_SEGS |
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
+ /* X710 does not support outer udp checksum */
+ if (adapter->hw.mac.type != IAVF_MAC_XL710)
+ dev_info->tx_offload_capa |= RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM;
+
if (vf->vf_res->vf_cap_flags & VIRTCHNL_VF_OFFLOAD_CRC)
dev_info->rx_offload_capa |= RTE_ETH_RX_OFFLOAD_KEEP_CRC;
--
2.44.0
* [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
[not found] ` <20240405144604.906695-1-david.marchand@redhat.com>
@ 2024-04-05 14:45 ` David Marchand
2024-04-05 16:20 ` Morten Brørup
2024-04-05 14:45 ` [PATCH v2 4/8] app/testpmd: fix outer IP checksum offload David Marchand
` (3 subsequent siblings)
4 siblings, 1 reply; 30+ messages in thread
From: David Marchand @ 2024-04-05 14:45 UTC (permalink / raw)
To: dev; +Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu
Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
examples.
Remove unneeded resetting of checksums to align with the mbuf
API doxygen.
Clarify the case where "inner" checksum offload is requested without
outer L4 checksum offload.
Fixes: f00dcb7b0a74 ("mbuf: fix Tx checksum offload API doc")
Fixes: 609dd68ef14f ("mbuf: enhance the API documentation of offload flags")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
doc/guides/prog_guide/mbuf_lib.rst | 16 ++++------------
1 file changed, 4 insertions(+), 12 deletions(-)
diff --git a/doc/guides/prog_guide/mbuf_lib.rst b/doc/guides/prog_guide/mbuf_lib.rst
index 049357c755..4e285c0aab 100644
--- a/doc/guides/prog_guide/mbuf_lib.rst
+++ b/doc/guides/prog_guide/mbuf_lib.rst
@@ -126,6 +126,9 @@ processing to the hardware if it supports it. For instance, the
RTE_MBUF_F_TX_IP_CKSUM flag allows to offload the computation of the IPv4
checksum.
+Support for such processing by the hardware is advertised through RTE_ETH_TX_OFFLOAD_* capabilities.
+Please note that a call to ``rte_eth_tx_prepare`` is needed to handle driver specific requirements such as resetting some checksum fields.
+
The following examples explain how to configure different TX offloads on
a vxlan-encapsulated tcp packet:
``out_eth/out_ip/out_udp/vxlan/in_eth/in_ip/in_tcp/payload``
@@ -135,7 +138,6 @@ a vxlan-encapsulated tcp packet:
mb->l2_len = len(out_eth)
mb->l3_len = len(out_ip)
mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CSUM
- set out_ip checksum to 0 in the packet
This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM.
@@ -144,8 +146,6 @@ a vxlan-encapsulated tcp packet:
mb->l2_len = len(out_eth)
mb->l3_len = len(out_ip)
mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CSUM | RTE_MBUF_F_TX_UDP_CKSUM
- set out_ip checksum to 0 in the packet
- set out_udp checksum to pseudo header using rte_ipv4_phdr_cksum()
This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM
and RTE_ETH_TX_OFFLOAD_UDP_CKSUM.
@@ -155,7 +155,6 @@ a vxlan-encapsulated tcp packet:
mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
mb->l3_len = len(in_ip)
mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CSUM
- set in_ip checksum to 0 in the packet
This is similar to case 1), but l2_len is different. It is supported
on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM.
@@ -166,8 +165,6 @@ a vxlan-encapsulated tcp packet:
mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
mb->l3_len = len(in_ip)
mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CSUM | RTE_MBUF_F_TX_TCP_CKSUM
- set in_ip checksum to 0 in the packet
- set in_tcp checksum to pseudo header using rte_ipv4_phdr_cksum()
This is similar to case 2), but l2_len is different. It is supported
on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM and
@@ -181,9 +178,6 @@ a vxlan-encapsulated tcp packet:
mb->l4_len = len(in_tcp)
mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_TCP_CKSUM |
RTE_MBUF_F_TX_TCP_SEG;
- set in_ip checksum to 0 in the packet
- set in_tcp checksum to pseudo header without including the IP
- payload length using rte_ipv4_phdr_cksum()
This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_TCP_TSO.
Note that it can only work if outer L4 checksum is 0.
@@ -196,12 +190,10 @@ a vxlan-encapsulated tcp packet:
mb->l3_len = len(in_ip)
mb->ol_flags |= RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IP_CKSUM | \
RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_TCP_CKSUM;
- set out_ip checksum to 0 in the packet
- set in_ip checksum to 0 in the packet
- set in_tcp checksum to pseudo header using rte_ipv4_phdr_cksum()
This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM,
RTE_ETH_TX_OFFLOAD_UDP_CKSUM and RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM.
+ Note that it can only work if outer L4 checksum is 0.
The list of flags and their precise meaning is described in the mbuf API
documentation (rte_mbuf.h). Also refer to the testpmd source code
--
2.44.0
* [PATCH v2 4/8] app/testpmd: fix outer IP checksum offload
[not found] ` <20240405144604.906695-1-david.marchand@redhat.com>
2024-04-05 14:45 ` [PATCH v2 3/8] mbuf: fix Tx checksum offload examples David Marchand
@ 2024-04-05 14:45 ` David Marchand
2024-04-05 14:45 ` [PATCH v2 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
` (2 subsequent siblings)
4 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-05 14:45 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Aman Singh, Yuying Zhang,
Tomasz Kulasek, Konstantin Ananyev, Olivier Matz
Resetting the outer IP checksum to 0 is not something mandated by the
mbuf API and is done by rte_eth_tx_prepare(), or per driver if needed.
Fixes: 4fb7e803eb1a ("ethdev: add Tx preparation")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
app/test-pmd/csumonly.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 6711dda42e..f5125c2788 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -583,15 +583,17 @@ process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
uint64_t ol_flags = 0;
if (info->outer_ethertype == _htons(RTE_ETHER_TYPE_IPV4)) {
- ipv4_hdr->hdr_checksum = 0;
ol_flags |= RTE_MBUF_F_TX_OUTER_IPV4;
- if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM)
+ if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM) {
ol_flags |= RTE_MBUF_F_TX_OUTER_IP_CKSUM;
- else
+ } else {
+ ipv4_hdr->hdr_checksum = 0;
ipv4_hdr->hdr_checksum = rte_ipv4_cksum(ipv4_hdr);
- } else
+ }
+ } else {
ol_flags |= RTE_MBUF_F_TX_OUTER_IPV6;
+ }
if (info->outer_l4_proto != IPPROTO_UDP)
return ol_flags;
--
2.44.0
* [PATCH v2 5/8] net: fix outer UDP checksum in Intel prepare helper
[not found] ` <20240405144604.906695-1-david.marchand@redhat.com>
2024-04-05 14:45 ` [PATCH v2 3/8] mbuf: fix Tx checksum offload examples David Marchand
2024-04-05 14:45 ` [PATCH v2 4/8] app/testpmd: fix outer IP checksum offload David Marchand
@ 2024-04-05 14:45 ` David Marchand
2024-04-05 14:46 ` [PATCH v2 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-05 14:46 ` [PATCH v2 7/8] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
4 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-05 14:45 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Aman Singh, Yuying Zhang, Jie Hai,
Yisen Zhuang, Ting Xu
Setting a pseudo header checksum in the outer UDP checksum is an Intel
(and some other vendors') requirement.
Applications (like OVS) requesting outer UDP checksum without doing this
extra setup have broken outer UDP checksums.
Move this specific setup from testpmd to the "common" helper
rte_net_intel_cksum_flags_prepare().
net/hns3 can then be adjusted.
Bugzilla ID: 1406
Fixes: d8e5e69f3a9b ("app/testpmd: add GTP parsing and Tx checksum offload")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
app/test-pmd/csumonly.c | 11 +----
drivers/net/hns3/hns3_rxtx.c | 93 ++++++++++--------------------------
lib/net/rte_net.h | 18 ++++++-
3 files changed, 44 insertions(+), 78 deletions(-)
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index f5125c2788..71add6ca47 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -577,8 +577,6 @@ static uint64_t
process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
uint64_t tx_offloads, int tso_enabled, struct rte_mbuf *m)
{
- struct rte_ipv4_hdr *ipv4_hdr = outer_l3_hdr;
- struct rte_ipv6_hdr *ipv6_hdr = outer_l3_hdr;
struct rte_udp_hdr *udp_hdr;
uint64_t ol_flags = 0;
@@ -588,6 +586,8 @@ process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM) {
ol_flags |= RTE_MBUF_F_TX_OUTER_IP_CKSUM;
} else {
+ struct rte_ipv4_hdr *ipv4_hdr = outer_l3_hdr;
+
ipv4_hdr->hdr_checksum = 0;
ipv4_hdr->hdr_checksum = rte_ipv4_cksum(ipv4_hdr);
}
@@ -608,13 +608,6 @@ process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
/* Skip SW outer UDP checksum generation if HW supports it */
if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM) {
- if (info->outer_ethertype == _htons(RTE_ETHER_TYPE_IPV4))
- udp_hdr->dgram_cksum
- = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
- else
- udp_hdr->dgram_cksum
- = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
-
ol_flags |= RTE_MBUF_F_TX_OUTER_UDP_CKSUM;
return ol_flags;
}
diff --git a/drivers/net/hns3/hns3_rxtx.c b/drivers/net/hns3/hns3_rxtx.c
index 7e636a0a2e..03fc919fd7 100644
--- a/drivers/net/hns3/hns3_rxtx.c
+++ b/drivers/net/hns3/hns3_rxtx.c
@@ -3616,58 +3616,6 @@ hns3_pkt_need_linearized(struct rte_mbuf *tx_pkts, uint32_t bd_num,
return false;
}
-static bool
-hns3_outer_ipv4_cksum_prepared(struct rte_mbuf *m, uint64_t ol_flags,
- uint32_t *l4_proto)
-{
- struct rte_ipv4_hdr *ipv4_hdr;
- ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
- m->outer_l2_len);
- if (ol_flags & RTE_MBUF_F_TX_OUTER_IP_CKSUM)
- ipv4_hdr->hdr_checksum = 0;
- if (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) {
- struct rte_udp_hdr *udp_hdr;
- /*
- * If OUTER_UDP_CKSUM is support, HW can calculate the pseudo
- * header for TSO packets
- */
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG)
- return true;
- udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
- m->outer_l2_len + m->outer_l3_len);
- udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
-
- return true;
- }
- *l4_proto = ipv4_hdr->next_proto_id;
- return false;
-}
-
-static bool
-hns3_outer_ipv6_cksum_prepared(struct rte_mbuf *m, uint64_t ol_flags,
- uint32_t *l4_proto)
-{
- struct rte_ipv6_hdr *ipv6_hdr;
- ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *,
- m->outer_l2_len);
- if (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) {
- struct rte_udp_hdr *udp_hdr;
- /*
- * If OUTER_UDP_CKSUM is support, HW can calculate the pseudo
- * header for TSO packets
- */
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG)
- return true;
- udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
- m->outer_l2_len + m->outer_l3_len);
- udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
-
- return true;
- }
- *l4_proto = ipv6_hdr->proto;
- return false;
-}
-
static void
hns3_outer_header_cksum_prepare(struct rte_mbuf *m)
{
@@ -3675,29 +3623,38 @@ hns3_outer_header_cksum_prepare(struct rte_mbuf *m)
uint32_t paylen, hdr_len, l4_proto;
struct rte_udp_hdr *udp_hdr;
- if (!(ol_flags & (RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IPV6)))
+ if (!(ol_flags & (RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IPV6)) &&
+ ((ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) ||
+ !(ol_flags & RTE_MBUF_F_TX_TCP_SEG)))
return;
if (ol_flags & RTE_MBUF_F_TX_OUTER_IPV4) {
- if (hns3_outer_ipv4_cksum_prepared(m, ol_flags, &l4_proto))
- return;
+ struct rte_ipv4_hdr *ipv4_hdr;
+
+ ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
+ m->outer_l2_len);
+ l4_proto = ipv4_hdr->next_proto_id;
} else {
- if (hns3_outer_ipv6_cksum_prepared(m, ol_flags, &l4_proto))
- return;
+ struct rte_ipv6_hdr *ipv6_hdr;
+
+ ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *,
+ m->outer_l2_len);
+ l4_proto = ipv6_hdr->proto;
}
+ if (l4_proto != IPPROTO_UDP)
+ return;
+
/* driver should ensure the outer udp cksum is 0 for TUNNEL TSO */
- if (l4_proto == IPPROTO_UDP && (ol_flags & RTE_MBUF_F_TX_TCP_SEG)) {
- hdr_len = m->l2_len + m->l3_len + m->l4_len;
- hdr_len += m->outer_l2_len + m->outer_l3_len;
- paylen = m->pkt_len - hdr_len;
- if (paylen <= m->tso_segsz)
- return;
- udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
- m->outer_l2_len +
- m->outer_l3_len);
- udp_hdr->dgram_cksum = 0;
- }
+ hdr_len = m->l2_len + m->l3_len + m->l4_len;
+ hdr_len += m->outer_l2_len + m->outer_l3_len;
+ paylen = m->pkt_len - hdr_len;
+ if (paylen <= m->tso_segsz)
+ return;
+ udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
+ m->outer_l2_len +
+ m->outer_l3_len);
+ udp_hdr->dgram_cksum = 0;
}
static int
diff --git a/lib/net/rte_net.h b/lib/net/rte_net.h
index ef3ff4c6fd..efd9d5f5ee 100644
--- a/lib/net/rte_net.h
+++ b/lib/net/rte_net.h
@@ -121,7 +121,8 @@ rte_net_intel_cksum_flags_prepare(struct rte_mbuf *m, uint64_t ol_flags)
* no offloads are requested.
*/
if (!(ol_flags & (RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_L4_MASK | RTE_MBUF_F_TX_TCP_SEG |
- RTE_MBUF_F_TX_UDP_SEG | RTE_MBUF_F_TX_OUTER_IP_CKSUM)))
+ RTE_MBUF_F_TX_UDP_SEG | RTE_MBUF_F_TX_OUTER_IP_CKSUM |
+ RTE_MBUF_F_TX_OUTER_UDP_CKSUM)))
return 0;
if (ol_flags & (RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IPV6)) {
@@ -135,6 +136,21 @@ rte_net_intel_cksum_flags_prepare(struct rte_mbuf *m, uint64_t ol_flags)
struct rte_ipv4_hdr *, m->outer_l2_len);
ipv4_hdr->hdr_checksum = 0;
}
+ if (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) {
+ if (ol_flags & RTE_MBUF_F_TX_OUTER_IPV4) {
+ ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
+ m->outer_l2_len);
+ udp_hdr = (struct rte_udp_hdr *)((char *)ipv4_hdr +
+ m->outer_l3_len);
+ udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, m->ol_flags);
+ } else {
+ ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *,
+ m->outer_l2_len);
+ udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
+ m->outer_l2_len + m->outer_l3_len);
+ udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, m->ol_flags);
+ }
+ }
}
/*
--
2.44.0
* [PATCH v2 6/8] net/i40e: fix outer UDP checksum offload for X710
[not found] ` <20240405144604.906695-1-david.marchand@redhat.com>
` (2 preceding siblings ...)
2024-04-05 14:45 ` [PATCH v2 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
@ 2024-04-05 14:46 ` David Marchand
2024-04-05 14:46 ` [PATCH v2 7/8] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
4 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-05 14:46 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Jun Wang, Yuying Zhang,
Beilei Xing, Jie Wang
According to the X710 datasheet (and confirmed in the field), X710
devices do not support outer checksum offload.
"""
8.4.4.2 Transmit L3 and L4 Integrity Offload
Tunneling UDP headers and GRE header are not offloaded while the
X710/XXV710/XL710 leaves their checksum field as is.
If a checksum is required, software should provide it as well as the inner
checksum value(s) that are required for the outer checksum.
"""
Fix Tx offload capabilities according to the hardware.
X722 may support such offload by setting I40E_TXD_CTX_QW0_L4T_CS_MASK.
Bugzilla ID: 1406
Fixes: 8cc79a1636cd ("net/i40e: fix forward outer IPv6 VXLAN")
Cc: stable@dpdk.org
Reported-by: Jun Wang <junwang01@cestc.cn>
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
Note: I do not have an X722 NIC. Intel devs, please check both the X710
and X722 series.
Changes since v1:
- fix inverted check,
---
.mailmap | 1 +
drivers/net/i40e/i40e_ethdev.c | 6 +++++-
drivers/net/i40e/i40e_rxtx.c | 9 +++++++++
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/.mailmap b/.mailmap
index 3843868716..091766eca7 100644
--- a/.mailmap
+++ b/.mailmap
@@ -719,6 +719,7 @@ Junjie Wan <wanjunjie@bytedance.com>
Jun Qiu <jun.qiu@jaguarmicro.com>
Jun W Zhou <junx.w.zhou@intel.com>
Junxiao Shi <git@mail1.yoursunny.com>
+Jun Wang <junwang01@cestc.cn>
Jun Yang <jun.yang@nxp.com>
Junyu Jiang <junyux.jiang@intel.com>
Juraj Linkeš <juraj.linkes@pantheon.tech>
diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 380ce1a720..6535c7c178 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -3862,8 +3862,12 @@ i40e_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
RTE_ETH_TX_OFFLOAD_IPIP_TNL_TSO |
RTE_ETH_TX_OFFLOAD_GENEVE_TNL_TSO |
RTE_ETH_TX_OFFLOAD_MULTI_SEGS |
- RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM |
dev_info->tx_queue_offload_capa;
+ if (hw->mac.type == I40E_MAC_X722) {
+ dev_info->tx_offload_capa |=
+ RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM;
+ }
+
dev_info->dev_capa =
RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP |
RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP;
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5d25ab4d3a..b4f7599cfc 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -295,6 +295,15 @@ i40e_parse_tunneling_params(uint64_t ol_flags,
*/
*cd_tunneling |= (tx_offload.l2_len >> 1) <<
I40E_TXD_CTX_QW0_NATLEN_SHIFT;
+
+ /**
+ * Calculate the tunneling UDP checksum (only supported with X722).
+ * Shall be set only if L4TUNT = 01b and EIPT is not zero
+ */
+ if ((*cd_tunneling & I40E_TXD_CTX_QW0_EXT_IP_MASK) &&
+ (*cd_tunneling & I40E_TXD_CTX_UDP_TUNNELING) &&
+ (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM))
+ *cd_tunneling |= I40E_TXD_CTX_QW0_L4T_CS_MASK;
}
static inline void
--
2.44.0
* [PATCH v2 7/8] net/iavf: remove outer UDP checksum offload for X710 VF
[not found] ` <20240405144604.906695-1-david.marchand@redhat.com>
` (3 preceding siblings ...)
2024-04-05 14:46 ` [PATCH v2 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
@ 2024-04-05 14:46 ` David Marchand
4 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-05 14:46 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Jingjing Wu, Zhichao Zeng,
Peng Zhang, Qi Zhang
According to the X710 datasheet, X710 devices do not support outer
checksum offload.
"""
8.4.4.2 Transmit L3 and L4 Integrity Offload
Tunneling UDP headers and GRE header are not offloaded while the
X710/XXV710/XL710 leaves their checksum field as is.
If a checksum is required, software should provide it as well as the inner
checksum value(s) that are required for the outer checksum.
"""
Fix Tx offload capabilities depending on the VF type.
Bugzilla ID: 1406
Fixes: f7c8c36fdeb7 ("net/iavf: enable inner and outer Tx checksum offload")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
doc/guides/nics/features/iavf.ini | 2 +-
drivers/net/iavf/iavf_ethdev.c | 5 ++++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/doc/guides/nics/features/iavf.ini b/doc/guides/nics/features/iavf.ini
index c59115ae15..ce9860e963 100644
--- a/doc/guides/nics/features/iavf.ini
+++ b/doc/guides/nics/features/iavf.ini
@@ -33,7 +33,7 @@ L3 checksum offload = Y
L4 checksum offload = Y
Timestamp offload = Y
Inner L3 checksum = Y
-Inner L4 checksum = Y
+Inner L4 checksum = P
Packet type parsing = Y
Rx descriptor status = Y
Tx descriptor status = Y
diff --git a/drivers/net/iavf/iavf_ethdev.c b/drivers/net/iavf/iavf_ethdev.c
index 245b3cd854..bbf915097e 100644
--- a/drivers/net/iavf/iavf_ethdev.c
+++ b/drivers/net/iavf/iavf_ethdev.c
@@ -1174,7 +1174,6 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
RTE_ETH_TX_OFFLOAD_TCP_CKSUM |
RTE_ETH_TX_OFFLOAD_SCTP_CKSUM |
RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM |
- RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM |
RTE_ETH_TX_OFFLOAD_TCP_TSO |
RTE_ETH_TX_OFFLOAD_VXLAN_TNL_TSO |
RTE_ETH_TX_OFFLOAD_GRE_TNL_TSO |
@@ -1183,6 +1182,10 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
RTE_ETH_TX_OFFLOAD_MULTI_SEGS |
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
+ /* X710 does not support outer udp checksum */
+ if (adapter->hw.mac.type != IAVF_MAC_XL710)
+ dev_info->tx_offload_capa |= RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM;
+
if (vf->vf_res->vf_cap_flags & VIRTCHNL_VF_OFFLOAD_CRC)
dev_info->rx_offload_capa |= RTE_ETH_RX_OFFLOAD_KEEP_CRC;
--
2.44.0
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-05 14:45 ` [PATCH v2 3/8] mbuf: fix Tx checksum offload examples David Marchand
@ 2024-04-05 16:20 ` Morten Brørup
2024-04-08 10:12 ` David Marchand
2024-04-09 13:38 ` Konstantin Ananyev
0 siblings, 2 replies; 30+ messages in thread
From: Morten Brørup @ 2024-04-05 16:20 UTC (permalink / raw)
To: David Marchand, dev
Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu,
Andrew Rybchenko, konstantin.ananyev, Ferruh Yigit, Kaiwen Deng,
qiming.yang, yidingx.zhou, Aman Singh, Yuying Zhang,
Thomas Monjalon, Jerin Jacob
> From: David Marchand [mailto:david.marchand@redhat.com]
> Sent: Friday, 5 April 2024 16.46
>
> Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
> examples.
I strongly disagree with this change!
It will cause a huge performance degradation for shaping applications:
A packet will be processed and finalized at an output or forwarding pipeline stage, where some other fields might also be written, so zeroing e.g. the out_ip checksum at this stage has low cost (no new cache misses).
Then, the packet might be queued for QoS or similar.
If rte_eth_tx_prepare() must be called at the egress pipeline stage, it has to write to the packet and cause a cache miss per packet, instead of simply passing on the packet to the NIC hardware.
It must be possible to finalize the packet at the output/forwarding pipeline stage!
Also, how is rte_eth_tx_prepare() supposed to work for cloned packets egressing on different NIC hardware?
In theory, it might get even worse if we make this opaque instead of transparent and standardized:
One PMD might reset out_ip checksum to 0x0000, and another PMD might reset it to 0xFFFF.
I can only see one solution:
We need to standardize on common minimum requirements for how to prepare packets for each TX offload.
* Re: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-05 16:20 ` Morten Brørup
@ 2024-04-08 10:12 ` David Marchand
2024-04-09 13:38 ` Konstantin Ananyev
1 sibling, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-08 10:12 UTC (permalink / raw)
To: Morten Brørup
Cc: dev, stable, Olivier Matz, Jijiang Liu, Andrew Rybchenko,
konstantin.ananyev, Ferruh Yigit, Kaiwen Deng, qiming.yang,
yidingx.zhou, Aman Singh, Yuying Zhang, Thomas Monjalon,
Jerin Jacob
On Fri, Apr 5, 2024 at 6:20 PM Morten Brørup <mb@smartsharesystems.com> wrote:
> > From: David Marchand [mailto:david.marchand@redhat.com]
> > Sent: Friday, 5 April 2024 16.46
> >
> > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
> > examples.
>
> I strongly disagree with this change!
Ok, I will withdraw this patch!
Revamping the checksum offload API is a larger piece of work that I
don't have the time (nor the expertise) to take on.
If you want to come up with a common API, I can review.
Thanks.
--
David Marchand
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-05 16:20 ` Morten Brørup
2024-04-08 10:12 ` David Marchand
@ 2024-04-09 13:38 ` Konstantin Ananyev
2024-04-09 14:44 ` Morten Brørup
1 sibling, 1 reply; 30+ messages in thread
From: Konstantin Ananyev @ 2024-04-09 13:38 UTC (permalink / raw)
To: Morten Brørup, David Marchand, dev
Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu,
Andrew Rybchenko, Ferruh Yigit, Kaiwen Deng, qiming.yang,
yidingx.zhou, Aman Singh, Yuying Zhang, Thomas Monjalon,
Jerin Jacob
> > From: David Marchand [mailto:david.marchand@redhat.com]
> > Sent: Friday, 5 April 2024 16.46
> >
> > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
> > examples.
>
> I strongly disagree with this change!
>
> It will cause a huge performance degradation for shaping applications:
>
> A packet will be processed and finalized at an output or forwarding pipeline stage, where some other fields might also be written, so
> zeroing e.g. the out_ip checksum at this stage has low cost (no new cache misses).
>
> Then, the packet might be queued for QoS or similar.
>
> If rte_eth_tx_prepare() must be called at the egress pipeline stage, it has to write to the packet and cause a cache miss per packet,
> instead of simply passing on the packet to the NIC hardware.
>
> It must be possible to finalize the packet at the output/forwarding pipeline stage!
If you can finalize your packet at the output/forwarding stage, then why
can't you invoke tx_prepare() at the same stage?
There seems to be some misunderstanding about what tx_prepare() does -
in fact it doesn't communicate with the HW queue (doesn't update the TXD
ring, etc.); all it does is make changes in the mbuf itself.
Yes, it reads some fields in the SW TX queue struct (max number of TXDs
per packet, etc.), but AFAIK it is safe to call tx_prepare() and
tx_burst() from different threads.
At least on the implementations I am aware of.
Just checked the docs - this does not seem to be stated explicitly
anywhere, which might be why it is causing such misunderstanding.
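For instance, a rough sketch of such a split (qos_ring, MAX_BURST and the
single port/queue are placeholders for the example, nothing more):

    /* forwarding/output stage: finalize mbufs and run tx_prepare() here */
    nb_prep = rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);
    rte_ring_enqueue_burst(qos_ring, (void **)pkts, nb_prep, NULL);

    /* egress stage (possibly another lcore): only hand packets to the HW */
    nb_deq = rte_ring_dequeue_burst(qos_ring, (void **)pkts, MAX_BURST, NULL);
    rte_eth_tx_burst(port_id, queue_id, pkts, nb_deq);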
>
> Also, how is rte_eth_tx_prepare() supposed to work for cloned packets egressing on different NIC hardware?
If you create a clone of the full packet (including L2/L3 headers), then
obviously such a construction might not work properly with tx_prepare()
over two different NICs.
Though in the majority of cases you clone the segments with data, while
at least the L2 headers are put into separate segments.
One simple approach would be to keep the L3 header in that separate
segment.
But yes, there is a problem when you need to send exactly the same
packet over different NICs.
As I remember, for the bonding PMD things don't work quite well here -
you might have a bond over 2 NICs with different tx_prepare(), and which
one to call might not be clear until the actual PMD tx_burst() is
invoked.
>
> In theory, it might get even worse if we make this opaque instead of transparent and standardized:
> One PMD might reset out_ip checksum to 0x0000, and another PMD might reset it to 0xFFFF.
>
> I can only see one solution:
> We need to standardize on common minimum requirements for how to prepare packets for each TX offload.
If we can make each and every vendor agree here - that will definitely
help simplify things quite a bit.
Then we can probably have one common tx_prepare() for all vendors ;)
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-09 13:38 ` Konstantin Ananyev
@ 2024-04-09 14:44 ` Morten Brørup
2024-04-10 10:35 ` Konstantin Ananyev
0 siblings, 1 reply; 30+ messages in thread
From: Morten Brørup @ 2024-04-09 14:44 UTC (permalink / raw)
To: Konstantin Ananyev, David Marchand, dev
Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu,
Andrew Rybchenko, Ferruh Yigit, Kaiwen Deng, qiming.yang,
yidingx.zhou, Aman Singh, Yuying Zhang, Thomas Monjalon,
Jerin Jacob
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Tuesday, 9 April 2024 15.39
>
> > > From: David Marchand [mailto:david.marchand@redhat.com]
> > > Sent: Friday, 5 April 2024 16.46
> > >
> > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
> > > examples.
> >
> > I strongly disagree with this change!
> >
> > It will cause a huge performance degradation for shaping applications:
> >
> > A packet will be processed and finalized at an output or forwarding
> pipeline stage, where some other fields might also be written, so
> > zeroing e.g. the out_ip checksum at this stage has low cost (no new
> cache misses).
> >
> > Then, the packet might be queued for QoS or similar.
> >
> > If rte_eth_tx_prepare() must be called at the egress pipeline stage,
> it has to write to the packet and cause a cache miss per packet,
> > instead of simply passing on the packet to the NIC hardware.
> >
> > It must be possible to finalize the packet at the output/forwarding
> pipeline stage!
>
> If you can finalize your packet on output/forwarding, then why you
> can't invoke tx_prepare() on the same stage?
> There seems to be some misunderstanding about what tx_prepare() does -
> in fact it doesn't communicate with HW queue (doesn't update TXD ring,
> etc.), what it does - just make changes in mbuf itself.
> Yes, it reads some fields in SW TX queue struct (max number of TXDs per
> packet, etc.), but AFAIK it is safe
> to call tx_prepare() and tx_burst() from different threads.
> At least on implementations I am aware about.
> Just checked the docs - it seems not stated explicitly anywhere, might
> be that's why it causing such misunderstanding.
>
> >
> > Also, how is rte_eth_tx_prepare() supposed to work for cloned packets
> egressing on different NIC hardware?
>
> If you create a clone of full packet (including L2/L3) headers then
> obviously such construction might not
> work properly with tx_prepare() over two different NICs.
> Though In majority of cases you do clone segments with data, while at
> least L2 headers are put into different segments.
> One simple approach would be to keep L3 header in that separate segment.
> But yes, there is a problem when you'll need to send exactly the same
> packet over different NICs.
> As I remember, for bonding PMD things don't work quite well here - you
> might have a bond over 2 NICs with
> different tx_prepare() and which one to call might be not clear till
> actual PMD tx_burst() is invoked.
>
> >
> > In theory, it might get even worse if we make this opaque instead of
> transparent and standardized:
> > One PMD might reset out_ip checksum to 0x0000, and another PMD might
> reset it to 0xFFFF.
>
> >
> > I can only see one solution:
> > We need to standardize on common minimum requirements for how to
> prepare packets for each TX offload.
>
> If we can make each and every vendor to agree here - that definitely
> will help to simplify things quite a bit.
An API is more than a function name and parameters.
It also has preconditions and postconditions.
All major NIC vendors are contributing to DPDK.
It should be possible to reach consensus for reasonable minimum requirements for offloads.
Hardware- and driver-specific exceptions can be documented with the offload flag, or with rte_eth_rx/tx_burst(), like the note to rte_eth_rx_burst():
"Some drivers using vector instructions require that nb_pkts is divisible by 4 or 8, depending on the driver implementation."
You mention the bonding driver, which is a good example.
The rte_eth_tx_burst() documentation has a note about the API postcondition exception for the bonding driver:
"This function must not modify mbufs (including packets data) unless the refcnt is 1. An exception is the bonding PMD, [...], mbufs may be modified."
> Then we can probably have one common tx_prepare() for all vendors ;)
Yes, that would be the goal.
More realistically, the ethdev layer could perform the common checks, and only the non-conforming drivers would have to implement their specific tweaks.
If we don't standardize the meaning of the offload flags, the application developers cannot trust them!
I'm afraid this is the current situation - application developers either test with specific NIC hardware, or don't use the offload features.
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-09 14:44 ` Morten Brørup
@ 2024-04-10 10:35 ` Konstantin Ananyev
2024-04-10 12:20 ` Morten Brørup
0 siblings, 1 reply; 30+ messages in thread
From: Konstantin Ananyev @ 2024-04-10 10:35 UTC (permalink / raw)
To: Morten Brørup, David Marchand, dev
Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu,
Andrew Rybchenko, Ferruh Yigit, Kaiwen Deng, qiming.yang,
yidingx.zhou, Aman Singh, Yuying Zhang, Thomas Monjalon,
Jerin Jacob
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Tuesday, 9 April 2024 15.39
> >
> > > > From: David Marchand [mailto:david.marchand@redhat.com]
> > > > Sent: Friday, 5 April 2024 16.46
> > > >
> > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
> > > > examples.
> > >
> > > I strongly disagree with this change!
> > >
> > > It will cause a huge performance degradation for shaping applications:
> > >
> > > A packet will be processed and finalized at an output or forwarding
> > pipeline stage, where some other fields might also be written, so
> > > zeroing e.g. the out_ip checksum at this stage has low cost (no new
> > cache misses).
> > >
> > > Then, the packet might be queued for QoS or similar.
> > >
> > > If rte_eth_tx_prepare() must be called at the egress pipeline stage,
> > it has to write to the packet and cause a cache miss per packet,
> > > instead of simply passing on the packet to the NIC hardware.
> > >
> > > It must be possible to finalize the packet at the output/forwarding
> > pipeline stage!
> >
> > If you can finalize your packet on output/forwarding, then why you
> > can't invoke tx_prepare() on the same stage?
> > There seems to be some misunderstanding about what tx_prepare() does -
> > in fact it doesn't communicate with HW queue (doesn't update TXD ring,
> > etc.), what it does - just make changes in mbuf itself.
> > Yes, it reads some fields in SW TX queue struct (max number of TXDs per
> > packet, etc.), but AFAIK it is safe
> > to call tx_prepare() and tx_burst() from different threads.
> > At least on implementations I am aware about.
> > Just checked the docs - it seems not stated explicitly anywhere, might
> > be that's why it causing such misunderstanding.
> >
> > >
> > > Also, how is rte_eth_tx_prepare() supposed to work for cloned packets
> > egressing on different NIC hardware?
> >
> > If you create a clone of full packet (including L2/L3) headers then
> > obviously such construction might not
> > work properly with tx_prepare() over two different NICs.
> > Though In majority of cases you do clone segments with data, while at
> > least L2 headers are put into different segments.
> > One simple approach would be to keep L3 header in that separate segment.
> > But yes, there is a problem when you'll need to send exactly the same
> > packet over different NICs.
> > As I remember, for bonding PMD things don't work quite well here - you
> > might have a bond over 2 NICs with
> > different tx_prepare() and which one to call might be not clear till
> > actual PMD tx_burst() is invoked.
> >
> > >
> > > In theory, it might get even worse if we make this opaque instead of
> > transparent and standardized:
> > > One PMD might reset out_ip checksum to 0x0000, and another PMD might
> > reset it to 0xFFFF.
> >
> > >
> > > I can only see one solution:
> > > We need to standardize on common minimum requirements for how to
> > prepare packets for each TX offload.
> >
> > If we can make each and every vendor to agree here - that definitely
> > will help to simplify things quite a bit.
>
> An API is more than a function name and parameters.
> It also has preconditions and postconditions.
>
> All major NIC vendors are contributing to DPDK.
> It should be possible to reach consensus for reasonable minimum requirements for offloads.
> Hardware- and driver-specific exceptions can be documented with the offload flag, or with rte_eth_rx/tx_burst(), like the note to
> rte_eth_rx_burst():
> "Some drivers using vector instructions require that nb_pkts is divisible by 4 or 8, depending on the driver implementation."
If we introduce a rule that everyone is supposed to follow and then
straight away allow people to have 'documented exceptions', for me that
means 'no rule' in practice.
A 'documented exceptions' approach might work if you have 5 different PMDs to support, but not when you have 50+.
No one would write an app with 10 possible exception cases in mind.
Again, with such an approach we can forget about backward compatibility.
I think we already had this discussion before; my opinion remains the same here -
the 'documented exceptions' approach is a road to trouble.
> You mention the bonding driver, which is a good example.
> The rte_eth_tx_burst() documentation has a note about the API postcondition exception for the bonding driver:
> "This function must not modify mbufs (including packets data) unless the refcnt is 1. An exception is the bonding PMD, [...], mbufs
> may be modified."
For me, what we've done for bonding tx_prepare()/tx_burst() is a really bad example.
The initial agreement and design choice was that tx_burst() should not modify the contents of the packets
(that was actually one of the reasons why tx_prepare() was introduced).
The only reason I agreed to that exception is that I couldn't come up with anything less ugly.
Actually, these problems with the bonding PMD made me start thinking that the current
tx_prepare()/tx_burst() approach might need to be reconsidered somehow.
> > Then we can probably have one common tx_prepare() for all vendors ;)
>
> Yes, that would be the goal.
> More realistically, the ethdev layer could perform the common checks, and only the non-conforming drivers would have to implement
> their specific tweaks.
Hmm, but that's what we have right now:
- fields in the mbuf and packet data that the user has to fill correctly, plus a device-specific tx_prepare().
How would what you suggest differ from that?
And how would it help with, say, the bonding PMD situation, or with TX-ing the same packet over 2 different NICs?
> If we don't standardize the meaning of the offload flags, the application developers cannot trust them!
> I'm afraid this is the current situation - application developers either test with specific NIC hardware, or don't use the offload features.
Well, I have used TX offloads in several projects, and it worked quite well.
Though I have to admit, I never had to use TX offloads together with our bonding PMD.
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-10 10:35 ` Konstantin Ananyev
@ 2024-04-10 12:20 ` Morten Brørup
2024-04-12 12:46 ` Konstantin Ananyev
0 siblings, 1 reply; 30+ messages in thread
From: Morten Brørup @ 2024-04-10 12:20 UTC (permalink / raw)
To: Konstantin Ananyev, David Marchand, dev
Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu,
Andrew Rybchenko, Ferruh Yigit, Kaiwen Deng, qiming.yang,
yidingx.zhou, Aman Singh, Yuying Zhang, Thomas Monjalon,
Jerin Jacob
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Wednesday, 10 April 2024 12.35
>
> > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > Sent: Tuesday, 9 April 2024 15.39
> > >
> > > > > From: David Marchand [mailto:david.marchand@redhat.com]
> > > > > Sent: Friday, 5 April 2024 16.46
> > > > >
> > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
> > > > > examples.
> > > >
> > > > I strongly disagree with this change!
> > > >
> > > > It will cause a huge performance degradation for shaping applications:
> > > >
> > > > A packet will be processed and finalized at an output or forwarding
> > > pipeline stage, where some other fields might also be written, so
> > > > zeroing e.g. the out_ip checksum at this stage has low cost (no new
> > > cache misses).
> > > >
> > > > Then, the packet might be queued for QoS or similar.
> > > >
> > > > If rte_eth_tx_prepare() must be called at the egress pipeline stage,
> > > it has to write to the packet and cause a cache miss per packet,
> > > > instead of simply passing on the packet to the NIC hardware.
> > > >
> > > > It must be possible to finalize the packet at the output/forwarding
> > > pipeline stage!
> > >
> > > If you can finalize your packet on output/forwarding, then why you
> > > can't invoke tx_prepare() on the same stage?
> > > There seems to be some misunderstanding about what tx_prepare() does -
> > > in fact it doesn't communicate with HW queue (doesn't update TXD ring,
> > > etc.), what it does - just make changes in mbuf itself.
> > > Yes, it reads some fields in SW TX queue struct (max number of TXDs per
> > > packet, etc.), but AFAIK it is safe
> > > to call tx_prepare() and tx_burst() from different threads.
> > > At least on implementations I am aware about.
> > > Just checked the docs - it seems not stated explicitly anywhere, might
> > > be that's why it causing such misunderstanding.
> > >
> > > >
> > > > Also, how is rte_eth_tx_prepare() supposed to work for cloned packets
> > > egressing on different NIC hardware?
> > >
> > > If you create a clone of full packet (including L2/L3) headers then
> > > obviously such construction might not
> > > work properly with tx_prepare() over two different NICs.
> > > Though In majority of cases you do clone segments with data, while at
> > > least L2 headers are put into different segments.
> > > One simple approach would be to keep L3 header in that separate segment.
> > > But yes, there is a problem when you'll need to send exactly the same
> > > packet over different NICs.
> > > As I remember, for bonding PMD things don't work quite well here - you
> > > might have a bond over 2 NICs with
> > > different tx_prepare() and which one to call might be not clear till
> > > actual PMD tx_burst() is invoked.
> > >
> > > >
> > > > In theory, it might get even worse if we make this opaque instead of
> > > transparent and standardized:
> > > > One PMD might reset out_ip checksum to 0x0000, and another PMD might
> > > reset it to 0xFFFF.
> > >
> > > >
> > > > I can only see one solution:
> > > > We need to standardize on common minimum requirements for how to
> > > prepare packets for each TX offload.
> > >
> > > If we can make each and every vendor to agree here - that definitely
> > > will help to simplify things quite a bit.
> >
> > An API is more than a function name and parameters.
> > It also has preconditions and postconditions.
> >
> > All major NIC vendors are contributing to DPDK.
> > It should be possible to reach consensus for reasonable minimum requirements
> for offloads.
> > Hardware- and driver-specific exceptions can be documented with the offload
> flag, or with rte_eth_rx/tx_burst(), like the note to
> > rte_eth_rx_burst():
> > "Some drivers using vector instructions require that nb_pkts is divisible by
> 4 or 8, depending on the driver implementation."
>
> If we introduce a rule that everyone supposed to follow and then straightway
> allow people to have a 'documented exceptions',
> for me it means like 'no rule' in practice.
> A 'documented exceptions' approach might work if you have 5 different PMDs to
> support, but not when you have 50+.
> No-one would write an app with possible 10 different exception cases in his
> head.
> Again, with such approach we can forget about backward compatibility.
> I think we already had this discussion before, my opinion remains the same
> here -
> 'documented exceptions' approach is a way to trouble.
The "minimum requirements" should be the lowest common denominator of all NICs.
Exceptions should be extremely few, reserved for outlier NICs that still want to provide an offload even though their driver is unable to live up to the minimum requirements.
Any exception should require techboard approval. If a NIC/driver does not support the "minimum requirements" for an offload feature, it is not allowed to claim support for that offload feature, or needs to seek approval for an exception.
As another option for NICs not supporting the minimum requirements of an offload feature, we could introduce offload flags with finer granularity. E.g. one offload flag for "gold standard" TX checksum update (where the packet's checksum field can have any value), and another offload flag for "silver standard" TX checksum update (where the packet's checksum field must have a precomputed value).
For reference, consider RSS, where the feature support flags have very high granularity.
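To illustrate the difference between the two proposed variants, below is a minimal sketch of how an application might prepare a plain IPv4/TCP mbuf in each case. The "assisted" branch, the separate capability flag it implies, and the helper name prepare_tcp_cksum() are hypothetical, taken from the proposal above; the mbuf flags and helpers used are existing DPDK API, and l2_len/l3_len are assumed to be already set.

    #include <rte_mbuf.h>
    #include <rte_ip.h>
    #include <rte_tcp.h>

    /* Prepare an IPv4/TCP mbuf for HW checksum offload.
     * assisted == 0: "gold standard" - HW ignores the current checksum fields.
     * assisted != 0: hypothetical "silver standard" - the driver expects the
     * IP checksum field zeroed and the L4 checksum field seeded with the
     * pseudo-header checksum. */
    static void
    prepare_tcp_cksum(struct rte_mbuf *m, int assisted)
    {
        struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(m,
                struct rte_ipv4_hdr *, m->l2_len);
        struct rte_tcp_hdr *tcp = rte_pktmbuf_mtod_offset(m,
                struct rte_tcp_hdr *, m->l2_len + m->l3_len);

        m->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM |
                RTE_MBUF_F_TX_TCP_CKSUM;

        if (assisted) {
            ip->hdr_checksum = 0;
            tcp->cksum = rte_ipv4_phdr_cksum(ip, m->ol_flags);
        }
        /* Nothing else to do in the "gold standard" case. */
    }

With two distinct capability flags, the application could select the branch once per port at configuration time instead of guessing per driver.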
>
> > You mention the bonding driver, which is a good example.
> > The rte_eth_tx_burst() documentation has a note about the API postcondition
> exception for the bonding driver:
> > "This function must not modify mbufs (including packets data) unless the
> refcnt is 1. An exception is the bonding PMD, [...], mbufs
> > may be modified."
>
> For me, what we've done for bonding tx_prepare/tx_burst() is a really bad
> example.
> Initial agreement and design choice was that tx_burst() should not modify
> contents of the packets
> (that actually was one of the reasons why tx_prepare() was introduced).
> The only reason I agreed on that exception - because I couldn't come-up with
> something less uglier.
>
> Actually, these problems with bonding PMD made me to start thinking that
> current
> tx_prepare/tx_burst approach might need to be reconsidered somehow.
In cases where a preceding call to tx_prepare() is required, how is it worse modifying the packet in tx_burst() than modifying the packet in tx_prepare()?
Both cases violate the postcondition that packets are not modified at egress.
>
> > > Then we can probably have one common tx_prepare() for all vendors ;)
> >
> > Yes, that would be the goal.
> > More realistically, the ethdev layer could perform the common checks, and
> only the non-conforming drivers would have to implement
> > their specific tweaks.
>
> Hmm, but that's what we have right now:
> - fields in mbuf and packet data that user has to fill correctly and dev
> specific tx_prepare().
> How what you suggest will differ then?
You're 100 % right here. We could move more checks into the ethdev layer, specifically checks related to the "minimum requirements".
> And how it will help let say with bonding PMD situation, or with TX-ing of the
> same packet over 2 different NICs?
The bonding driver is broken.
It can only be fixed by not violating the egress postcondition in either tx_burst() or tx_prepare().
"Minimum requirements" might help doing that.
>
> > If we don't standardize the meaning of the offload flags, the application
> developers cannot trust them!
> > I'm afraid this is the current situation - application developers either
> test with specific NIC hardware, or don't use the offload features.
>
> Well, I have used TX offloads through several projects, it worked quite well.
That is good to hear.
And I don't oppose that.
In this discussion, I am worried about the roadmap direction for DPDK.
I oppose the concept of requiring a call to tx_prepare() before calling tx_burst() when using offloads. I think it is conceptually wrong and breaks the egress postcondition.
I propose "minimum requirements" as a better solution.
> Though have to admit, never have to use TX offloads together with our bonding
> PMD.
>
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-10 12:20 ` Morten Brørup
@ 2024-04-12 12:46 ` Konstantin Ananyev
2024-04-12 14:44 ` Morten Brørup
0 siblings, 1 reply; 30+ messages in thread
From: Konstantin Ananyev @ 2024-04-12 12:46 UTC (permalink / raw)
To: Morten Brørup, David Marchand, dev
Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu,
Andrew Rybchenko, Ferruh Yigit, Kaiwen Deng, qiming.yang,
yidingx.zhou, Aman Singh, Yuying Zhang, Thomas Monjalon,
Jerin Jacob
> > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
> > > > > > examples.
> > > > >
> > > > > I strongly disagree with this change!
> > > > >
> > > > > It will cause a huge performance degradation for shaping applications:
> > > > >
> > > > > A packet will be processed and finalized at an output or forwarding
> > > > pipeline stage, where some other fields might also be written, so
> > > > > zeroing e.g. the out_ip checksum at this stage has low cost (no new
> > > > cache misses).
> > > > >
> > > > > Then, the packet might be queued for QoS or similar.
> > > > >
> > > > > If rte_eth_tx_prepare() must be called at the egress pipeline stage,
> > > > it has to write to the packet and cause a cache miss per packet,
> > > > > instead of simply passing on the packet to the NIC hardware.
> > > > >
> > > > > It must be possible to finalize the packet at the output/forwarding
> > > > pipeline stage!
> > > >
> > > > If you can finalize your packet on output/forwarding, then why you
> > > > can't invoke tx_prepare() on the same stage?
> > > > There seems to be some misunderstanding about what tx_prepare() does -
> > > > in fact it doesn't communicate with HW queue (doesn't update TXD ring,
> > > > etc.), what it does - just make changes in mbuf itself.
> > > > Yes, it reads some fields in SW TX queue struct (max number of TXDs per
> > > > packet, etc.), but AFAIK it is safe
> > > > to call tx_prepare() and tx_burst() from different threads.
> > > > At least on implementations I am aware about.
> > > > Just checked the docs - it seems not stated explicitly anywhere, might
> > > > be that's why it causing such misunderstanding.
> > > >
> > > > >
> > > > > Also, how is rte_eth_tx_prepare() supposed to work for cloned packets
> > > > egressing on different NIC hardware?
> > > >
> > > > If you create a clone of full packet (including L2/L3) headers then
> > > > obviously such construction might not
> > > > work properly with tx_prepare() over two different NICs.
> > > > Though In majority of cases you do clone segments with data, while at
> > > > least L2 headers are put into different segments.
> > > > One simple approach would be to keep L3 header in that separate segment.
> > > > But yes, there is a problem when you'll need to send exactly the same
> > > > packet over different NICs.
> > > > As I remember, for bonding PMD things don't work quite well here - you
> > > > might have a bond over 2 NICs with
> > > > different tx_prepare() and which one to call might be not clear till
> > > > actual PMD tx_burst() is invoked.
> > > >
> > > > >
> > > > > In theory, it might get even worse if we make this opaque instead of
> > > > transparent and standardized:
> > > > > One PMD might reset out_ip checksum to 0x0000, and another PMD might
> > > > reset it to 0xFFFF.
> > > >
> > > > >
> > > > > I can only see one solution:
> > > > > We need to standardize on common minimum requirements for how to
> > > > prepare packets for each TX offload.
> > > >
> > > > If we can make each and every vendor to agree here - that definitely
> > > > will help to simplify things quite a bit.
> > >
> > > An API is more than a function name and parameters.
> > > It also has preconditions and postconditions.
> > >
> > > All major NIC vendors are contributing to DPDK.
> > > It should be possible to reach consensus for reasonable minimum requirements
> > for offloads.
> > > Hardware- and driver-specific exceptions can be documented with the offload
> > flag, or with rte_eth_rx/tx_burst(), like the note to
> > > rte_eth_rx_burst():
> > > "Some drivers using vector instructions require that nb_pkts is divisible by
> > 4 or 8, depending on the driver implementation."
> >
> > If we introduce a rule that everyone supposed to follow and then straightway
> > allow people to have a 'documented exceptions',
> > for me it means like 'no rule' in practice.
> > A 'documented exceptions' approach might work if you have 5 different PMDs to
> > support, but not when you have 50+.
> > No-one would write an app with possible 10 different exception cases in his
> > head.
> > Again, with such approach we can forget about backward compatibility.
> > I think we already had this discussion before, my opinion remains the same
> > here -
> > 'documented exceptions' approach is a way to trouble.
>
> The "minimum requirements" should be the lowest common denominator of all NICs.
> Exceptions should be extremely few, for outlier NICs that still want to provide an offload and its driver is unable to live up to the
> minimum requirements.
> Any exception should require techboard approval. If a NIC/driver does not support the "minimum requirements" for an offload
> feature, it is not allowed to claim support for that offload feature, or needs to seek approval for an exception.
>
> As another option for NICs not supporting the minimum requirements of an offload feature, we could introduce offload flags with
> finer granularity. E.g. one offload flag for "gold standard" TX checksum update (where the packet's checksum field can have any
> value), and another offload flag for "silver standard" TX checksum update (where the packet's checksum field must have a
> precomputed value).
Actually yes, I was thinking in the same direction - we need some extra API to allow the user to distinguish.
We could probably do something like this: a new ethdev API call that takes a bitmap of TX offloads as a parameter
and in return specifies whether the PMD needs to modify the contents of the packet to support these
offloads or not.
Something like:
int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)
For the majority of drivers that satisfy these "minimum requirements", the corresponding devops
entry will be empty and we'll always return 0; otherwise the PMD has to provide a proper devop.
Then again, it would be up to the user to determine whether the same packet can be passed to 2 different NICs or not.
I suppose this is similar to what you were talking about?
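To make that proposal concrete, here is a minimal sketch of how such a query might look from the application side. The function is hypothetical (it does not exist in DPDK), the port_id parameter is an addition for illustration so the call could dispatch to a per-PMD devop, and the stub body only models the default "no modification needed" answer described above.

    #include <stdint.h>

    /* Hypothetical query (not part of DPDK today): return non-zero if the
     * PMD behind 'port_id' must modify packet contents (e.g. reset or seed
     * checksum fields) to perform the given set of TX offloads.
     * PMDs meeting the "minimum requirements" would leave the devop unset,
     * so the default answer is 0; this stub models only that default. */
    static int
    rte_ethdev_tx_offload_pkt_mod_required(uint16_t port_id, uint64_t tx_offloads)
    {
        (void)port_id;
        (void)tx_offloads;
        return 0;
    }

    /* Application side: decide at setup time whether one clone of a packet
     * may be transmitted unchanged on two different ports. */
    static int
    can_share_clone(uint16_t port_a, uint16_t port_b, uint64_t tx_offloads)
    {
        return rte_ethdev_tx_offload_pkt_mod_required(port_a, tx_offloads) == 0 &&
               rte_ethdev_tx_offload_pkt_mod_required(port_b, tx_offloads) == 0;
    }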
> For reference, consider RSS, where the feature support flags have very high granularity.
>
> >
> > > You mention the bonding driver, which is a good example.
> > > The rte_eth_tx_burst() documentation has a note about the API postcondition
> > exception for the bonding driver:
> > > "This function must not modify mbufs (including packets data) unless the
> > refcnt is 1. An exception is the bonding PMD, [...], mbufs
> > > may be modified."
> >
> > For me, what we've done for bonding tx_prepare/tx_burst() is a really bad
> > example.
> > Initial agreement and design choice was that tx_burst() should not modify
> > contents of the packets
> > (that actually was one of the reasons why tx_prepare() was introduced).
> > The only reason I agreed on that exception - because I couldn't come-up with
> > something less uglier.
> >
> > Actually, these problems with bonding PMD made me to start thinking that
> > current
> > tx_prepare/tx_burst approach might need to be reconsidered somehow.
>
> In cases where a preceding call to tx_prepare() is required, how is it worse modifying the packet in tx_burst() than modifying the
> packet in tx_prepare()?
>
> Both cases violate the postcondition that packets are not modified at egress.
>
> >
> > > > Then we can probably have one common tx_prepare() for all vendors ;)
> > >
> > > Yes, that would be the goal.
> > > More realistically, the ethdev layer could perform the common checks, and
> > only the non-conforming drivers would have to implement
> > > their specific tweaks.
> >
> > Hmm, but that's what we have right now:
> > - fields in mbuf and packet data that user has to fill correctly and dev
> > specific tx_prepare().
> > How what you suggest will differ then?
>
> You're 100 % right here. We could move more checks into the ethdev layer, specifically checks related to the "minimum
> requirements".
>
> > And how it will help let say with bonding PMD situation, or with TX-ing of the
> > same packet over 2 different NICs?
>
> The bonding driver is broken.
> It can only be fixed by not violating the egress postcondition in either tx_burst() or tx_prepare().
> "Minimum requirements" might help doing that.
>
> >
> > > If we don't standardize the meaning of the offload flags, the application
> > developers cannot trust them!
> > > I'm afraid this is the current situation - application developers either
> > test with specific NIC hardware, or don't use the offload features.
> >
> > Well, I have used TX offloads through several projects, it worked quite well.
>
> That is good to hear.
> And I don't oppose to that.
>
> In this discussion, I am worried about the roadmap direction for DPDK.
> I oppose to the concept of requiring calling tx_prepare() before calling tx_burst() when using offload. I think it is conceptually wrong,
> and breaks the egress postcondition.
> I propose "minimum requirements" as a better solution.
>
> > Though have to admit, never have to use TX offloads together with our bonding
> > PMD.
> >
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-12 12:46 ` Konstantin Ananyev
@ 2024-04-12 14:44 ` Morten Brørup
2024-04-12 15:17 ` Konstantin Ananyev
2024-04-15 15:07 ` Ferruh Yigit
0 siblings, 2 replies; 30+ messages in thread
From: Morten Brørup @ 2024-04-12 14:44 UTC (permalink / raw)
To: Konstantin Ananyev, David Marchand, dev
Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu,
Andrew Rybchenko, Ferruh Yigit, Kaiwen Deng, qiming.yang,
yidingx.zhou, Aman Singh, Yuying Zhang, Thomas Monjalon,
Jerin Jacob
> > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum
> offload
> > > > > > > examples.
> > > > > >
> > > > > > I strongly disagree with this change!
> > > > > >
> > > > > > It will cause a huge performance degradation for shaping
> applications:
> > > > > >
> > > > > > A packet will be processed and finalized at an output or
> forwarding
> > > > > pipeline stage, where some other fields might also be written,
> so
> > > > > > zeroing e.g. the out_ip checksum at this stage has low cost
> (no new
> > > > > cache misses).
> > > > > >
> > > > > > Then, the packet might be queued for QoS or similar.
> > > > > >
> > > > > > If rte_eth_tx_prepare() must be called at the egress pipeline
> stage,
> > > > > it has to write to the packet and cause a cache miss per packet,
> > > > > > instead of simply passing on the packet to the NIC hardware.
> > > > > >
> > > > > > It must be possible to finalize the packet at the
> output/forwarding
> > > > > pipeline stage!
> > > > >
> > > > > If you can finalize your packet on output/forwarding, then why
> you
> > > > > can't invoke tx_prepare() on the same stage?
> > > > > There seems to be some misunderstanding about what tx_prepare()
> does -
> > > > > in fact it doesn't communicate with HW queue (doesn't update TXD
> ring,
> > > > > etc.), what it does - just make changes in mbuf itself.
> > > > > Yes, it reads some fields in SW TX queue struct (max number of
> TXDs per
> > > > > packet, etc.), but AFAIK it is safe
> > > > > to call tx_prepare() and tx_burst() from different threads.
> > > > > At least on implementations I am aware about.
> > > > > Just checked the docs - it seems not stated explicitly anywhere,
> might
> > > > > be that's why it causing such misunderstanding.
> > > > >
> > > > > >
> > > > > > Also, how is rte_eth_tx_prepare() supposed to work for cloned
> packets
> > > > > egressing on different NIC hardware?
> > > > >
> > > > > If you create a clone of full packet (including L2/L3) headers
> then
> > > > > obviously such construction might not
> > > > > work properly with tx_prepare() over two different NICs.
> > > > > Though In majority of cases you do clone segments with data,
> while at
> > > > > least L2 headers are put into different segments.
> > > > > One simple approach would be to keep L3 header in that separate
> segment.
> > > > > But yes, there is a problem when you'll need to send exactly the
> same
> > > > > packet over different NICs.
> > > > > As I remember, for bonding PMD things don't work quite well here
> - you
> > > > > might have a bond over 2 NICs with
> > > > > different tx_prepare() and which one to call might be not clear
> till
> > > > > actual PMD tx_burst() is invoked.
> > > > >
> > > > > >
> > > > > > In theory, it might get even worse if we make this opaque
> instead of
> > > > > transparent and standardized:
> > > > > > One PMD might reset out_ip checksum to 0x0000, and another PMD
> might
> > > > > reset it to 0xFFFF.
> > > > >
> > > > > >
> > > > > > I can only see one solution:
> > > > > > We need to standardize on common minimum requirements for how
> to
> > > > > prepare packets for each TX offload.
> > > > >
> > > > > If we can make each and every vendor to agree here - that
> definitely
> > > > > will help to simplify things quite a bit.
> > > >
> > > > An API is more than a function name and parameters.
> > > > It also has preconditions and postconditions.
> > > >
> > > > All major NIC vendors are contributing to DPDK.
> > > > It should be possible to reach consensus for reasonable minimum
> requirements
> > > for offloads.
> > > > Hardware- and driver-specific exceptions can be documented with
> the offload
> > > flag, or with rte_eth_rx/tx_burst(), like the note to
> > > > rte_eth_rx_burst():
> > > > "Some drivers using vector instructions require that nb_pkts is
> divisible by
> > > 4 or 8, depending on the driver implementation."
> > >
> > > If we introduce a rule that everyone supposed to follow and then
> straightway
> > > allow people to have a 'documented exceptions',
> > > for me it means like 'no rule' in practice.
> > > A 'documented exceptions' approach might work if you have 5
> different PMDs to
> > > support, but not when you have 50+.
> > > No-one would write an app with possible 10 different exception cases
> in his
> > > head.
> > > Again, with such approach we can forget about backward
> compatibility.
> > > I think we already had this discussion before, my opinion remains
> the same
> > > here -
> > > 'documented exceptions' approach is a way to trouble.
> >
> > The "minimum requirements" should be the lowest common denominator of
> all NICs.
> > Exceptions should be extremely few, for outlier NICs that still want
> to provide an offload and its driver is unable to live up to the
> > minimum requirements.
> > Any exception should require techboard approval. If a NIC/driver does
> not support the "minimum requirements" for an offload
> > feature, it is not allowed to claim support for that offload feature,
> or needs to seek approval for an exception.
> >
> > As another option for NICs not supporting the minimum requirements of
> an offload feature, we could introduce offload flags with
> > finer granularity. E.g. one offload flag for "gold standard" TX
> checksum update (where the packet's checksum field can have any
> > value), and another offload flag for "silver standard" TX checksum
> update (where the packet's checksum field must have a
> > precomputed value).
>
> Actually yes, I was thinking in the same direction - we need some extra
> API to allow user to distinguish.
> Probably we can do something like that: a new API for the ethdev call
> that would take as a parameter
> TX offloads bitmap and in return specify would it need to modify
> contents of packet to support these
> offloads or not.
> Something like:
> int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)
>
> For the majority of the drivers that satisfy these "minimum
> requirements" corresponding devops
> entry will be empty and we'll always return 0, otherwise PMD has to
> provide a proper devop.
> Then again, it would be up to the user, to determine can he pass same
> packet to 2 different NICs or not.
>
> I suppose it is similar to what you were talking about?
I was thinking of something simpler:
The NIC exposes its RX and TX offload capabilities to the application through the rx/tx_offload_capa and other fields in the rte_eth_dev_info structure returned by rte_eth_dev_info_get().
E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag set.
These capability flags (or enums) are mostly undocumented in the code, but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability means that the NIC is able to update the IPv4 header checksum at egress (on the wire, i.e. without modifying the mbuf or packet data), and that the application must set RTE_MBUF_F_TX_IP_CKSUM in the mbufs to utilize this offload.
I would define and document what each capability flag/enum exactly means, the minimum requirements (as defined by the DPDK community) for the driver to claim support for it, and the requirements for an application to use it.
For the sake of discussion, let's say that RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update capability (i.e. no requirements to the checksum field in the packet contents).
If some NIC requires the checksum field in the packet contents to have a precomputed value, the NIC would not be allowed to claim the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
Such a NIC would need to define and document a new capability, e.g. RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver standard" TX checksum update capability.
In other words: I would encode variations of offload capabilities directly in the capabilities flags.
Then we don't need additional APIs to help interpret those capabilities.
This way, the application can probe the NIC capabilities to determine what can be offloaded, and how to do it.
The application can be designed to:
1. use a common packet processing pipeline, utilizing only the lowest common denominator of the capabilities of all detected NICs (see the sketch below), or
2. use a packet processing pipeline, handling packets differently according to the capabilities of the involved NICs.
NB: There may be other variations than requiring packet contents to be modified, and they might be more granular.
E.g. a NIC might require assistance for TCP/UDP checksum offload, but not for IP checksum offload, so a single function telling whether the packet contents require modification would not suffice.
E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the rte_eth_dev_info structure doesn't expose information about the max number of segments it can handle.
PS: For backwards compatibility, we might define RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to support the current "minimum requirements", and add RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload.
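As a minimal sketch of option 1 above, using only existing ethdev API: intersect the advertised TX offload capabilities of all ports in use and let the common pipeline rely only on that subset. The helper name and the conservative error handling are placeholder choices.

    #include <stdint.h>
    #include <rte_ethdev.h>

    /* Intersect the advertised TX offload capabilities of all ports in use,
     * yielding the set a common packet processing pipeline may rely on. */
    static uint64_t
    common_tx_offloads(const uint16_t *ports, unsigned int nb_ports)
    {
        uint64_t capa = UINT64_MAX;
        unsigned int i;

        for (i = 0; i < nb_ports; i++) {
            struct rte_eth_dev_info info;

            if (rte_eth_dev_info_get(ports[i], &info) != 0)
                return 0; /* conservative: offload nothing on error */
            capa &= info.tx_offload_capa;
        }
        return capa;
    }

The application would then enable e.g. IP checksum offload only if RTE_ETH_TX_OFFLOAD_IPV4_CKSUM is set in the returned mask; a variant flag such as the hypothetical RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED would be tested the same way.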
>
> > For reference, consider RSS, where the feature support flags have very
> high granularity.
> >
> > >
> > > > You mention the bonding driver, which is a good example.
> > > > The rte_eth_tx_burst() documentation has a note about the API
> postcondition
> > > exception for the bonding driver:
> > > > "This function must not modify mbufs (including packets data)
> unless the
> > > refcnt is 1. An exception is the bonding PMD, [...], mbufs
> > > > may be modified."
> > >
> > > For me, what we've done for bonding tx_prepare/tx_burst() is a
> really bad
> > > example.
> > > Initial agreement and design choice was that tx_burst() should not
> modify
> > > contents of the packets
> > > (that actually was one of the reasons why tx_prepare() was
> introduced).
> > > The only reason I agreed on that exception - because I couldn't
> come-up with
> > > something less uglier.
> > >
> > > Actually, these problems with bonding PMD made me to start thinking
> that
> > > current
> > > tx_prepare/tx_burst approach might need to be reconsidered somehow.
> >
> > In cases where a preceding call to tx_prepare() is required, how is it
> worse modifying the packet in tx_burst() than modifying the
> > packet in tx_prepare()?
> >
> > Both cases violate the postcondition that packets are not modified at
> egress.
> >
> > >
> > > > > Then we can probably have one common tx_prepare() for all
> vendors ;)
> > > >
> > > > Yes, that would be the goal.
> > > > More realistically, the ethdev layer could perform the common
> checks, and
> > > only the non-conforming drivers would have to implement
> > > > their specific tweaks.
> > >
> > > Hmm, but that's what we have right now:
> > > - fields in mbuf and packet data that user has to fill correctly and
> dev
> > > specific tx_prepare().
> > > How what you suggest will differ then?
> >
> > You're 100 % right here. We could move more checks into the ethdev
> layer, specifically checks related to the "minimum
> > requirements".
> >
> > > And how it will help let say with bonding PMD situation, or with TX-
> ing of the
> > > same packet over 2 different NICs?
> >
> > The bonding driver is broken.
> > It can only be fixed by not violating the egress postcondition in
> either tx_burst() or tx_prepare().
> > "Minimum requirements" might help doing that.
> >
> > >
> > > > If we don't standardize the meaning of the offload flags, the
> application
> > > developers cannot trust them!
> > > > I'm afraid this is the current situation - application developers
> either
> > > test with specific NIC hardware, or don't use the offload features.
> > >
> > > Well, I have used TX offloads through several projects, it worked
> quite well.
> >
> > That is good to hear.
> > And I don't oppose to that.
> >
> > In this discussion, I am worried about the roadmap direction for DPDK.
> > I oppose to the concept of requiring calling tx_prepare() before
> calling tx_burst() when using offload. I think it is conceptually wrong,
> > and breaks the egress postcondition.
> > I propose "minimum requirements" as a better solution.
> >
> > > Though have to admit, never have to use TX offloads together with
> our bonding
> > > PMD.
> > >
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-12 14:44 ` Morten Brørup
@ 2024-04-12 15:17 ` Konstantin Ananyev
2024-04-12 15:54 ` Morten Brørup
2024-04-15 15:07 ` Ferruh Yigit
1 sibling, 1 reply; 30+ messages in thread
From: Konstantin Ananyev @ 2024-04-12 15:17 UTC (permalink / raw)
To: Morten Brørup, David Marchand, dev
Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu,
Andrew Rybchenko, Ferruh Yigit, Kaiwen Deng, qiming.yang,
yidingx.zhou, Aman Singh, Yuying Zhang, Thomas Monjalon,
Jerin Jacob
>
> > > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum
> > offload
> > > > > > > > examples.
> > > > > > >
> > > > > > > I strongly disagree with this change!
> > > > > > >
> > > > > > > It will cause a huge performance degradation for shaping
> > applications:
> > > > > > >
> > > > > > > A packet will be processed and finalized at an output or
> > forwarding
> > > > > > pipeline stage, where some other fields might also be written,
> > so
> > > > > > > zeroing e.g. the out_ip checksum at this stage has low cost
> > (no new
> > > > > > cache misses).
> > > > > > >
> > > > > > > Then, the packet might be queued for QoS or similar.
> > > > > > >
> > > > > > > If rte_eth_tx_prepare() must be called at the egress pipeline
> > stage,
> > > > > > it has to write to the packet and cause a cache miss per packet,
> > > > > > > instead of simply passing on the packet to the NIC hardware.
> > > > > > >
> > > > > > > It must be possible to finalize the packet at the
> > output/forwarding
> > > > > > pipeline stage!
> > > > > >
> > > > > > If you can finalize your packet on output/forwarding, then why
> > you
> > > > > > can't invoke tx_prepare() on the same stage?
> > > > > > There seems to be some misunderstanding about what tx_prepare()
> > does -
> > > > > > in fact it doesn't communicate with HW queue (doesn't update TXD
> > ring,
> > > > > > etc.), what it does - just make changes in mbuf itself.
> > > > > > Yes, it reads some fields in SW TX queue struct (max number of
> > TXDs per
> > > > > > packet, etc.), but AFAIK it is safe
> > > > > > to call tx_prepare() and tx_burst() from different threads.
> > > > > > At least on implementations I am aware about.
> > > > > > Just checked the docs - it seems not stated explicitly anywhere,
> > might
> > > > > > be that's why it causing such misunderstanding.
> > > > > >
> > > > > > >
> > > > > > > Also, how is rte_eth_tx_prepare() supposed to work for cloned
> > packets
> > > > > > egressing on different NIC hardware?
> > > > > >
> > > > > > If you create a clone of full packet (including L2/L3) headers
> > then
> > > > > > obviously such construction might not
> > > > > > work properly with tx_prepare() over two different NICs.
> > > > > > Though In majority of cases you do clone segments with data,
> > while at
> > > > > > least L2 headers are put into different segments.
> > > > > > One simple approach would be to keep L3 header in that separate
> > segment.
> > > > > > But yes, there is a problem when you'll need to send exactly the
> > same
> > > > > > packet over different NICs.
> > > > > > As I remember, for bonding PMD things don't work quite well here
> > - you
> > > > > > might have a bond over 2 NICs with
> > > > > > different tx_prepare() and which one to call might be not clear
> > till
> > > > > > actual PMD tx_burst() is invoked.
> > > > > >
> > > > > > >
> > > > > > > In theory, it might get even worse if we make this opaque
> > instead of
> > > > > > transparent and standardized:
> > > > > > > One PMD might reset out_ip checksum to 0x0000, and another PMD
> > might
> > > > > > reset it to 0xFFFF.
> > > > > >
> > > > > > >
> > > > > > > I can only see one solution:
> > > > > > > We need to standardize on common minimum requirements for how
> > to
> > > > > > prepare packets for each TX offload.
> > > > > >
> > > > > > If we can make each and every vendor to agree here - that
> > definitely
> > > > > > will help to simplify things quite a bit.
> > > > >
> > > > > An API is more than a function name and parameters.
> > > > > It also has preconditions and postconditions.
> > > > >
> > > > > All major NIC vendors are contributing to DPDK.
> > > > > It should be possible to reach consensus for reasonable minimum
> > requirements
> > > > for offloads.
> > > > > Hardware- and driver-specific exceptions can be documented with
> > the offload
> > > > flag, or with rte_eth_rx/tx_burst(), like the note to
> > > > > rte_eth_rx_burst():
> > > > > "Some drivers using vector instructions require that nb_pkts is
> > divisible by
> > > > 4 or 8, depending on the driver implementation."
> > > >
> > > > If we introduce a rule that everyone supposed to follow and then
> > straightway
> > > > allow people to have a 'documented exceptions',
> > > > for me it means like 'no rule' in practice.
> > > > A 'documented exceptions' approach might work if you have 5
> > different PMDs to
> > > > support, but not when you have 50+.
> > > > No-one would write an app with possible 10 different exception cases
> > in his
> > > > head.
> > > > Again, with such approach we can forget about backward
> > compatibility.
> > > > I think we already had this discussion before, my opinion remains
> > the same
> > > > here -
> > > > 'documented exceptions' approach is a way to trouble.
> > >
> > > The "minimum requirements" should be the lowest common denominator of
> > all NICs.
> > > Exceptions should be extremely few, for outlier NICs that still want
> > to provide an offload and its driver is unable to live up to the
> > > minimum requirements.
> > > Any exception should require techboard approval. If a NIC/driver does
> > not support the "minimum requirements" for an offload
> > > feature, it is not allowed to claim support for that offload feature,
> > or needs to seek approval for an exception.
> > >
> > > As another option for NICs not supporting the minimum requirements of
> > an offload feature, we could introduce offload flags with
> > > finer granularity. E.g. one offload flag for "gold standard" TX
> > checksum update (where the packet's checksum field can have any
> > > value), and another offload flag for "silver standard" TX checksum
> > update (where the packet's checksum field must have a
> > > precomputed value).
> >
> > Actually yes, I was thinking in the same direction - we need some extra
> > API to allow user to distinguish.
> > Probably we can do something like that: a new API for the ethdev call
> > that would take as a parameter
> > TX offloads bitmap and in return specify would it need to modify
> > contents of packet to support these
> > offloads or not.
> > Something like:
> > int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)
> >
> > For the majority of the drivers that satisfy these "minimum
> > requirements" corresponding devops
> > entry will be empty and we'll always return 0, otherwise PMD has to
> > provide a proper devop.
> > Then again, it would be up to the user, to determine can he pass same
> > packet to 2 different NICs or not.
> >
> > I suppose it is similar to what you were talking about?
>
> I was thinking something more simple:
>
> The NIC exposes its RX and TX offload capabilities to the application through the rx/tx_offload_capa and other fields in the
> rte_eth_dev_info structure returned by rte_eth_dev_info_get().
>
> E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag set.
> These capability flags (or enums) are mostly undocumented in the code, but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM
> capability means that the NIC is able to update the IPv4 header checksum at egress (on the wire, i.e. without modifying the mbuf or
> packet data), and that the application must set RTE_MBUF_F_TX_IP_CKSUM in the mbufs to utilize this offload.
> I would define and document what each capability flag/enum exactly means, the minimum requirements (as defined by the DPDK
> community) for the driver to claim support for it, and the requirements for an application to use it.
> For the sake of discussion, let's say that RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update capability
> (i.e. no requirements to the checksum field in the packet contents).
> If some NIC requires the checksum field in the packet contents to have a precomputed value, the NIC would not be allowed to claim
> the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
> Such a NIC would need to define and document a new capability, e.g. RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver
> standard" TX checksum update capability.
> In other words: I would encode variations of offload capabilities directly in the capabilities flags.
> Then we don't need additional APIs to help interpret those capabilities.
I understood your intention with different flags; yes, I think it should work too.
The reason I am not very fond of it is that it would require doubling the TX_OFFLOAD flags.
> This way, the application can probe the NIC capabilities to determine what can be offloaded, and how to do it.
>
> The application can be designed to:
> 1. use a common packet processing pipeline, utilizing only the lowest common capabilities denominator of all detected NICs, or
> 2. use a packet processing pipeline, handling packets differently according to the capabilities of the involved NICs.
>
> NB: There may be other variations than requiring packet contents to be modified, and they might be granular.
> E.g. a NIC might require assistance for TCP/UDP checksum offload, but not for IP checksum offload, so a function telling if packet
> contents requires modification would not suffice.
Why not?
If the user plans to use multiple TX offloads, provide a bitmask of all of them as an argument.
Let's say for both L3 and L4 cksum offloads it would be something like:
(RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | RTE_ETH_TX_OFFLOAD_UDP_CKSUM | RTE_ETH_TX_OFFLOAD_TCP_CKSUM)
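A short usage sketch, continuing the hypothetical query from the earlier message (with a port_id argument added for illustration); the RTE_ETH_TX_OFFLOAD_* flags are existing DPDK definitions from <rte_ethdev.h>.

    #include <stdint.h>
    #include <rte_ethdev.h>

    /* Fragment assuming the hypothetical rte_ethdev_tx_offload_pkt_mod_required()
     * sketched earlier in the thread. */
    static int
    offloads_touch_pkt(uint16_t port_id)
    {
        uint64_t tx_off = RTE_ETH_TX_OFFLOAD_IPV4_CKSUM |
                RTE_ETH_TX_OFFLOAD_UDP_CKSUM |
                RTE_ETH_TX_OFFLOAD_TCP_CKSUM;

        /* One query covers the whole set of offloads the app intends to use;
         * 0 means packet data stays untouched for all of them. */
        return rte_ethdev_tx_offload_pkt_mod_required(port_id, tx_off);
    }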
> E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the rte_eth_dev_info structure doesn't expose information about the max
> number of segments it can handle.
>
> PS: For backwards compatibility, we might define RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to support the
> current "minimum requirements", and add RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload.
>
>
> >
> > > For reference, consider RSS, where the feature support flags have very
> > high granularity.
> > >
> > > >
> > > > > You mention the bonding driver, which is a good example.
> > > > > The rte_eth_tx_burst() documentation has a note about the API
> > postcondition
> > > > exception for the bonding driver:
> > > > > "This function must not modify mbufs (including packets data)
> > unless the
> > > > refcnt is 1. An exception is the bonding PMD, [...], mbufs
> > > > > may be modified."
> > > >
> > > > For me, what we've done for bonding tx_prepare/tx_burst() is a
> > really bad
> > > > example.
> > > > Initial agreement and design choice was that tx_burst() should not
> > modify
> > > > contents of the packets
> > > > (that actually was one of the reasons why tx_prepare() was
> > introduced).
> > > > The only reason I agreed on that exception - because I couldn't
> > come-up with
> > > > something less uglier.
> > > >
> > > > Actually, these problems with bonding PMD made me to start thinking
> > that
> > > > current
> > > > tx_prepare/tx_burst approach might need to be reconsidered somehow.
> > >
> > > In cases where a preceding call to tx_prepare() is required, how is it
> > worse modifying the packet in tx_burst() than modifying the
> > > packet in tx_prepare()?
> > >
> > > Both cases violate the postcondition that packets are not modified at
> > egress.
> > >
> > > >
> > > > > > Then we can probably have one common tx_prepare() for all
> > vendors ;)
> > > > >
> > > > > Yes, that would be the goal.
> > > > > More realistically, the ethdev layer could perform the common
> > checks, and
> > > > only the non-conforming drivers would have to implement
> > > > > their specific tweaks.
> > > >
> > > > Hmm, but that's what we have right now:
> > > > - fields in mbuf and packet data that user has to fill correctly and
> > dev
> > > > specific tx_prepare().
> > > > How what you suggest will differ then?
> > >
> > > You're 100 % right here. We could move more checks into the ethdev
> > layer, specifically checks related to the "minimum
> > > requirements".
> > >
> > > > And how it will help let say with bonding PMD situation, or with TX-
> > ing of the
> > > > same packet over 2 different NICs?
> > >
> > > The bonding driver is broken.
> > > It can only be fixed by not violating the egress postcondition in
> > either tx_burst() or tx_prepare().
> > > "Minimum requirements" might help doing that.
> > >
> > > >
> > > > > If we don't standardize the meaning of the offload flags, the
> > application
> > > > developers cannot trust them!
> > > > > I'm afraid this is the current situation - application developers
> > either
> > > > test with specific NIC hardware, or don't use the offload features.
> > > >
> > > > Well, I have used TX offloads through several projects, it worked
> > quite well.
> > >
> > > That is good to hear.
> > > And I don't oppose to that.
> > >
> > > In this discussion, I am worried about the roadmap direction for DPDK.
> > > I oppose to the concept of requiring calling tx_prepare() before
> > calling tx_burst() when using offload. I think it is conceptually wrong,
> > > and breaks the egress postcondition.
> > > I propose "minimum requirements" as a better solution.
> > >
> > > > Though have to admit, never have to use TX offloads together with
> > our bonding
> > > > PMD.
> > > >
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-12 15:17 ` Konstantin Ananyev
@ 2024-04-12 15:54 ` Morten Brørup
2024-04-16 9:16 ` Konstantin Ananyev
0 siblings, 1 reply; 30+ messages in thread
From: Morten Brørup @ 2024-04-12 15:54 UTC (permalink / raw)
To: Konstantin Ananyev, David Marchand, dev
Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu,
Andrew Rybchenko, Ferruh Yigit, Kaiwen Deng, qiming.yang,
yidingx.zhou, Aman Singh, Yuying Zhang, Thomas Monjalon,
Jerin Jacob
> > > > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx
> checksum
> > > offload
> > > > > > > > > examples.
> > > > > > > >
> > > > > > > > I strongly disagree with this change!
> > > > > > > >
> > > > > > > > It will cause a huge performance degradation for shaping
> > > applications:
> > > > > > > >
> > > > > > > > A packet will be processed and finalized at an output or
> > > forwarding
> > > > > > > pipeline stage, where some other fields might also be
> written,
> > > so
> > > > > > > > zeroing e.g. the out_ip checksum at this stage has low
> cost
> > > (no new
> > > > > > > cache misses).
> > > > > > > >
> > > > > > > > Then, the packet might be queued for QoS or similar.
> > > > > > > >
> > > > > > > > If rte_eth_tx_prepare() must be called at the egress
> pipeline
> > > stage,
> > > > > > > it has to write to the packet and cause a cache miss per
> packet,
> > > > > > > > instead of simply passing on the packet to the NIC
> hardware.
> > > > > > > >
> > > > > > > > It must be possible to finalize the packet at the
> > > output/forwarding
> > > > > > > pipeline stage!
> > > > > > >
> > > > > > > If you can finalize your packet on output/forwarding, then
> why
> > > you
> > > > > > > can't invoke tx_prepare() on the same stage?
> > > > > > > There seems to be some misunderstanding about what
> tx_prepare()
> > > does -
> > > > > > > in fact it doesn't communicate with HW queue (doesn't update
> TXD
> > > ring,
> > > > > > > etc.), what it does - just make changes in mbuf itself.
> > > > > > > Yes, it reads some fields in SW TX queue struct (max number
> of
> > > TXDs per
> > > > > > > packet, etc.), but AFAIK it is safe
> > > > > > > to call tx_prepare() and tx_burst() from different threads.
> > > > > > > At least on implementations I am aware about.
> > > > > > > Just checked the docs - it seems not stated explicitly
> anywhere,
> > > might
> > > > > > > be that's why it causing such misunderstanding.
> > > > > > >
> > > > > > > >
> > > > > > > > Also, how is rte_eth_tx_prepare() supposed to work for
> cloned
> > > packets
> > > > > > > egressing on different NIC hardware?
> > > > > > >
> > > > > > > If you create a clone of full packet (including L2/L3)
> headers
> > > then
> > > > > > > obviously such construction might not
> > > > > > > work properly with tx_prepare() over two different NICs.
> > > > > > > Though In majority of cases you do clone segments with data,
> > > while at
> > > > > > > least L2 headers are put into different segments.
> > > > > > > One simple approach would be to keep L3 header in that
> separate
> > > segment.
> > > > > > > But yes, there is a problem when you'll need to send exactly
> the
> > > same
> > > > > > > packet over different NICs.
> > > > > > > As I remember, for bonding PMD things don't work quite well
> here
> > > - you
> > > > > > > might have a bond over 2 NICs with
> > > > > > > different tx_prepare() and which one to call might be not
> clear
> > > till
> > > > > > > actual PMD tx_burst() is invoked.
> > > > > > >
> > > > > > > >
> > > > > > > > In theory, it might get even worse if we make this opaque
> > > instead of
> > > > > > > transparent and standardized:
> > > > > > > > One PMD might reset out_ip checksum to 0x0000, and another
> PMD
> > > might
> > > > > > > reset it to 0xFFFF.
> > > > > > >
> > > > > > > >
> > > > > > > > I can only see one solution:
> > > > > > > > We need to standardize on common minimum requirements for
> how
> > > to
> > > > > > > prepare packets for each TX offload.
> > > > > > >
> > > > > > > If we can make each and every vendor to agree here - that
> > > definitely
> > > > > > > will help to simplify things quite a bit.
> > > > > >
> > > > > > An API is more than a function name and parameters.
> > > > > > It also has preconditions and postconditions.
> > > > > >
> > > > > > All major NIC vendors are contributing to DPDK.
> > > > > > It should be possible to reach consensus for reasonable
> minimum
> > > requirements
> > > > > for offloads.
> > > > > > Hardware- and driver-specific exceptions can be documented
> with
> > > the offload
> > > > > flag, or with rte_eth_rx/tx_burst(), like the note to
> > > > > > rte_eth_rx_burst():
> > > > > > "Some drivers using vector instructions require that nb_pkts
> is
> > > divisible by
> > > > > 4 or 8, depending on the driver implementation."
> > > > >
> > > > > If we introduce a rule that everyone supposed to follow and then
> > > straightway
> > > > > allow people to have a 'documented exceptions',
> > > > > for me it means like 'no rule' in practice.
> > > > > A 'documented exceptions' approach might work if you have 5
> > > different PMDs to
> > > > > support, but not when you have 50+.
> > > > > No-one would write an app with possible 10 different exception
> cases
> > > in his
> > > > > head.
> > > > > Again, with such approach we can forget about backward
> > > compatibility.
> > > > > I think we already had this discussion before, my opinion
> remains
> > > the same
> > > > > here -
> > > > > 'documented exceptions' approach is a way to trouble.
> > > >
> > > > The "minimum requirements" should be the lowest common denominator
> of
> > > all NICs.
> > > > Exceptions should be extremely few, for outlier NICs that still
> want
> > > to provide an offload and its driver is unable to live up to the
> > > > minimum requirements.
> > > > Any exception should require techboard approval. If a NIC/driver
> does
> > > not support the "minimum requirements" for an offload
> > > > feature, it is not allowed to claim support for that offload
> feature,
> > > or needs to seek approval for an exception.
> > > >
> > > > As another option for NICs not supporting the minimum requirements
> of
> > > an offload feature, we could introduce offload flags with
> > > > finer granularity. E.g. one offload flag for "gold standard" TX
> > > checksum update (where the packet's checksum field can have any
> > > > value), and another offload flag for "silver standard" TX checksum
> > > update (where the packet's checksum field must have a
> > > > precomputed value).
> > >
> > > Actually yes, I was thinking in the same direction - we need some
> extra
> > > API to allow user to distinguish.
> > > Probably we can do something like that: a new API for the ethdev
> call
> > > that would take as a parameter
> > > TX offloads bitmap and in return specify would it need to modify
> > > contents of packet to support these
> > > offloads or not.
> > > Something like:
> > > int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)
> > >
> > > For the majority of the drivers that satisfy these "minimum
> > > requirements" corresponding devops
> > > entry will be empty and we'll always return 0, otherwise PMD has to
> > > provide a proper devop.
> > > Then again, it would be up to the user, to determine can he pass
> same
> > > packet to 2 different NICs or not.
> > >
> > > I suppose it is similar to what you were talking about?
> >
> > I was thinking something more simple:
> >
> > The NIC exposes its RX and TX offload capabilities to the application
> through the rx/tx_offload_capa and other fields in the
> > rte_eth_dev_info structure returned by rte_eth_dev_info_get().
> >
> > E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag
> set.
> > These capability flags (or enums) are mostly undocumented in the code,
> but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM
> > capability means that the NIC is able to update the IPv4 header
> checksum at egress (on the wire, i.e. without modifying the mbuf or
> > packet data), and that the application must set RTE_MBUF_F_TX_IP_CKSUM
> in the mbufs to utilize this offload.
> > I would define and document what each capability flag/enum exactly
> means, the minimum requirements (as defined by the DPDK
> > community) for the driver to claim support for it, and the
> requirements for an application to use it.
> > For the sake of discussion, let's say that
> RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update
> capability
> > (i.e. no requirements to the checksum field in the packet contents).
> > If some NIC requires the checksum field in the packet contents to have
> a precomputed value, the NIC would not be allowed to claim
> > the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
> > Such a NIC would need to define and document a new capability, e.g.
> RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver
> > standard" TX checksum update capability.
> > In other words: I would encode variations of offload capabilities
> directly in the capabilities flags.
> > Then we don't need additional APIs to help interpret those
> capabilities.
>
> I understood your intention with different flags, yes it should work too
> I think.
> The reason I am not very fond of it - it will require to double
> TX_OFFLOAD flags.
An additional feature flag is only required if a NIC is not conforming to the "minimum requirements" of an offload feature, and the techboard permits introducing a variant of an existing feature.
There should be very few additional feature flags for variants - exceptions only - or the "minimum requirements" are not broad enough to support the majority of NICs.
>
> > This way, the application can probe the NIC capabilities to determine
> what can be offloaded, and how to do it.
> >
> > The application can be designed to:
> > 1. use a common packet processing pipeline, utilizing only the lowest
> common capabilities denominator of all detected NICs, or
> > 2. use a packet processing pipeline, handling packets differently
> according to the capabilities of the involved NICs.
> >
> > NB: There may be other variations than requiring packet contents to be
> modified, and they might be granular.
> > E.g. a NIC might require assistance for TCP/UDP checksum offload, but
> not for IP checksum offload, so a function telling if packet
> > contents requires modification would not suffice.
>
> Why not?
> If user plans to use multiple tx offloads provide a bitmask of all of
> them as an argument.
> Let say for both L3 and L4 cksum offloads it will be something like:
> (RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | RTE_ETH_TX_OFFLOAD_UDP_CKSUM |
> RTE_ETH_TX_OFFLOAD_TCP_CKSUM)
You are partially right; the offload flags can be tested one by one to determine which header fields in the packet contents need to be updated.
I'm assuming the suggested function returns a Boolean; so it doesn't tell the application how it should modify the packet contents, e.g. if the header's checksum field must be zeroed or contain the precomputed checksum of the pseudo header.
Alternatively, if it returns a bitfield with flags for different types of modification, the information could be encoded that way. But then they might as well be returned as offload capability variation flags in the rte_eth_dev_info structure's tx_offload_capa field, as I'm advocating for.
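As a minimal sketch of what such probing could look like from the application side (RTE_ETH_TX_OFFLOAD_IPV4_CKSUM and rte_eth_dev_info_get() exist in DPDK today, while RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED is only a hypothetical name used in this discussion):

#include <rte_ethdev.h>

/* Return 0 for the "gold standard" IPv4 checksum offload, 1 for the
 * hypothetical "assisted" variant, -1 if the offload is not supported. */
static int
ipv4_cksum_offload_mode(uint16_t port_id)
{
        struct rte_eth_dev_info dev_info;

        if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
                return -1;

        if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_IPV4_CKSUM)
                return 0; /* no requirements on the packet's checksum field */
#ifdef RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED /* hypothetical, not in DPDK */
        if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED)
                return 1; /* checksum field must hold a precomputed value */
#endif
        return -1;
}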
>
> > E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the
> rte_eth_dev_info structure doesn't expose information about the max
> > number of segments it can handle.
> >
> > PS: For backwards compatibility, we might define
> RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to
> support the
> > current "minimum requirements", and add
> RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload.
> >
> >
> > >
> > > > For reference, consider RSS, where the feature support flags have
> very
> > > high granularity.
> > > >
> > > > >
> > > > > > You mention the bonding driver, which is a good example.
> > > > > > The rte_eth_tx_burst() documentation has a note about the API
> > > postcondition
> > > > > exception for the bonding driver:
> > > > > > "This function must not modify mbufs (including packets data)
> > > unless the
> > > > > refcnt is 1. An exception is the bonding PMD, [...], mbufs
> > > > > > may be modified."
> > > > >
> > > > > For me, what we've done for bonding tx_prepare/tx_burst() is a
> > > really bad
> > > > > example.
> > > > > Initial agreement and design choice was that tx_burst() should
> not
> > > modify
> > > > > contents of the packets
> > > > > (that actually was one of the reasons why tx_prepare() was
> > > introduced).
> > > > > The only reason I agreed on that exception - because I couldn't
> > > come-up with
> > > > > something less uglier.
> > > > >
> > > > > Actually, these problems with bonding PMD made me to start
> thinking
> > > that
> > > > > current
> > > > > tx_prepare/tx_burst approach might need to be reconsidered
> somehow.
> > > >
> > > > In cases where a preceding call to tx_prepare() is required, how
> is it
> > > worse modifying the packet in tx_burst() than modifying the
> > > > packet in tx_prepare()?
> > > >
> > > > Both cases violate the postcondition that packets are not modified
> at
> > > egress.
> > > >
> > > > >
> > > > > > > Then we can probably have one common tx_prepare() for all
> > > vendors ;)
> > > > > >
> > > > > > Yes, that would be the goal.
> > > > > > More realistically, the ethdev layer could perform the common
> > > checks, and
> > > > > only the non-conforming drivers would have to implement
> > > > > > their specific tweaks.
> > > > >
> > > > > Hmm, but that's what we have right now:
> > > > > - fields in mbuf and packet data that user has to fill correctly
> and
> > > dev
> > > > > specific tx_prepare().
> > > > > How what you suggest will differ then?
> > > >
> > > > You're 100 % right here. We could move more checks into the ethdev
> > > layer, specifically checks related to the "minimum
> > > > requirements".
> > > >
> > > > > And how it will help let say with bonding PMD situation, or with
> TX-
> > > ing of the
> > > > > same packet over 2 different NICs?
> > > >
> > > > The bonding driver is broken.
> > > > It can only be fixed by not violating the egress postcondition in
> > > either tx_burst() or tx_prepare().
> > > > "Minimum requirements" might help doing that.
> > > >
> > > > >
> > > > > > If we don't standardize the meaning of the offload flags, the
> > > application
> > > > > developers cannot trust them!
> > > > > > I'm afraid this is the current situation - application
> developers
> > > either
> > > > > test with specific NIC hardware, or don't use the offload
> features.
> > > > >
> > > > > Well, I have used TX offloads through several projects, it
> worked
> > > quite well.
> > > >
> > > > That is good to hear.
> > > > And I don't oppose to that.
> > > >
> > > > In this discussion, I am worried about the roadmap direction for
> DPDK.
> > > > I oppose to the concept of requiring calling tx_prepare() before
> > > calling tx_burst() when using offload. I think it is conceptually
> wrong,
> > > > and breaks the egress postcondition.
> > > > I propose "minimum requirements" as a better solution.
> > > >
> > > > > Though have to admit, never have to use TX offloads together
> with
> > > our bonding
> > > > > PMD.
> > > > >
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-12 14:44 ` Morten Brørup
2024-04-12 15:17 ` Konstantin Ananyev
@ 2024-04-15 15:07 ` Ferruh Yigit
2024-04-16 7:14 ` Morten Brørup
1 sibling, 1 reply; 30+ messages in thread
From: Ferruh Yigit @ 2024-04-15 15:07 UTC (permalink / raw)
To: Morten Brørup, Konstantin Ananyev, David Marchand, dev
Cc: thomas, stable, Olivier Matz, Jijiang Liu, Andrew Rybchenko,
Kaiwen Deng, qiming.yang, yidingx.zhou, Aman Singh, Yuying Zhang,
Jerin Jacob
On 4/12/2024 3:44 PM, Morten Brørup wrote:
>>>>>>>> Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum
>> offload
>>>>>>>> examples.
>>>>>>>
>>>>>>> I strongly disagree with this change!
>>>>>>>
>>>>>>> It will cause a huge performance degradation for shaping
>> applications:
>>>>>>>
>>>>>>> A packet will be processed and finalized at an output or
>> forwarding
>>>>>> pipeline stage, where some other fields might also be written,
>> so
>>>>>>> zeroing e.g. the out_ip checksum at this stage has low cost
>> (no new
>>>>>> cache misses).
>>>>>>>
>>>>>>> Then, the packet might be queued for QoS or similar.
>>>>>>>
>>>>>>> If rte_eth_tx_prepare() must be called at the egress pipeline
>> stage,
>>>>>> it has to write to the packet and cause a cache miss per packet,
>>>>>>> instead of simply passing on the packet to the NIC hardware.
>>>>>>>
>>>>>>> It must be possible to finalize the packet at the
>> output/forwarding
>>>>>> pipeline stage!
>>>>>>
>>>>>> If you can finalize your packet on output/forwarding, then why
>> you
>>>>>> can't invoke tx_prepare() on the same stage?
>>>>>> There seems to be some misunderstanding about what tx_prepare()
>> does -
>>>>>> in fact it doesn't communicate with HW queue (doesn't update TXD
>> ring,
>>>>>> etc.), what it does - just make changes in mbuf itself.
>>>>>> Yes, it reads some fields in SW TX queue struct (max number of
>> TXDs per
>>>>>> packet, etc.), but AFAIK it is safe
>>>>>> to call tx_prepare() and tx_burst() from different threads.
>>>>>> At least on implementations I am aware about.
>>>>>> Just checked the docs - it seems not stated explicitly anywhere,
>> might
>>>>>> be that's why it causing such misunderstanding.
>>>>>>
>>>>>>>
>>>>>>> Also, how is rte_eth_tx_prepare() supposed to work for cloned
>> packets
>>>>>> egressing on different NIC hardware?
>>>>>>
>>>>>> If you create a clone of full packet (including L2/L3) headers
>> then
>>>>>> obviously such construction might not
>>>>>> work properly with tx_prepare() over two different NICs.
>>>>>> Though In majority of cases you do clone segments with data,
>> while at
>>>>>> least L2 headers are put into different segments.
>>>>>> One simple approach would be to keep L3 header in that separate
>> segment.
>>>>>> But yes, there is a problem when you'll need to send exactly the
>> same
>>>>>> packet over different NICs.
>>>>>> As I remember, for bonding PMD things don't work quite well here
>> - you
>>>>>> might have a bond over 2 NICs with
>>>>>> different tx_prepare() and which one to call might be not clear
>> till
>>>>>> actual PMD tx_burst() is invoked.
>>>>>>
>>>>>>>
>>>>>>> In theory, it might get even worse if we make this opaque
>> instead of
>>>>>> transparent and standardized:
>>>>>>> One PMD might reset out_ip checksum to 0x0000, and another PMD
>> might
>>>>>> reset it to 0xFFFF.
>>>>>>
>>>>>>>
>>>>>>> I can only see one solution:
>>>>>>> We need to standardize on common minimum requirements for how
>> to
>>>>>> prepare packets for each TX offload.
>>>>>>
>>>>>> If we can make each and every vendor to agree here - that
>> definitely
>>>>>> will help to simplify things quite a bit.
>>>>>
>>>>> An API is more than a function name and parameters.
>>>>> It also has preconditions and postconditions.
>>>>>
>>>>> All major NIC vendors are contributing to DPDK.
>>>>> It should be possible to reach consensus for reasonable minimum
>> requirements
>>>> for offloads.
>>>>> Hardware- and driver-specific exceptions can be documented with
>> the offload
>>>> flag, or with rte_eth_rx/tx_burst(), like the note to
>>>>> rte_eth_rx_burst():
>>>>> "Some drivers using vector instructions require that nb_pkts is
>> divisible by
>>>> 4 or 8, depending on the driver implementation."
>>>>
>>>> If we introduce a rule that everyone supposed to follow and then
>> straightway
>>>> allow people to have a 'documented exceptions',
>>>> for me it means like 'no rule' in practice.
>>>> A 'documented exceptions' approach might work if you have 5
>> different PMDs to
>>>> support, but not when you have 50+.
>>>> No-one would write an app with possible 10 different exception cases
>> in his
>>>> head.
>>>> Again, with such approach we can forget about backward
>> compatibility.
>>>> I think we already had this discussion before, my opinion remains
>> the same
>>>> here -
>>>> 'documented exceptions' approach is a way to trouble.
>>>
>>> The "minimum requirements" should be the lowest common denominator of
>> all NICs.
>>> Exceptions should be extremely few, for outlier NICs that still want
>> to provide an offload and its driver is unable to live up to the
>>> minimum requirements.
>>> Any exception should require techboard approval. If a NIC/driver does
>> not support the "minimum requirements" for an offload
>>> feature, it is not allowed to claim support for that offload feature,
>> or needs to seek approval for an exception.
>>>
>>> As another option for NICs not supporting the minimum requirements of
>> an offload feature, we could introduce offload flags with
>>> finer granularity. E.g. one offload flag for "gold standard" TX
>> checksum update (where the packet's checksum field can have any
>>> value), and another offload flag for "silver standard" TX checksum
>> update (where the packet's checksum field must have a
>>> precomputed value).
>>
>> Actually yes, I was thinking in the same direction - we need some extra
>> API to allow user to distinguish.
>> Probably we can do something like that: a new API for the ethdev call
>> that would take as a parameter
>> TX offloads bitmap and in return specify would it need to modify
>> contents of packet to support these
>> offloads or not.
>> Something like:
>> int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)
>>
>> For the majority of the drivers that satisfy these "minimum
>> requirements" corresponding devops
>> entry will be empty and we'll always return 0, otherwise PMD has to
>> provide a proper devop.
>> Then again, it would be up to the user, to determine can he pass same
>> packet to 2 different NICs or not.
>>
>> I suppose it is similar to what you were talking about?
>
> I was thinking something more simple:
>
> The NIC exposes its RX and TX offload capabilities to the application through the rx/tx_offload_capa and other fields in the rte_eth_dev_info structure returned by rte_eth_dev_info_get().
>
> E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag set.
> These capability flags (or enums) are mostly undocumented in the code, but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability means that the NIC is able to update the IPv4 header checksum at egress (on the wire, i.e. without modifying the mbuf or packet data), and that the application must set RTE_MBUF_F_TX_IP_CKSUM in the mbufs to utilize this offload.
> I would define and document what each capability flag/enum exactly means, the minimum requirements (as defined by the DPDK community) for the driver to claim support for it, and the requirements for an application to use it.
>
+1 to improve documentation, and to clarify each offload where needed.
Another gap is in testing: whenever a device/driver claims an offload
capability, we don't have a test suite to confirm and verify this claim.
> For the sake of discussion, let's say that RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update capability (i.e. no requirements to the checksum field in the packet contents).
> If some NIC requires the checksum field in the packet contents to have a precomputed value, the NIC would not be allowed to claim the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
> Such a NIC would need to define and document a new capability, e.g. RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver standard" TX checksum update capability.
>
> In other words: I would encode variations of offload capabilities directly in the capabilities flags.
> Then we don't need additional APIs to help interpret those capabilities.
>
> This way, the application can probe the NIC capabilities to determine what can be offloaded, and how to do it.
>
> The application can be designed to:
> 1. use a common packet processing pipeline, utilizing only the lowest common capabilities denominator of all detected NICs, or
> 2. use a packet processing pipeline, handling packets differently according to the capabilities of the involved NICs.
>
Offload capabilities are already provided to enable applications, as you
mentioned above.
I agree that '_ASSISTED' capability flags give the application more
details to work with, but my concern is that they complicate offloading
further.
The number of "assisted" offloads is small; mainly it is checksum.
The current approach is simpler: a device that requires a precondition
implements it in 'tx_prepare', and an application using these offloads
calls it before 'tx_burst'. Devices/drivers that don't require it don't
provide 'tx_prepare', so there is no impact on the application.
Would it work to have something in between, another capability
indicating whether 'tx_prepare' needs to be called for this device or not?
It would be possible to make this a 32-bit variable holding, bit-wise,
the offloads that require assistance, but I prefer a simple
'tx_prepare required' flag; this may help the application decide
whether to use offloads on that device and whether to call 'tx_prepare'.
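A minimal sketch of how an application could use such a flag (RTE_ETH_TX_OFFLOAD_PREPARE_REQUIRED is only a placeholder name for the proposed capability bit, not an existing DPDK flag; rte_eth_tx_prepare() and rte_eth_tx_burst() are the existing APIs):

#include <stdbool.h>
#include <rte_ethdev.h>

static uint16_t
send_burst(uint16_t port_id, uint16_t queue_id, bool prepare_required,
           struct rte_mbuf **pkts, uint16_t nb_pkts)
{
        if (prepare_required) {
                /* Let the driver apply its specific preconditions to the mbufs. */
                uint16_t nb_prep = rte_eth_tx_prepare(port_id, queue_id,
                                                      pkts, nb_pkts);

                if (nb_prep != nb_pkts)
                        nb_pkts = nb_prep; /* rte_errno describes the first rejected packet */
        }
        return rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);
}

Here 'prepare_required' would be derived once at init time, e.g. from
dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_PREPARE_REQUIRED.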
> NB: There may be other variations than requiring packet contents to be modified, and they might be granular.
> E.g. a NIC might require assistance for TCP/UDP checksum offload, but not for IP checksum offload, so a function telling if packet contents requires modification would not suffice.
> E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the rte_eth_dev_info structure doesn't expose information about the max number of segments it can handle.
>
Another good point, reporting the max number of segments that can be handled, +1
> PS: For backwards compatibility, we might define RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to support the current "minimum requirements", and add RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload.
>
>
>>
>>> For reference, consider RSS, where the feature support flags have very
>> high granularity.
>>>
>>>>
>>>>> You mention the bonding driver, which is a good example.
>>>>> The rte_eth_tx_burst() documentation has a note about the API
>> postcondition
>>>> exception for the bonding driver:
>>>>> "This function must not modify mbufs (including packets data)
>> unless the
>>>> refcnt is 1. An exception is the bonding PMD, [...], mbufs
>>>>> may be modified."
>>>>
>>>> For me, what we've done for bonding tx_prepare/tx_burst() is a
>> really bad
>>>> example.
>>>> Initial agreement and design choice was that tx_burst() should not
>> modify
>>>> contents of the packets
>>>> (that actually was one of the reasons why tx_prepare() was
>> introduced).
>>>> The only reason I agreed on that exception - because I couldn't
>> come-up with
>>>> something less uglier.
>>>>
>>>> Actually, these problems with bonding PMD made me to start thinking
>> that
>>>> current
>>>> tx_prepare/tx_burst approach might need to be reconsidered somehow.
>>>
>>> In cases where a preceding call to tx_prepare() is required, how is it
>> worse modifying the packet in tx_burst() than modifying the
>>> packet in tx_prepare()?
>>>
>>> Both cases violate the postcondition that packets are not modified at
>> egress.
>>>
>>>>
>>>>>> Then we can probably have one common tx_prepare() for all
>> vendors ;)
>>>>>
>>>>> Yes, that would be the goal.
>>>>> More realistically, the ethdev layer could perform the common
>> checks, and
>>>> only the non-conforming drivers would have to implement
>>>>> their specific tweaks.
>>>>
>>>> Hmm, but that's what we have right now:
>>>> - fields in mbuf and packet data that user has to fill correctly and
>> dev
>>>> specific tx_prepare().
>>>> How what you suggest will differ then?
>>>
>>> You're 100 % right here. We could move more checks into the ethdev
>> layer, specifically checks related to the "minimum
>>> requirements".
>>>
>>>> And how it will help let say with bonding PMD situation, or with TX-
>> ing of the
>>>> same packet over 2 different NICs?
>>>
>>> The bonding driver is broken.
>>> It can only be fixed by not violating the egress postcondition in
>> either tx_burst() or tx_prepare().
>>> "Minimum requirements" might help doing that.
>>>
>>>>
>>>>> If we don't standardize the meaning of the offload flags, the
>> application
>>>> developers cannot trust them!
>>>>> I'm afraid this is the current situation - application developers
>> either
>>>> test with specific NIC hardware, or don't use the offload features.
>>>>
>>>> Well, I have used TX offloads through several projects, it worked
>> quite well.
>>>
>>> That is good to hear.
>>> And I don't oppose to that.
>>>
>>> In this discussion, I am worried about the roadmap direction for DPDK.
>>> I oppose to the concept of requiring calling tx_prepare() before
>> calling tx_burst() when using offload. I think it is conceptually wrong,
>>> and breaks the egress postcondition.
>>> I propose "minimum requirements" as a better solution.
>>>
>>>> Though have to admit, never have to use TX offloads together with
>> our bonding
>>>> PMD.
>>>>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-15 15:07 ` Ferruh Yigit
@ 2024-04-16 7:14 ` Morten Brørup
2024-04-16 9:26 ` Konstantin Ananyev
0 siblings, 1 reply; 30+ messages in thread
From: Morten Brørup @ 2024-04-16 7:14 UTC (permalink / raw)
To: Ferruh Yigit, Konstantin Ananyev, David Marchand, dev
Cc: thomas, stable, Olivier Matz, Jijiang Liu, Andrew Rybchenko,
Kaiwen Deng, qiming.yang, yidingx.zhou, Aman Singh, Yuying Zhang,
Jerin Jacob
> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> Sent: Monday, 15 April 2024 17.08
>
> On 4/12/2024 3:44 PM, Morten Brørup wrote:
> >>>>>>>> Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum
> >> offload
> >>>>>>>> examples.
> >>>>>>>
> >>>>>>> I strongly disagree with this change!
> >>>>>>>
> >>>>>>> It will cause a huge performance degradation for shaping
> >> applications:
> >>>>>>>
> >>>>>>> A packet will be processed and finalized at an output or
> >> forwarding
> >>>>>> pipeline stage, where some other fields might also be written,
> >> so
> >>>>>>> zeroing e.g. the out_ip checksum at this stage has low cost
> >> (no new
> >>>>>> cache misses).
> >>>>>>>
> >>>>>>> Then, the packet might be queued for QoS or similar.
> >>>>>>>
> >>>>>>> If rte_eth_tx_prepare() must be called at the egress pipeline
> >> stage,
> >>>>>> it has to write to the packet and cause a cache miss per packet,
> >>>>>>> instead of simply passing on the packet to the NIC hardware.
> >>>>>>>
> >>>>>>> It must be possible to finalize the packet at the
> >> output/forwarding
> >>>>>> pipeline stage!
> >>>>>>
> >>>>>> If you can finalize your packet on output/forwarding, then why
> >> you
> >>>>>> can't invoke tx_prepare() on the same stage?
> >>>>>> There seems to be some misunderstanding about what tx_prepare()
> >> does -
> >>>>>> in fact it doesn't communicate with HW queue (doesn't update TXD
> >> ring,
> >>>>>> etc.), what it does - just make changes in mbuf itself.
> >>>>>> Yes, it reads some fields in SW TX queue struct (max number of
> >> TXDs per
> >>>>>> packet, etc.), but AFAIK it is safe
> >>>>>> to call tx_prepare() and tx_burst() from different threads.
> >>>>>> At least on implementations I am aware about.
> >>>>>> Just checked the docs - it seems not stated explicitly anywhere,
> >> might
> >>>>>> be that's why it causing such misunderstanding.
> >>>>>>
> >>>>>>>
> >>>>>>> Also, how is rte_eth_tx_prepare() supposed to work for cloned
> >> packets
> >>>>>> egressing on different NIC hardware?
> >>>>>>
> >>>>>> If you create a clone of full packet (including L2/L3) headers
> >> then
> >>>>>> obviously such construction might not
> >>>>>> work properly with tx_prepare() over two different NICs.
> >>>>>> Though In majority of cases you do clone segments with data,
> >> while at
> >>>>>> least L2 headers are put into different segments.
> >>>>>> One simple approach would be to keep L3 header in that separate
> >> segment.
> >>>>>> But yes, there is a problem when you'll need to send exactly the
> >> same
> >>>>>> packet over different NICs.
> >>>>>> As I remember, for bonding PMD things don't work quite well here
> >> - you
> >>>>>> might have a bond over 2 NICs with
> >>>>>> different tx_prepare() and which one to call might be not clear
> >> till
> >>>>>> actual PMD tx_burst() is invoked.
> >>>>>>
> >>>>>>>
> >>>>>>> In theory, it might get even worse if we make this opaque
> >> instead of
> >>>>>> transparent and standardized:
> >>>>>>> One PMD might reset out_ip checksum to 0x0000, and another PMD
> >> might
> >>>>>> reset it to 0xFFFF.
> >>>>>>
> >>>>>>>
> >>>>>>> I can only see one solution:
> >>>>>>> We need to standardize on common minimum requirements for how
> >> to
> >>>>>> prepare packets for each TX offload.
> >>>>>>
> >>>>>> If we can make each and every vendor to agree here - that
> >> definitely
> >>>>>> will help to simplify things quite a bit.
> >>>>>
> >>>>> An API is more than a function name and parameters.
> >>>>> It also has preconditions and postconditions.
> >>>>>
> >>>>> All major NIC vendors are contributing to DPDK.
> >>>>> It should be possible to reach consensus for reasonable minimum
> >> requirements
> >>>> for offloads.
> >>>>> Hardware- and driver-specific exceptions can be documented with
> >> the offload
> >>>> flag, or with rte_eth_rx/tx_burst(), like the note to
> >>>>> rte_eth_rx_burst():
> >>>>> "Some drivers using vector instructions require that nb_pkts is
> >> divisible by
> >>>> 4 or 8, depending on the driver implementation."
> >>>>
> >>>> If we introduce a rule that everyone supposed to follow and then
> >> straightway
> >>>> allow people to have a 'documented exceptions',
> >>>> for me it means like 'no rule' in practice.
> >>>> A 'documented exceptions' approach might work if you have 5
> >> different PMDs to
> >>>> support, but not when you have 50+.
> >>>> No-one would write an app with possible 10 different exception
> cases
> >> in his
> >>>> head.
> >>>> Again, with such approach we can forget about backward
> >> compatibility.
> >>>> I think we already had this discussion before, my opinion remains
> >> the same
> >>>> here -
> >>>> 'documented exceptions' approach is a way to trouble.
> >>>
> >>> The "minimum requirements" should be the lowest common denominator
> of
> >> all NICs.
> >>> Exceptions should be extremely few, for outlier NICs that still want
> >> to provide an offload and its driver is unable to live up to the
> >>> minimum requirements.
> >>> Any exception should require techboard approval. If a NIC/driver
> does
> >> not support the "minimum requirements" for an offload
> >>> feature, it is not allowed to claim support for that offload
> feature,
> >> or needs to seek approval for an exception.
> >>>
> >>> As another option for NICs not supporting the minimum requirements
> of
> >> an offload feature, we could introduce offload flags with
> >>> finer granularity. E.g. one offload flag for "gold standard" TX
> >> checksum update (where the packet's checksum field can have any
> >>> value), and another offload flag for "silver standard" TX checksum
> >> update (where the packet's checksum field must have a
> >>> precomputed value).
> >>
> >> Actually yes, I was thinking in the same direction - we need some
> extra
> >> API to allow user to distinguish.
> >> Probably we can do something like that: a new API for the ethdev call
> >> that would take as a parameter
> >> TX offloads bitmap and in return specify would it need to modify
> >> contents of packet to support these
> >> offloads or not.
> >> Something like:
> >> int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)
> >>
> >> For the majority of the drivers that satisfy these "minimum
> >> requirements" corresponding devops
> >> entry will be empty and we'll always return 0, otherwise PMD has to
> >> provide a proper devop.
> >> Then again, it would be up to the user, to determine can he pass same
> >> packet to 2 different NICs or not.
> >>
> >> I suppose it is similar to what you were talking about?
> >
> > I was thinking something more simple:
> >
> > The NIC exposes its RX and TX offload capabilities to the application
> through the rx/tx_offload_capa and other fields in the rte_eth_dev_info
> structure returned by rte_eth_dev_info_get().
> >
> > E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag
> set.
> > These capability flags (or enums) are mostly undocumented in the code,
> but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability means that
> the NIC is able to update the IPv4 header checksum at egress (on the
> wire, i.e. without modifying the mbuf or packet data), and that the
> application must set RTE_MBUF_F_TX_IP_CKSUM in the mbufs to utilize this
> offload.
> > I would define and document what each capability flag/enum exactly
> means, the minimum requirements (as defined by the DPDK community) for
> the driver to claim support for it, and the requirements for an
> application to use it.
> >
>
> +1 to improve documentation, and to clarify each offload where needed.
>
> Another gap is in testing: whenever a device/driver claims an offload
> capability, we don't have a test suite to confirm and verify this claim.
Yep, conformance testing is lacking big time in our CI.
Adding the ts-factory tests to the CI was a big step in this direction.
But for now, we are mostly relying on vendor's internal testing.
>
>
> > For the sake of discussion, let's say that
> RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update
> capability (i.e. no requirements to the checksum field in the packet
> contents).
> > If some NIC requires the checksum field in the packet contents to have
> a precomputed value, the NIC would not be allowed to claim the
> RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
> > Such a NIC would need to define and document a new capability, e.g.
> RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver standard" TX
> checksum update capability.
> >
> > In other words: I would encode variations of offload capabilities
> directly in the capabilities flags.
> > Then we don't need additional APIs to help interpret those
> capabilities.
> >
> > This way, the application can probe the NIC capabilities to determine
> what can be offloaded, and how to do it.
> >
> > The application can be designed to:
> > 1. use a common packet processing pipeline, utilizing only the lowest
> common capabilities denominator of all detected NICs, or
> > 2. use a packet processing pipeline, handling packets differently
> according to the capabilities of the involved NICs.
> >
>
> Offload capabilities are already provided to enable applications, as you
> mentioned above.
>
> I agree that '_ASSISTED' capability flags give the application more
> details to work with, but my concern is that they complicate offloading
> further.
>
> The number of "assisted" offloads is small; mainly it is checksum.
> The current approach is simpler: a device that requires a precondition
> implements it in 'tx_prepare', and an application using these offloads
> calls it before 'tx_burst'. Devices/drivers that don't require it don't
> provide 'tx_prepare', so there is no impact on the application.
>
> Would it work to have something in between, another capability
> indicating whether 'tx_prepare' needs to be called for this device or not?
> It would be possible to make this a 32-bit variable holding, bit-wise,
> the offloads that require assistance, but I prefer a simple
> 'tx_prepare required' flag; this may help the application decide
> whether to use offloads on that device and whether to call 'tx_prepare'.
Consider an IP Multicast packet to be transmitted on many ports.
With my suggestion, the packet can be prepared once - i.e. the output/forwarding stage sets the IP header checksum as required by the (lowest common denominator) offload capabilities flag - before being cloned for tx() on each port.
With tx_prepare(), the packet needs to be copied (deep copy!) for each port, and then tx_prepare() needs to be called for each port before tx().
The opaque tx_prepare() concept violates the principle of not modifying the packet at egress.
DPDK follows a principle of not using opaque types, so application developers can optimize accordingly. The opaque behavior of tx_prepare() violates this principle, and I consider it a horrible hack!
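As a sketch of the difference, assuming the "minimum requirements" model where no port modifies the packet at egress, the prepared packet can be shared via refcounted clones (rte_pktmbuf_clone(), rte_pktmbuf_free() and rte_eth_tx_burst() are existing APIs; the single queue 0 is just for brevity):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void
mcast_tx(struct rte_mbuf *pkt, const uint16_t *ports, uint16_t nb_ports,
         struct rte_mempool *clone_pool)
{
        uint16_t i;

        for (i = 0; i < nb_ports; i++) {
                /* Indirect clone: packet data is shared, only refcnt is bumped. */
                struct rte_mbuf *m = rte_pktmbuf_clone(pkt, clone_pool);

                if (m == NULL)
                        continue;
                if (rte_eth_tx_burst(ports[i], 0, &m, 1) == 0)
                        rte_pktmbuf_free(m);
        }
        rte_pktmbuf_free(pkt); /* drop our own reference */
}

With the opaque tx_prepare() model, the clone above would have to become a deep copy per port, followed by a per-port rte_eth_tx_prepare() call, since each driver may rewrite the headers.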
>
> > NB: There may be other variations than requiring packet contents to be
> modified, and they might be granular.
> > E.g. a NIC might require assistance for TCP/UDP checksum offload, but
> not for IP checksum offload, so a function telling if packet contents
> requires modification would not suffice.
> > E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the
> rte_eth_dev_info structure doesn't expose information about the max
> number of segments it can handle.
> >
>
> Another good point, reporting the max number of segments that can be handled, +1
>
>
> > PS: For backwards compatibility, we might define
> RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to
> support the current "minimum requirements", and add
> RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload.
> >
> >
> >>
> >>> For reference, consider RSS, where the feature support flags have
> very
> >> high granularity.
> >>>
> >>>>
> >>>>> You mention the bonding driver, which is a good example.
> >>>>> The rte_eth_tx_burst() documentation has a note about the API
> >> postcondition
> >>>> exception for the bonding driver:
> >>>>> "This function must not modify mbufs (including packets data)
> >> unless the
> >>>> refcnt is 1. An exception is the bonding PMD, [...], mbufs
> >>>>> may be modified."
> >>>>
> >>>> For me, what we've done for bonding tx_prepare/tx_burst() is a
> >> really bad
> >>>> example.
> >>>> Initial agreement and design choice was that tx_burst() should not
> >> modify
> >>>> contents of the packets
> >>>> (that actually was one of the reasons why tx_prepare() was
> >> introduced).
> >>>> The only reason I agreed on that exception - because I couldn't
> >> come-up with
> >>>> something less uglier.
> >>>>
> >>>> Actually, these problems with bonding PMD made me to start thinking
> >> that
> >>>> current
> >>>> tx_prepare/tx_burst approach might need to be reconsidered somehow.
> >>>
> >>> In cases where a preceding call to tx_prepare() is required, how is
> it
> >> worse modifying the packet in tx_burst() than modifying the
> >>> packet in tx_prepare()?
> >>>
> >>> Both cases violate the postcondition that packets are not modified
> at
> >> egress.
> >>>
> >>>>
> >>>>>> Then we can probably have one common tx_prepare() for all
> >> vendors ;)
> >>>>>
> >>>>> Yes, that would be the goal.
> >>>>> More realistically, the ethdev layer could perform the common
> >> checks, and
> >>>> only the non-conforming drivers would have to implement
> >>>>> their specific tweaks.
> >>>>
> >>>> Hmm, but that's what we have right now:
> >>>> - fields in mbuf and packet data that user has to fill correctly
> and
> >> dev
> >>>> specific tx_prepare().
> >>>> How what you suggest will differ then?
> >>>
> >>> You're 100 % right here. We could move more checks into the ethdev
> >> layer, specifically checks related to the "minimum
> >>> requirements".
> >>>
> >>>> And how it will help let say with bonding PMD situation, or with
> TX-
> >> ing of the
> >>>> same packet over 2 different NICs?
> >>>
> >>> The bonding driver is broken.
> >>> It can only be fixed by not violating the egress postcondition in
> >> either tx_burst() or tx_prepare().
> >>> "Minimum requirements" might help doing that.
> >>>
> >>>>
> >>>>> If we don't standardize the meaning of the offload flags, the
> >> application
> >>>> developers cannot trust them!
> >>>>> I'm afraid this is the current situation - application developers
> >> either
> >>>> test with specific NIC hardware, or don't use the offload features.
> >>>>
> >>>> Well, I have used TX offloads through several projects, it worked
> >> quite well.
> >>>
> >>> That is good to hear.
> >>> And I don't oppose to that.
> >>>
> >>> In this discussion, I am worried about the roadmap direction for
> DPDK.
> >>> I oppose to the concept of requiring calling tx_prepare() before
> >> calling tx_burst() when using offload. I think it is conceptually
> wrong,
> >>> and breaks the egress postcondition.
> >>> I propose "minimum requirements" as a better solution.
> >>>
> >>>> Though have to admit, never have to use TX offloads together with
> >> our bonding
> >>>> PMD.
> >>>>
> >
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-12 15:54 ` Morten Brørup
@ 2024-04-16 9:16 ` Konstantin Ananyev
2024-04-16 11:36 ` Konstantin Ananyev
0 siblings, 1 reply; 30+ messages in thread
From: Konstantin Ananyev @ 2024-04-16 9:16 UTC (permalink / raw)
To: Morten Brørup, David Marchand, dev
Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu,
Andrew Rybchenko, Ferruh Yigit, Kaiwen Deng, qiming.yang,
yidingx.zhou, Aman Singh, Yuying Zhang, Thomas Monjalon,
Jerin Jacob
> > > > > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx
> > checksum
> > > > offload
> > > > > > > > > > examples.
> > > > > > > > >
> > > > > > > > > I strongly disagree with this change!
> > > > > > > > >
> > > > > > > > > It will cause a huge performance degradation for shaping
> > > > applications:
> > > > > > > > >
> > > > > > > > > A packet will be processed and finalized at an output or
> > > > forwarding
> > > > > > > > pipeline stage, where some other fields might also be
> > written,
> > > > so
> > > > > > > > > zeroing e.g. the out_ip checksum at this stage has low
> > cost
> > > > (no new
> > > > > > > > cache misses).
> > > > > > > > >
> > > > > > > > > Then, the packet might be queued for QoS or similar.
> > > > > > > > >
> > > > > > > > > If rte_eth_tx_prepare() must be called at the egress
> > pipeline
> > > > stage,
> > > > > > > > it has to write to the packet and cause a cache miss per
> > packet,
> > > > > > > > > instead of simply passing on the packet to the NIC
> > hardware.
> > > > > > > > >
> > > > > > > > > It must be possible to finalize the packet at the
> > > > output/forwarding
> > > > > > > > pipeline stage!
> > > > > > > >
> > > > > > > > If you can finalize your packet on output/forwarding, then
> > why
> > > > you
> > > > > > > > can't invoke tx_prepare() on the same stage?
> > > > > > > > There seems to be some misunderstanding about what
> > tx_prepare()
> > > > does -
> > > > > > > > in fact it doesn't communicate with HW queue (doesn't update
> > TXD
> > > > ring,
> > > > > > > > etc.), what it does - just make changes in mbuf itself.
> > > > > > > > Yes, it reads some fields in SW TX queue struct (max number
> > of
> > > > TXDs per
> > > > > > > > packet, etc.), but AFAIK it is safe
> > > > > > > > to call tx_prepare() and tx_burst() from different threads.
> > > > > > > > At least on implementations I am aware about.
> > > > > > > > Just checked the docs - it seems not stated explicitly
> > anywhere,
> > > > might
> > > > > > > > be that's why it causing such misunderstanding.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Also, how is rte_eth_tx_prepare() supposed to work for
> > cloned
> > > > packets
> > > > > > > > egressing on different NIC hardware?
> > > > > > > >
> > > > > > > > If you create a clone of full packet (including L2/L3)
> > headers
> > > > then
> > > > > > > > obviously such construction might not
> > > > > > > > work properly with tx_prepare() over two different NICs.
> > > > > > > > Though In majority of cases you do clone segments with data,
> > > > while at
> > > > > > > > least L2 headers are put into different segments.
> > > > > > > > One simple approach would be to keep L3 header in that
> > separate
> > > > segment.
> > > > > > > > But yes, there is a problem when you'll need to send exactly
> > the
> > > > same
> > > > > > > > packet over different NICs.
> > > > > > > > As I remember, for bonding PMD things don't work quite well
> > here
> > > > - you
> > > > > > > > might have a bond over 2 NICs with
> > > > > > > > different tx_prepare() and which one to call might be not
> > clear
> > > > till
> > > > > > > > actual PMD tx_burst() is invoked.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > In theory, it might get even worse if we make this opaque
> > > > instead of
> > > > > > > > transparent and standardized:
> > > > > > > > > One PMD might reset out_ip checksum to 0x0000, and another
> > PMD
> > > > might
> > > > > > > > reset it to 0xFFFF.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > I can only see one solution:
> > > > > > > > > We need to standardize on common minimum requirements for
> > how
> > > > to
> > > > > > > > prepare packets for each TX offload.
> > > > > > > >
> > > > > > > > If we can make each and every vendor to agree here - that
> > > > definitely
> > > > > > > > will help to simplify things quite a bit.
> > > > > > >
> > > > > > > An API is more than a function name and parameters.
> > > > > > > It also has preconditions and postconditions.
> > > > > > >
> > > > > > > All major NIC vendors are contributing to DPDK.
> > > > > > > It should be possible to reach consensus for reasonable
> > minimum
> > > > requirements
> > > > > > for offloads.
> > > > > > > Hardware- and driver-specific exceptions can be documented
> > with
> > > > the offload
> > > > > > flag, or with rte_eth_rx/tx_burst(), like the note to
> > > > > > > rte_eth_rx_burst():
> > > > > > > "Some drivers using vector instructions require that nb_pkts
> > is
> > > > divisible by
> > > > > > 4 or 8, depending on the driver implementation."
> > > > > >
> > > > > > If we introduce a rule that everyone supposed to follow and then
> > > > straightway
> > > > > > allow people to have a 'documented exceptions',
> > > > > > for me it means like 'no rule' in practice.
> > > > > > A 'documented exceptions' approach might work if you have 5
> > > > different PMDs to
> > > > > > support, but not when you have 50+.
> > > > > > No-one would write an app with possible 10 different exception
> > cases
> > > > in his
> > > > > > head.
> > > > > > Again, with such approach we can forget about backward
> > > > compatibility.
> > > > > > I think we already had this discussion before, my opinion
> > remains
> > > > the same
> > > > > > here -
> > > > > > 'documented exceptions' approach is a way to trouble.
> > > > >
> > > > > The "minimum requirements" should be the lowest common denominator
> > of
> > > > all NICs.
> > > > > Exceptions should be extremely few, for outlier NICs that still
> > want
> > > > to provide an offload and its driver is unable to live up to the
> > > > > minimum requirements.
> > > > > Any exception should require techboard approval. If a NIC/driver
> > does
> > > > not support the "minimum requirements" for an offload
> > > > > feature, it is not allowed to claim support for that offload
> > feature,
> > > > or needs to seek approval for an exception.
> > > > >
> > > > > As another option for NICs not supporting the minimum requirements
> > of
> > > > an offload feature, we could introduce offload flags with
> > > > > finer granularity. E.g. one offload flag for "gold standard" TX
> > > > checksum update (where the packet's checksum field can have any
> > > > > value), and another offload flag for "silver standard" TX checksum
> > > > update (where the packet's checksum field must have a
> > > > > precomputed value).
> > > >
> > > > Actually yes, I was thinking in the same direction - we need some
> > extra
> > > > API to allow user to distinguish.
> > > > Probably we can do something like that: a new API for the ethdev
> > call
> > > > that would take as a parameter
> > > > TX offloads bitmap and in return specify would it need to modify
> > > > contents of packet to support these
> > > > offloads or not.
> > > > Something like:
> > > > int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)
> > > >
> > > > For the majority of the drivers that satisfy these "minimum
> > > > requirements" corresponding devops
> > > > entry will be empty and we'll always return 0, otherwise PMD has to
> > > > provide a proper devop.
> > > > Then again, it would be up to the user, to determine can he pass
> > same
> > > > packet to 2 different NICs or not.
> > > >
> > > > I suppose it is similar to what you were talking about?
> > >
> > > I was thinking something more simple:
> > >
> > > The NIC exposes its RX and TX offload capabilities to the application
> > through the rx/tx_offload_capa and other fields in the
> > > rte_eth_dev_info structure returned by rte_eth_dev_info_get().
> > >
> > > E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag
> > set.
> > > These capability flags (or enums) are mostly undocumented in the code,
> > but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM
> > > capability means that the NIC is able to update the IPv4 header
> > checksum at egress (on the wire, i.e. without modifying the mbuf or
> > > packet data), and that the application must set RTE_MBUF_F_TX_IP_CKSUM
> > in the mbufs to utilize this offload.
> > > I would define and document what each capability flag/enum exactly
> > means, the minimum requirements (as defined by the DPDK
> > > community) for the driver to claim support for it, and the
> > requirements for an application to use it.
> > > For the sake of discussion, let's say that
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update
> > capability
> > > (i.e. no requirements to the checksum field in the packet contents).
> > > If some NIC requires the checksum field in the packet contents to have
> > a precomputed value, the NIC would not be allowed to claim
> > > the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
> > > Such a NIC would need to define and document a new capability, e.g.
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver
> > > standard" TX checksum update capability.
> > > In other words: I would encode variations of offload capabilities
> > directly in the capabilities flags.
> > > Then we don't need additional APIs to help interpret those
> > capabilities.
> >
> > I understood your intention with different flags, yes it should work too
> > I think.
> > The reason I am not very fond of it - it will require to double
> > TX_OFFLOAD flags.
>
> An additional feature flag is only required if a NIC is not conforming to the "minimum requirements" of an offload feature, and the
> techboard permits introducing a variant of an existing feature.
> There should be very few additional feature flags for variants - exceptions only - or the "minimum requirements" are not broad
> enough to support the majority of NICs.
Ok, so you suggest grouping all existing requirements, plus whatever the current tx_prepare() implementations do, into the "minimum requirements"?
So with the current drivers in place we wouldn't need these new flags, but we would reserve the possibility of adding them.
That might work, if there are no contradictory requirements among the current PMDs, and if the maintainers of PMDs with
fewer requirements agree to this 'extra' stuff.
> >
> > > This way, the application can probe the NIC capabilities to determine
> > what can be offloaded, and how to do it.
> > >
> > > The application can be designed to:
> > > 1. use a common packet processing pipeline, utilizing only the lowest
> > common capabilities denominator of all detected NICs, or
> > > 2. use a packet processing pipeline, handling packets differently
> > according to the capabilities of the involved NICs.
> > >
> > > NB: There may be other variations than requiring packet contents to be
> > modified, and they might be granular.
> > > E.g. a NIC might require assistance for TCP/UDP checksum offload, but
> > not for IP checksum offload, so a function telling if packet
> > > contents requires modification would not suffice.
> >
> > Why not?
> > If user plans to use multiple tx offloads provide a bitmask of all of
> > them as an argument.
> > Let say for both L3 and L4 cksum offloads it will be something like:
> > (RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | RTE_ETH_TX_OFFLOAD_UDP_CKSUM |
> > RTE_ETH_TX_OFFLOAD_TCP_CKSUM)
>
> You are partially right; the offload flags can be tested one by one to determine which header fields in the packet contents need to be
> updated.
Why one by one?
I think the user would be able to test all the TX offloads he needs at once.
What the user needs to know is: can he expect that, for the TX offloads he selected, the PMD tx_burst() will not modify the packet and its metadata?
If the answer is 'yes', then he can safely TX the same mbuf over different PMDs simultaneously.
If the answer is 'no', then he either has to avoid the TX offloads that cause modifications, or be prepared to overcome that situation:
copy the packet, or perhaps copy only the inner/outer L2/L3/L4 headers into a separate segment, etc.
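A sketch of how the proposed call could be used (rte_ethdev_tx_offload_pkt_mod_required() is only the API suggested earlier in this thread, not an existing DPDK function, and tx_shared()/tx_with_private_headers() are hypothetical application helpers):

const uint64_t offloads = RTE_ETH_TX_OFFLOAD_IPV4_CKSUM |
                          RTE_ETH_TX_OFFLOAD_UDP_CKSUM |
                          RTE_ETH_TX_OFFLOAD_TCP_CKSUM;

if (rte_ethdev_tx_offload_pkt_mod_required(offloads) == 0) {
        /* tx_burst() will not touch the packet or its metadata for these
         * offloads: the same mbuf can be passed to several NICs at once. */
        tx_shared(pkt);
} else {
        /* The PMD may rewrite headers: give each NIC its own copy, or at
         * least private L2/L3/L4 header segments. */
        tx_with_private_headers(pkt);
}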
> I'm assuming the suggested function returns a Boolean; so it doesn't tell the application how it should modify the packet contents, e.g.
> if the header's checksum field must be zeroed or contain the precomputed checksum of the pseudo header.
> Alternatively, if it returns a bitfield with flags for different types of modification, the information could be encoded that way.
That seems an unnecessary overcomplication to me; see above for what I had in mind.
> But then
> they might as well be returned as offload capability variation flags in the rte_eth_dev_info structure's tx_offload_capa field, as I'm
> advocating for.
>
> >
> > > E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the
> > rte_eth_dev_info structure doesn't expose information about the max
> > > number of segments it can handle.
> > >
> > > PS: For backwards compatibility, we might define
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to
> > support the
> > > current "minimum requirements", and add
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload.
> > >
> > >
> > > >
> > > > > For reference, consider RSS, where the feature support flags have
> > very
> > > > high granularity.
> > > > >
> > > > > >
> > > > > > > You mention the bonding driver, which is a good example.
> > > > > > > The rte_eth_tx_burst() documentation has a note about the API
> > > > postcondition
> > > > > > exception for the bonding driver:
> > > > > > > "This function must not modify mbufs (including packets data)
> > > > unless the
> > > > > > refcnt is 1. An exception is the bonding PMD, [...], mbufs
> > > > > > > may be modified."
> > > > > >
> > > > > > For me, what we've done for bonding tx_prepare/tx_burst() is a
> > > > really bad
> > > > > > example.
> > > > > > Initial agreement and design choice was that tx_burst() should
> > not
> > > > modify
> > > > > > contents of the packets
> > > > > > (that actually was one of the reasons why tx_prepare() was
> > > > introduced).
> > > > > > The only reason I agreed on that exception - because I couldn't
> > > > come-up with
> > > > > > something less uglier.
> > > > > >
> > > > > > Actually, these problems with bonding PMD made me to start
> > thinking
> > > > that
> > > > > > current
> > > > > > tx_prepare/tx_burst approach might need to be reconsidered
> > somehow.
> > > > >
> > > > > In cases where a preceding call to tx_prepare() is required, how
> > is it
> > > > worse modifying the packet in tx_burst() than modifying the
> > > > > packet in tx_prepare()?
> > > > >
> > > > > Both cases violate the postcondition that packets are not modified
> > at
> > > > egress.
> > > > >
> > > > > >
> > > > > > > > Then we can probably have one common tx_prepare() for all
> > > > vendors ;)
> > > > > > >
> > > > > > > Yes, that would be the goal.
> > > > > > > More realistically, the ethdev layer could perform the common
> > > > checks, and
> > > > > > only the non-conforming drivers would have to implement
> > > > > > > their specific tweaks.
> > > > > >
> > > > > > Hmm, but that's what we have right now:
> > > > > > - fields in mbuf and packet data that user has to fill correctly
> > and
> > > > dev
> > > > > > specific tx_prepare().
> > > > > > How what you suggest will differ then?
> > > > >
> > > > > You're 100 % right here. We could move more checks into the ethdev
> > > > layer, specifically checks related to the "minimum
> > > > > requirements".
> > > > >
> > > > > > And how it will help let say with bonding PMD situation, or with
> > TX-
> > > > ing of the
> > > > > > same packet over 2 different NICs?
> > > > >
> > > > > The bonding driver is broken.
> > > > > It can only be fixed by not violating the egress postcondition in
> > > > either tx_burst() or tx_prepare().
> > > > > "Minimum requirements" might help doing that.
> > > > >
> > > > > >
> > > > > > > If we don't standardize the meaning of the offload flags, the
> > > > application
> > > > > > developers cannot trust them!
> > > > > > > I'm afraid this is the current situation - application
> > developers
> > > > either
> > > > > > test with specific NIC hardware, or don't use the offload
> > features.
> > > > > >
> > > > > > Well, I have used TX offloads through several projects, it
> > worked
> > > > quite well.
> > > > >
> > > > > That is good to hear.
> > > > > And I don't oppose to that.
> > > > >
> > > > > In this discussion, I am worried about the roadmap direction for
> > DPDK.
> > > > > I oppose to the concept of requiring calling tx_prepare() before
> > > > calling tx_burst() when using offload. I think it is conceptually
> > wrong,
> > > > > and breaks the egress postcondition.
> > > > > I propose "minimum requirements" as a better solution.
> > > > >
> > > > > > Though have to admit, never have to use TX offloads together
> > with
> > > > our bonding
> > > > > > PMD.
> > > > > >
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-16 7:14 ` Morten Brørup
@ 2024-04-16 9:26 ` Konstantin Ananyev
0 siblings, 0 replies; 30+ messages in thread
From: Konstantin Ananyev @ 2024-04-16 9:26 UTC (permalink / raw)
To: Morten Brørup, Ferruh Yigit, David Marchand, dev
Cc: thomas, stable, Olivier Matz, Jijiang Liu, Andrew Rybchenko,
Kaiwen Deng, qiming.yang, yidingx.zhou, Aman Singh, Yuying Zhang,
Jerin Jacob
> > From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> > Sent: Monday, 15 April 2024 17.08
> >
> > On 4/12/2024 3:44 PM, Morten Brørup wrote:
> > >>>>>>>> Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum
> > >> offload
> > >>>>>>>> examples.
> > >>>>>>>
> > >>>>>>> I strongly disagree with this change!
> > >>>>>>>
> > >>>>>>> It will cause a huge performance degradation for shaping
> > >> applications:
> > >>>>>>>
> > >>>>>>> A packet will be processed and finalized at an output or
> > >> forwarding
> > >>>>>> pipeline stage, where some other fields might also be written,
> > >> so
> > >>>>>>> zeroing e.g. the out_ip checksum at this stage has low cost
> > >> (no new
> > >>>>>> cache misses).
> > >>>>>>>
> > >>>>>>> Then, the packet might be queued for QoS or similar.
> > >>>>>>>
> > >>>>>>> If rte_eth_tx_prepare() must be called at the egress pipeline
> > >> stage,
> > >>>>>> it has to write to the packet and cause a cache miss per packet,
> > >>>>>>> instead of simply passing on the packet to the NIC hardware.
> > >>>>>>>
> > >>>>>>> It must be possible to finalize the packet at the
> > >> output/forwarding
> > >>>>>> pipeline stage!
> > >>>>>>
> > >>>>>> If you can finalize your packet on output/forwarding, then why
> > >> you
> > >>>>>> can't invoke tx_prepare() on the same stage?
> > >>>>>> There seems to be some misunderstanding about what tx_prepare()
> > >> does -
> > >>>>>> in fact it doesn't communicate with HW queue (doesn't update TXD
> > >> ring,
> > >>>>>> etc.), what it does - just make changes in mbuf itself.
> > >>>>>> Yes, it reads some fields in SW TX queue struct (max number of
> > >> TXDs per
> > >>>>>> packet, etc.), but AFAIK it is safe
> > >>>>>> to call tx_prepare() and tx_burst() from different threads.
> > >>>>>> At least on implementations I am aware about.
> > >>>>>> Just checked the docs - it seems not stated explicitly anywhere,
> > >> might
> > >>>>>> be that's why it causing such misunderstanding.
> > >>>>>>
> > >>>>>>>
> > >>>>>>> Also, how is rte_eth_tx_prepare() supposed to work for cloned
> > >> packets
> > >>>>>> egressing on different NIC hardware?
> > >>>>>>
> > >>>>>> If you create a clone of full packet (including L2/L3) headers
> > >> then
> > >>>>>> obviously such construction might not
> > >>>>>> work properly with tx_prepare() over two different NICs.
> > >>>>>> Though In majority of cases you do clone segments with data,
> > >> while at
> > >>>>>> least L2 headers are put into different segments.
> > >>>>>> One simple approach would be to keep L3 header in that separate
> > >> segment.
> > >>>>>> But yes, there is a problem when you'll need to send exactly the
> > >> same
> > >>>>>> packet over different NICs.
> > >>>>>> As I remember, for bonding PMD things don't work quite well here
> > >> - you
> > >>>>>> might have a bond over 2 NICs with
> > >>>>>> different tx_prepare() and which one to call might be not clear
> > >> till
> > >>>>>> actual PMD tx_burst() is invoked.
> > >>>>>>
> > >>>>>>>
> > >>>>>>> In theory, it might get even worse if we make this opaque
> > >> instead of
> > >>>>>> transparent and standardized:
> > >>>>>>> One PMD might reset out_ip checksum to 0x0000, and another PMD
> > >> might
> > >>>>>> reset it to 0xFFFF.
> > >>>>>>
> > >>>>>>>
> > >>>>>>> I can only see one solution:
> > >>>>>>> We need to standardize on common minimum requirements for how
> > >> to
> > >>>>>> prepare packets for each TX offload.
> > >>>>>>
> > >>>>>> If we can make each and every vendor to agree here - that
> > >> definitely
> > >>>>>> will help to simplify things quite a bit.
> > >>>>>
> > >>>>> An API is more than a function name and parameters.
> > >>>>> It also has preconditions and postconditions.
> > >>>>>
> > >>>>> All major NIC vendors are contributing to DPDK.
> > >>>>> It should be possible to reach consensus for reasonable minimum
> > >> requirements
> > >>>> for offloads.
> > >>>>> Hardware- and driver-specific exceptions can be documented with
> > >> the offload
> > >>>> flag, or with rte_eth_rx/tx_burst(), like the note to
> > >>>>> rte_eth_rx_burst():
> > >>>>> "Some drivers using vector instructions require that nb_pkts is
> > >> divisible by
> > >>>> 4 or 8, depending on the driver implementation."
> > >>>>
> > >>>> If we introduce a rule that everyone supposed to follow and then
> > >> straightway
> > >>>> allow people to have a 'documented exceptions',
> > >>>> for me it means like 'no rule' in practice.
> > >>>> A 'documented exceptions' approach might work if you have 5
> > >> different PMDs to
> > >>>> support, but not when you have 50+.
> > >>>> No-one would write an app with possible 10 different exception
> > cases
> > >> in his
> > >>>> head.
> > >>>> Again, with such approach we can forget about backward
> > >> compatibility.
> > >>>> I think we already had this discussion before, my opinion remains
> > >> the same
> > >>>> here -
> > >>>> 'documented exceptions' approach is a way to trouble.
> > >>>
> > >>> The "minimum requirements" should be the lowest common denominator
> > of
> > >> all NICs.
> > >>> Exceptions should be extremely few, for outlier NICs that still want
> > >> to provide an offload and its driver is unable to live up to the
> > >>> minimum requirements.
> > >>> Any exception should require techboard approval. If a NIC/driver
> > does
> > >> not support the "minimum requirements" for an offload
> > >>> feature, it is not allowed to claim support for that offload
> > feature,
> > >> or needs to seek approval for an exception.
> > >>>
> > >>> As another option for NICs not supporting the minimum requirements
> > of
> > >> an offload feature, we could introduce offload flags with
> > >>> finer granularity. E.g. one offload flag for "gold standard" TX
> > >> checksum update (where the packet's checksum field can have any
> > >>> value), and another offload flag for "silver standard" TX checksum
> > >> update (where the packet's checksum field must have a
> > >>> precomputed value).
> > >>
> > >> Actually yes, I was thinking in the same direction - we need some
> > extra
> > >> API to allow user to distinguish.
> > >> Probably we can do something like that: a new API for the ethdev call
> > >> that would take as a parameter
> > >> TX offloads bitmap and in return specify would it need to modify
> > >> contents of packet to support these
> > >> offloads or not.
> > >> Something like:
> > >> int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)
> > >>
> > >> For the majority of the drivers that satisfy these "minimum
> > >> requirements" corresponding devops
> > >> entry will be empty and we'll always return 0, otherwise PMD has to
> > >> provide a proper devop.
> > >> Then again, it would be up to the user, to determine can he pass same
> > >> packet to 2 different NICs or not.
> > >>
> > >> I suppose it is similar to what you were talking about?
> > >
> > > I was thinking something more simple:
> > >
> > > The NIC exposes its RX and TX offload capabilities to the application
> > through the rx/tx_offload_capa and other fields in the rte_eth_dev_info
> > structure returned by rte_eth_dev_info_get().
> > >
> > > E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag
> > set.
> > > These capability flags (or enums) are mostly undocumented in the code,
> > but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability means that
> > the NIC is able to update the IPv4 header checksum at egress (on the
> > wire, i.e. without modifying the mbuf or packet data), and that the
> > application must set RTE_MBUF_F_TX_IP_CKSUM in the mbufs to utilize this
> > offload.
> > > I would define and document what each capability flag/enum exactly
> > means, the minimum requirements (as defined by the DPDK community) for
> > the driver to claim support for it, and the requirements for an
> > application to use it.
> > >
> >
> > +1 to improve documentation, and clear offload where it is needed.
> >
> > Another gap is in testing, whenever a device/driver claims an offload
> > capability, we don't have a test suit to confirm and verify this claim.
>
> Yep, conformance testing is lacking big time in our CI.
> Adding the ts-factory tests to the CI was a big step in this direction.
> But for now, we are mostly relying on vendor's internal testing.
>
> >
> >
> > > For the sake of discussion, let's say that
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update
> > capability (i.e. no requirements to the checksum field in the packet
> > contents).
> > > If some NIC requires the checksum field in the packet contents to have
> > a precomputed value, the NIC would not be allowed to claim the
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
> > > Such a NIC would need to define and document a new capability, e.g.
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver standard" TX
> > checksum update capability.
> > >
> > > In other words: I would encode variations of offload capabilities
> > directly in the capabilities flags.
> > > Then we don't need additional APIs to help interpret those
> > capabilities.
> > >
> > > This way, the application can probe the NIC capabilities to determine
> > what can be offloaded, and how to do it.
> > >
> > > The application can be designed to:
> > > 1. use a common packet processing pipeline, utilizing only the lowest
> > common capabilities denominator of all detected NICs, or
> > > 2. use a packet processing pipeline, handling packets differently
> > according to the capabilities of the involved NICs.
> > >
> >
> > Offload capabilities are already provided to enable applications as you
> > mentioned above.
> >
> > Agree that '_ASSISTED' capability flags give more details to the
> > application to manage it, but my concern is that it complicates offloading
> > more.
> >
> > The number of "assisted" offloads is not large, it is mainly for cksum.
> > The current approach is simpler: devices that require a precondition implement it
> > in 'tx_prepare', and an application using these offloads calls it before
> > 'tx_burst'. Devices/drivers that don't require it don't have 'tx_prepare', so
> > there is no impact on the application.
> >
> > Would it work to have something in between, another capability that records whether
> > 'tx_prepare' needs to be called with this device or not?
> > It would be possible to make this a 32-bit variable that holds the offloads
> > requiring assistance in a bit-wise way, but I would prefer a simple
> > 'tx_prepare required' flag; this may help the application decide whether to use
> > offloads on that device and whether to call 'tx_prepare'.
>
> Consider an IP Multicast packet to be transmitted on many ports.
>
> With my suggestion, the packet can be prepared once - i.e. the output/forwarding stage sets the IP header checksum as required by
> the (lowest common denominator) offload capabilities flag - before being cloned for tx() on each port.
>
> With tx_prepare(), the packet needs to be copied (deep copy!) for each port, and then tx_prepare() needs to be called for each port
> before tx().
Not really. At least not always.
Let's say for multicast: you will most likely need a different L2 header for each ethdev port anyway.
The usual way to handle that is to put the L2 header in a separate segment attached to the rest of the packet data segment.
Nothing stops you from keeping the inner/outer L3/L4 headers in that 'header' segment too.
Then you don't need multiple copies of the data segment.
Actually, on second thought, that might be a simple way to fix the problem with the bonding PMD:
allow it to create such 'header' segments on demand.
Not sure how it would affect performance, but it should help avoid the current problem of packet contents
being modified in tx_burst().
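To illustrate the idea, here is a rough sketch (not code from any existing driver or example;
the pool parameters, the empty header segment and the function name are only assumptions for
illustration):

#include <rte_mbuf.h>

static struct rte_mbuf *
build_port_pkt(struct rte_mbuf *pkt, struct rte_mempool *hdr_pool,
		struct rte_mempool *clone_pool)
{
	/* Private header segment for this port: tx_prepare()/tx_burst()
	 * may rewrite checksum fields here without touching shared data. */
	struct rte_mbuf *hdr = rte_pktmbuf_alloc(hdr_pool);
	struct rte_mbuf *clone;

	if (hdr == NULL)
		return NULL;

	/* Indirect clone: payload bytes are shared, only the refcnt is bumped. */
	clone = rte_pktmbuf_clone(pkt, clone_pool);
	if (clone == NULL) {
		rte_pktmbuf_free(hdr);
		return NULL;
	}

	hdr->next = clone;
	hdr->nb_segs = clone->nb_segs + 1;
	hdr->pkt_len = clone->pkt_len;
	hdr->ol_flags = pkt->ol_flags;
	/* The caller then writes the port specific L2 (and, if desired,
	 * L3/L4) headers into 'hdr' and updates data_len/pkt_len/lX_len. */
	return hdr;
}

With something like this, only the small header mbuf is unique per port, so per-port
tx_prepare()/tx_burst() can do whatever header rewriting they need while the payload
stays read-only and shared.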
> The opaque tx_prepare() concept violates the concept of not modifying the packet at egress.
That actually depends on where you draw the line for egress: before or after tx_prepare().
> DPDK follows a principle of not using opaque types, so application developers can optimize accordingly. The opaque behavior of
> tx_prepare() violates this principle, and I consider it a horrible hack!
> >
> > > NB: There may be other variations than requiring packet contents to be
> > modified, and they might be granular.
> > > E.g. a NIC might require assistance for TCP/UDP checksum offload, but
> > not for IP checksum offload, so a function telling if packet contents
> > requires modification would not suffice.
> > > E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the
> > rte_eth_dev_info structure doesn't expose information about the max
> > number of segments it can handle.
> > >
> >
> > Another good point to report max number of segment can be handled, +1
> >
> >
> > > PS: For backwards compatibility, we might define
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to
> > support the current "minimum requirements", and add
> > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload.
> > >
> > >
> > >>
> > >>> For reference, consider RSS, where the feature support flags have
> > very
> > >> high granularity.
> > >>>
> > >>>>
> > >>>>> You mention the bonding driver, which is a good example.
> > >>>>> The rte_eth_tx_burst() documentation has a note about the API
> > >> postcondition
> > >>>> exception for the bonding driver:
> > >>>>> "This function must not modify mbufs (including packets data)
> > >> unless the
> > >>>> refcnt is 1. An exception is the bonding PMD, [...], mbufs
> > >>>>> may be modified."
> > >>>>
> > >>>> For me, what we've done for bonding tx_prepare/tx_burst() is a
> > >> really bad
> > >>>> example.
> > >>>> Initial agreement and design choice was that tx_burst() should not
> > >> modify
> > >>>> contents of the packets
> > >>>> (that actually was one of the reasons why tx_prepare() was
> > >> introduced).
> > >>>> The only reason I agreed on that exception - because I couldn't
> > >> come-up with
> > >>>> something less uglier.
> > >>>>
> > >>>> Actually, these problems with bonding PMD made me to start thinking
> > >> that
> > >>>> current
> > >>>> tx_prepare/tx_burst approach might need to be reconsidered somehow.
> > >>>
> > >>> In cases where a preceding call to tx_prepare() is required, how is
> > it
> > >> worse modifying the packet in tx_burst() than modifying the
> > >>> packet in tx_prepare()?
> > >>>
> > >>> Both cases violate the postcondition that packets are not modified
> > at
> > >> egress.
> > >>>
> > >>>>
> > >>>>>> Then we can probably have one common tx_prepare() for all
> > >> vendors ;)
> > >>>>>
> > >>>>> Yes, that would be the goal.
> > >>>>> More realistically, the ethdev layer could perform the common
> > >> checks, and
> > >>>> only the non-conforming drivers would have to implement
> > >>>>> their specific tweaks.
> > >>>>
> > >>>> Hmm, but that's what we have right now:
> > >>>> - fields in mbuf and packet data that user has to fill correctly
> > and
> > >> dev
> > >>>> specific tx_prepare().
> > >>>> How what you suggest will differ then?
> > >>>
> > >>> You're 100 % right here. We could move more checks into the ethdev
> > >> layer, specifically checks related to the "minimum
> > >>> requirements".
> > >>>
> > >>>> And how it will help let say with bonding PMD situation, or with
> > TX-
> > >> ing of the
> > >>>> same packet over 2 different NICs?
> > >>>
> > >>> The bonding driver is broken.
> > >>> It can only be fixed by not violating the egress postcondition in
> > >> either tx_burst() or tx_prepare().
> > >>> "Minimum requirements" might help doing that.
> > >>>
> > >>>>
> > >>>>> If we don't standardize the meaning of the offload flags, the
> > >> application
> > >>>> developers cannot trust them!
> > >>>>> I'm afraid this is the current situation - application developers
> > >> either
> > >>>> test with specific NIC hardware, or don't use the offload features.
> > >>>>
> > >>>> Well, I have used TX offloads through several projects, it worked
> > >> quite well.
> > >>>
> > >>> That is good to hear.
> > >>> And I don't oppose to that.
> > >>>
> > >>> In this discussion, I am worried about the roadmap direction for
> > DPDK.
> > >>> I oppose to the concept of requiring calling tx_prepare() before
> > >> calling tx_burst() when using offload. I think it is conceptually
> > wrong,
> > >>> and breaks the egress postcondition.
> > >>> I propose "minimum requirements" as a better solution.
> > >>>
> > >>>> Though have to admit, never have to use TX offloads together with
> > >> our bonding
> > >>>> PMD.
> > >>>>
> > >
* RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
2024-04-16 9:16 ` Konstantin Ananyev
@ 2024-04-16 11:36 ` Konstantin Ananyev
0 siblings, 0 replies; 30+ messages in thread
From: Konstantin Ananyev @ 2024-04-16 11:36 UTC (permalink / raw)
To: Konstantin Ananyev, Morten Brørup, David Marchand, dev
Cc: thomas, ferruh.yigit, stable, Olivier Matz, Jijiang Liu,
Andrew Rybchenko, Ferruh Yigit, Kaiwen Deng, qiming.yang,
yidingx.zhou, Aman Singh, Yuying Zhang, Thomas Monjalon,
Jerin Jacob
> > > > > > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx
> > > checksum
> > > > > offload
> > > > > > > > > > > examples.
> > > > > > > > > >
> > > > > > > > > > I strongly disagree with this change!
> > > > > > > > > >
> > > > > > > > > > It will cause a huge performance degradation for shaping
> > > > > applications:
> > > > > > > > > >
> > > > > > > > > > A packet will be processed and finalized at an output or
> > > > > forwarding
> > > > > > > > > pipeline stage, where some other fields might also be
> > > written,
> > > > > so
> > > > > > > > > > zeroing e.g. the out_ip checksum at this stage has low
> > > cost
> > > > > (no new
> > > > > > > > > cache misses).
> > > > > > > > > >
> > > > > > > > > > Then, the packet might be queued for QoS or similar.
> > > > > > > > > >
> > > > > > > > > > If rte_eth_tx_prepare() must be called at the egress
> > > pipeline
> > > > > stage,
> > > > > > > > > it has to write to the packet and cause a cache miss per
> > > packet,
> > > > > > > > > > instead of simply passing on the packet to the NIC
> > > hardware.
> > > > > > > > > >
> > > > > > > > > > It must be possible to finalize the packet at the
> > > > > output/forwarding
> > > > > > > > > pipeline stage!
> > > > > > > > >
> > > > > > > > > If you can finalize your packet on output/forwarding, then
> > > why
> > > > > you
> > > > > > > > > can't invoke tx_prepare() on the same stage?
> > > > > > > > > There seems to be some misunderstanding about what
> > > tx_prepare()
> > > > > does -
> > > > > > > > > in fact it doesn't communicate with HW queue (doesn't update
> > > TXD
> > > > > ring,
> > > > > > > > > etc.), what it does - just make changes in mbuf itself.
> > > > > > > > > Yes, it reads some fields in SW TX queue struct (max number
> > > of
> > > > > TXDs per
> > > > > > > > > packet, etc.), but AFAIK it is safe
> > > > > > > > > to call tx_prepare() and tx_burst() from different threads.
> > > > > > > > > At least on implementations I am aware about.
> > > > > > > > > Just checked the docs - it seems not stated explicitly
> > > anywhere,
> > > > > might
> > > > > > > > > be that's why it causing such misunderstanding.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Also, how is rte_eth_tx_prepare() supposed to work for
> > > cloned
> > > > > packets
> > > > > > > > > egressing on different NIC hardware?
> > > > > > > > >
> > > > > > > > > If you create a clone of full packet (including L2/L3)
> > > headers
> > > > > then
> > > > > > > > > obviously such construction might not
> > > > > > > > > work properly with tx_prepare() over two different NICs.
> > > > > > > > > Though In majority of cases you do clone segments with data,
> > > > > while at
> > > > > > > > > least L2 headers are put into different segments.
> > > > > > > > > One simple approach would be to keep L3 header in that
> > > separate
> > > > > segment.
> > > > > > > > > But yes, there is a problem when you'll need to send exactly
> > > the
> > > > > same
> > > > > > > > > packet over different NICs.
> > > > > > > > > As I remember, for bonding PMD things don't work quite well
> > > here
> > > > > - you
> > > > > > > > > might have a bond over 2 NICs with
> > > > > > > > > different tx_prepare() and which one to call might be not
> > > clear
> > > > > till
> > > > > > > > > actual PMD tx_burst() is invoked.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > In theory, it might get even worse if we make this opaque
> > > > > instead of
> > > > > > > > > transparent and standardized:
> > > > > > > > > > One PMD might reset out_ip checksum to 0x0000, and another
> > > PMD
> > > > > might
> > > > > > > > > reset it to 0xFFFF.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I can only see one solution:
> > > > > > > > > > We need to standardize on common minimum requirements for
> > > how
> > > > > to
> > > > > > > > > prepare packets for each TX offload.
> > > > > > > > >
> > > > > > > > > If we can make each and every vendor to agree here - that
> > > > > definitely
> > > > > > > > > will help to simplify things quite a bit.
> > > > > > > >
> > > > > > > > An API is more than a function name and parameters.
> > > > > > > > It also has preconditions and postconditions.
> > > > > > > >
> > > > > > > > All major NIC vendors are contributing to DPDK.
> > > > > > > > It should be possible to reach consensus for reasonable
> > > minimum
> > > > > requirements
> > > > > > > for offloads.
> > > > > > > > Hardware- and driver-specific exceptions can be documented
> > > with
> > > > > the offload
> > > > > > > flag, or with rte_eth_rx/tx_burst(), like the note to
> > > > > > > > rte_eth_rx_burst():
> > > > > > > > "Some drivers using vector instructions require that nb_pkts
> > > is
> > > > > divisible by
> > > > > > > 4 or 8, depending on the driver implementation."
> > > > > > >
> > > > > > > If we introduce a rule that everyone supposed to follow and then
> > > > > straightway
> > > > > > > allow people to have a 'documented exceptions',
> > > > > > > for me it means like 'no rule' in practice.
> > > > > > > A 'documented exceptions' approach might work if you have 5
> > > > > different PMDs to
> > > > > > > support, but not when you have 50+.
> > > > > > > No-one would write an app with possible 10 different exception
> > > cases
> > > > > in his
> > > > > > > head.
> > > > > > > Again, with such approach we can forget about backward
> > > > > compatibility.
> > > > > > > I think we already had this discussion before, my opinion
> > > remains
> > > > > the same
> > > > > > > here -
> > > > > > > 'documented exceptions' approach is a way to trouble.
> > > > > >
> > > > > > The "minimum requirements" should be the lowest common denominator
> > > of
> > > > > all NICs.
> > > > > > Exceptions should be extremely few, for outlier NICs that still
> > > want
> > > > > to provide an offload and its driver is unable to live up to the
> > > > > > minimum requirements.
> > > > > > Any exception should require techboard approval. If a NIC/driver
> > > does
> > > > > not support the "minimum requirements" for an offload
> > > > > > feature, it is not allowed to claim support for that offload
> > > feature,
> > > > > or needs to seek approval for an exception.
> > > > > >
> > > > > > As another option for NICs not supporting the minimum requirements
> > > of
> > > > > an offload feature, we could introduce offload flags with
> > > > > > finer granularity. E.g. one offload flag for "gold standard" TX
> > > > > checksum update (where the packet's checksum field can have any
> > > > > > value), and another offload flag for "silver standard" TX checksum
> > > > > update (where the packet's checksum field must have a
> > > > > > precomputed value).
> > > > >
> > > > > Actually yes, I was thinking in the same direction - we need some
> > > extra
> > > > > API to allow user to distinguish.
> > > > > Probably we can do something like that: a new API for the ethdev
> > > call
> > > > > that would take as a parameter
> > > > > TX offloads bitmap and in return specify would it need to modify
> > > > > contents of packet to support these
> > > > > offloads or not.
> > > > > Something like:
> > > > > int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)
> > > > >
> > > > > For the majority of the drivers that satisfy these "minimum
> > > > > requirements" corresponding devops
> > > > > entry will be empty and we'll always return 0, otherwise PMD has to
> > > > > provide a proper devop.
> > > > > Then again, it would be up to the user, to determine can he pass
> > > same
> > > > > packet to 2 different NICs or not.
> > > > >
> > > > > I suppose it is similar to what you were talking about?
> > > >
> > > > I was thinking something more simple:
> > > >
> > > > The NIC exposes its RX and TX offload capabilities to the application
> > > through the rx/tx_offload_capa and other fields in the
> > > > rte_eth_dev_info structure returned by rte_eth_dev_info_get().
> > > >
> > > > E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag
> > > set.
> > > > These capability flags (or enums) are mostly undocumented in the code,
> > > but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM
> > > > capability means that the NIC is able to update the IPv4 header
> > > checksum at egress (on the wire, i.e. without modifying the mbuf or
> > > > packet data), and that the application must set RTE_MBUF_F_TX_IP_CKSUM
> > > in the mbufs to utilize this offload.
> > > > I would define and document what each capability flag/enum exactly
> > > means, the minimum requirements (as defined by the DPDK
> > > > community) for the driver to claim support for it, and the
> > > requirements for an application to use it.
> > > > For the sake of discussion, let's say that
> > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update
> > > capability
> > > > (i.e. no requirements to the checksum field in the packet contents).
> > > > If some NIC requires the checksum field in the packet contents to have
> > > a precomputed value, the NIC would not be allowed to claim
> > > > the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
> > > > Such a NIC would need to define and document a new capability, e.g.
> > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver
> > > > standard" TX checksum update capability.
> > > > In other words: I would encode variations of offload capabilities
> > > directly in the capabilities flags.
> > > > Then we don't need additional APIs to help interpret those
> > > capabilities.
> > >
> > > I understood your intention with different flags, yes it should work too
> > > I think.
> > > The reason I am not very fond of it - it will require to double
> > > TX_OFFLOAD flags.
> >
> > An additional feature flag is only required if a NIC is not conforming to the "minimum requirements" of an offload feature, and the
> > techboard permits introducing a variant of an existing feature.
> > There should be very few additional feature flags for variants - exceptions only - or the "minimum requirements" are not broad
> > enough to support the majority of NICs.
>
> Ok, so you suggest to group all existing reqs plus what all current tx_prepare() do into "minimum requirements"?
> So with current drivers in place we wouldn't need these new flags, but we'll reserve such opportunity.
> That might work, if there are no contradictory requirements in current PMDs, and PMDs maintainers with
> less reqs will agree with these 'extra' stuff.
Just to check how easy or hard it would be to reach a consensus, I compiled a list of the mbuf changes
done by different PMDs in tx_prepare(). See below; a simplified sketch of what the common helper touches follows the list.
It might not be fully correct or complete - PMD maintainers, feel free to update it if I missed something.
From how it looks to me:
if we go the way you suggest, then hns3 and virtio will most likely become
'second class citizens' - they will need special offload flags.
Plus, either all PMDs that now set tx_prepare() = NULL will have to agree to require
rte_net_intel_cksum_prepare() to be done, or all Intel PMDs and a few others will also be downgraded
to 'second class'.
PMD: atlantic
MOD: rte_net_intel_cksum_prepare()
/* for: ipv4_hdr->hdr_checksum = 0; (tcp|udp)_hdr->cksum = rte_ipv(4|6)_phdr_cksum(...); */
PMD: cpfl/idpf
MOD: none
PMD: em/igb/igc/fm10k/i40e/iavf/ice/ixgbe
MOD: rte_net_intel_cksum_prepare()
PMD: enic
MOD: rte_net_intel_cksum_prepare()
PMD: hns3
MOD: rte_net_intel_cksum_prepare() plus some extra:
/*
* A UDP packet with the same dst_port as VXLAN\VXLAN_GPE\GENEVE will
* be recognized as a tunnel packet in HW. In this case, if UDP CKSUM
* offload is set and the tunnel mask has not been set, the CKSUM will
* be wrong since the header length is wrong and driver should complete
* the CKSUM to avoid CKSUM error.
*/
PMD: ionic
MOD: none
PMD: ngbe
MOD: rte_net_intel_cksum_prepare()
PMD: qede
MOD: none
PMD: txgbe
MOD: rte_net_intel_cksum_prepare()
PMD: virtio
MOD: rte_net_intel_cksum_prepare() plus some extra:
- for RTE_MBUF_F_TX_TCP_SEG: virtio_tso_fix_cksum()
- for RTE_MBUF_F_TX_VLAN: rte_vlan_insert()
PMD: vmxnet3
MOD: rte_net_intel_cksum_prepare()
For all other PMDs in our main tree set tx_prepare = NULL.
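For reference, the rte_net_intel_cksum_prepare() entries above boil down to something like the
following for the simplest case (a simplified sketch of the inner IPv4 + TCP path only; the wrapper
function name here is made up, and the real helper in lib/net/rte_net.h also covers IPv6, UDP, TSO
and the outer headers):

#include <rte_ip.h>
#include <rte_mbuf.h>
#include <rte_tcp.h>

/* Simplified illustration of the "MOD" above: zero the IPv4 header
 * checksum and seed the TCP checksum with the pseudo-header checksum. */
static inline void
cksum_prepare_ipv4_tcp(struct rte_mbuf *m)
{
	struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(m,
			struct rte_ipv4_hdr *, m->l2_len);
	struct rte_tcp_hdr *tcp = rte_pktmbuf_mtod_offset(m,
			struct rte_tcp_hdr *, m->l2_len + m->l3_len);

	if (m->ol_flags & RTE_MBUF_F_TX_IP_CKSUM)
		ip->hdr_checksum = 0;
	if ((m->ol_flags & RTE_MBUF_F_TX_L4_MASK) == RTE_MBUF_F_TX_TCP_CKSUM)
		tcp->cksum = rte_ipv4_phdr_cksum(ip, m->ol_flags);
}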
* [PATCH v3 3/7] app/testpmd: fix outer IP checksum offload
[not found] ` <20240418082023.1767998-1-david.marchand@redhat.com>
@ 2024-04-18 8:20 ` David Marchand
2024-06-11 18:25 ` Ferruh Yigit
2024-04-18 8:20 ` [PATCH v3 4/7] net: fix outer UDP checksum in Intel prepare helper David Marchand
` (2 subsequent siblings)
3 siblings, 1 reply; 30+ messages in thread
From: David Marchand @ 2024-04-18 8:20 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Aman Singh, Yuying Zhang,
Olivier Matz, Konstantin Ananyev, Tomasz Kulasek
Resetting the outer IP checksum to 0 is not something mandated by the
mbuf API and is done by rte_eth_tx_prepare(), or per driver if needed.
Fixes: 4fb7e803eb1a ("ethdev: add Tx preparation")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
app/test-pmd/csumonly.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 6711dda42e..f5125c2788 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -583,15 +583,17 @@ process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
uint64_t ol_flags = 0;
if (info->outer_ethertype == _htons(RTE_ETHER_TYPE_IPV4)) {
- ipv4_hdr->hdr_checksum = 0;
ol_flags |= RTE_MBUF_F_TX_OUTER_IPV4;
- if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM)
+ if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM) {
ol_flags |= RTE_MBUF_F_TX_OUTER_IP_CKSUM;
- else
+ } else {
+ ipv4_hdr->hdr_checksum = 0;
ipv4_hdr->hdr_checksum = rte_ipv4_cksum(ipv4_hdr);
- } else
+ }
+ } else {
ol_flags |= RTE_MBUF_F_TX_OUTER_IPV6;
+ }
if (info->outer_l4_proto != IPPROTO_UDP)
return ol_flags;
--
2.44.0
* [PATCH v3 4/7] net: fix outer UDP checksum in Intel prepare helper
[not found] ` <20240418082023.1767998-1-david.marchand@redhat.com>
2024-04-18 8:20 ` [PATCH v3 3/7] app/testpmd: fix outer IP checksum offload David Marchand
@ 2024-04-18 8:20 ` David Marchand
2024-04-18 8:20 ` [PATCH v3 5/7] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-18 8:20 ` [PATCH v3 6/7] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
3 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-18 8:20 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Aman Singh, Yuying Zhang, Jie Hai,
Yisen Zhuang, Ferruh Yigit, Ting Xu
Setting a pseudo header checksum in the outer UDP checksum field is an Intel
(and some other vendors') requirement.
Applications (like OVS) requesting outer UDP checksum offload without doing this
extra setup end up with broken outer UDP checksums.
Move this specific setup from testpmd to the "common" helper
rte_net_intel_cksum_flags_prepare().
net/hns3 can then be adjusted.
Bugzilla ID: 1406
Fixes: d8e5e69f3a9b ("app/testpmd: add GTP parsing and Tx checksum offload")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
app/test-pmd/csumonly.c | 11 +----
drivers/net/hns3/hns3_rxtx.c | 93 ++++++++++--------------------------
lib/net/rte_net.h | 18 ++++++-
3 files changed, 44 insertions(+), 78 deletions(-)
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index f5125c2788..71add6ca47 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -577,8 +577,6 @@ static uint64_t
process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
uint64_t tx_offloads, int tso_enabled, struct rte_mbuf *m)
{
- struct rte_ipv4_hdr *ipv4_hdr = outer_l3_hdr;
- struct rte_ipv6_hdr *ipv6_hdr = outer_l3_hdr;
struct rte_udp_hdr *udp_hdr;
uint64_t ol_flags = 0;
@@ -588,6 +586,8 @@ process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM) {
ol_flags |= RTE_MBUF_F_TX_OUTER_IP_CKSUM;
} else {
+ struct rte_ipv4_hdr *ipv4_hdr = outer_l3_hdr;
+
ipv4_hdr->hdr_checksum = 0;
ipv4_hdr->hdr_checksum = rte_ipv4_cksum(ipv4_hdr);
}
@@ -608,13 +608,6 @@ process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
/* Skip SW outer UDP checksum generation if HW supports it */
if (tx_offloads & RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM) {
- if (info->outer_ethertype == _htons(RTE_ETHER_TYPE_IPV4))
- udp_hdr->dgram_cksum
- = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
- else
- udp_hdr->dgram_cksum
- = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
-
ol_flags |= RTE_MBUF_F_TX_OUTER_UDP_CKSUM;
return ol_flags;
}
diff --git a/drivers/net/hns3/hns3_rxtx.c b/drivers/net/hns3/hns3_rxtx.c
index 7e636a0a2e..03fc919fd7 100644
--- a/drivers/net/hns3/hns3_rxtx.c
+++ b/drivers/net/hns3/hns3_rxtx.c
@@ -3616,58 +3616,6 @@ hns3_pkt_need_linearized(struct rte_mbuf *tx_pkts, uint32_t bd_num,
return false;
}
-static bool
-hns3_outer_ipv4_cksum_prepared(struct rte_mbuf *m, uint64_t ol_flags,
- uint32_t *l4_proto)
-{
- struct rte_ipv4_hdr *ipv4_hdr;
- ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
- m->outer_l2_len);
- if (ol_flags & RTE_MBUF_F_TX_OUTER_IP_CKSUM)
- ipv4_hdr->hdr_checksum = 0;
- if (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) {
- struct rte_udp_hdr *udp_hdr;
- /*
- * If OUTER_UDP_CKSUM is support, HW can calculate the pseudo
- * header for TSO packets
- */
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG)
- return true;
- udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
- m->outer_l2_len + m->outer_l3_len);
- udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
-
- return true;
- }
- *l4_proto = ipv4_hdr->next_proto_id;
- return false;
-}
-
-static bool
-hns3_outer_ipv6_cksum_prepared(struct rte_mbuf *m, uint64_t ol_flags,
- uint32_t *l4_proto)
-{
- struct rte_ipv6_hdr *ipv6_hdr;
- ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *,
- m->outer_l2_len);
- if (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) {
- struct rte_udp_hdr *udp_hdr;
- /*
- * If OUTER_UDP_CKSUM is support, HW can calculate the pseudo
- * header for TSO packets
- */
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG)
- return true;
- udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
- m->outer_l2_len + m->outer_l3_len);
- udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
-
- return true;
- }
- *l4_proto = ipv6_hdr->proto;
- return false;
-}
-
static void
hns3_outer_header_cksum_prepare(struct rte_mbuf *m)
{
@@ -3675,29 +3623,38 @@ hns3_outer_header_cksum_prepare(struct rte_mbuf *m)
uint32_t paylen, hdr_len, l4_proto;
struct rte_udp_hdr *udp_hdr;
- if (!(ol_flags & (RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IPV6)))
+ if (!(ol_flags & (RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IPV6)) &&
+ ((ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) ||
+ !(ol_flags & RTE_MBUF_F_TX_TCP_SEG)))
return;
if (ol_flags & RTE_MBUF_F_TX_OUTER_IPV4) {
- if (hns3_outer_ipv4_cksum_prepared(m, ol_flags, &l4_proto))
- return;
+ struct rte_ipv4_hdr *ipv4_hdr;
+
+ ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
+ m->outer_l2_len);
+ l4_proto = ipv4_hdr->next_proto_id;
} else {
- if (hns3_outer_ipv6_cksum_prepared(m, ol_flags, &l4_proto))
- return;
+ struct rte_ipv6_hdr *ipv6_hdr;
+
+ ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *,
+ m->outer_l2_len);
+ l4_proto = ipv6_hdr->proto;
}
+ if (l4_proto != IPPROTO_UDP)
+ return;
+
/* driver should ensure the outer udp cksum is 0 for TUNNEL TSO */
- if (l4_proto == IPPROTO_UDP && (ol_flags & RTE_MBUF_F_TX_TCP_SEG)) {
- hdr_len = m->l2_len + m->l3_len + m->l4_len;
- hdr_len += m->outer_l2_len + m->outer_l3_len;
- paylen = m->pkt_len - hdr_len;
- if (paylen <= m->tso_segsz)
- return;
- udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
- m->outer_l2_len +
- m->outer_l3_len);
- udp_hdr->dgram_cksum = 0;
- }
+ hdr_len = m->l2_len + m->l3_len + m->l4_len;
+ hdr_len += m->outer_l2_len + m->outer_l3_len;
+ paylen = m->pkt_len - hdr_len;
+ if (paylen <= m->tso_segsz)
+ return;
+ udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
+ m->outer_l2_len +
+ m->outer_l3_len);
+ udp_hdr->dgram_cksum = 0;
}
static int
diff --git a/lib/net/rte_net.h b/lib/net/rte_net.h
index ef3ff4c6fd..efd9d5f5ee 100644
--- a/lib/net/rte_net.h
+++ b/lib/net/rte_net.h
@@ -121,7 +121,8 @@ rte_net_intel_cksum_flags_prepare(struct rte_mbuf *m, uint64_t ol_flags)
* no offloads are requested.
*/
if (!(ol_flags & (RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_L4_MASK | RTE_MBUF_F_TX_TCP_SEG |
- RTE_MBUF_F_TX_UDP_SEG | RTE_MBUF_F_TX_OUTER_IP_CKSUM)))
+ RTE_MBUF_F_TX_UDP_SEG | RTE_MBUF_F_TX_OUTER_IP_CKSUM |
+ RTE_MBUF_F_TX_OUTER_UDP_CKSUM)))
return 0;
if (ol_flags & (RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IPV6)) {
@@ -135,6 +136,21 @@ rte_net_intel_cksum_flags_prepare(struct rte_mbuf *m, uint64_t ol_flags)
struct rte_ipv4_hdr *, m->outer_l2_len);
ipv4_hdr->hdr_checksum = 0;
}
+ if (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM) {
+ if (ol_flags & RTE_MBUF_F_TX_OUTER_IPV4) {
+ ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv4_hdr *,
+ m->outer_l2_len);
+ udp_hdr = (struct rte_udp_hdr *)((char *)ipv4_hdr +
+ m->outer_l3_len);
+ udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, m->ol_flags);
+ } else {
+ ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct rte_ipv6_hdr *,
+ m->outer_l2_len);
+ udp_hdr = rte_pktmbuf_mtod_offset(m, struct rte_udp_hdr *,
+ m->outer_l2_len + m->outer_l3_len);
+ udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, m->ol_flags);
+ }
+ }
}
/*
--
2.44.0
* [PATCH v3 5/7] net/i40e: fix outer UDP checksum offload for X710
[not found] ` <20240418082023.1767998-1-david.marchand@redhat.com>
2024-04-18 8:20 ` [PATCH v3 3/7] app/testpmd: fix outer IP checksum offload David Marchand
2024-04-18 8:20 ` [PATCH v3 4/7] net: fix outer UDP checksum in Intel prepare helper David Marchand
@ 2024-04-18 8:20 ` David Marchand
2024-04-18 8:20 ` [PATCH v3 6/7] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
3 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-18 8:20 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Jun Wang, Yuying Zhang, Jie Wang,
Beilei Xing
According to the X710 datasheet (and confirmed in the field...), X710
devices do not support outer checksum offload.
"""
8.4.4.2 Transmit L3 and L4 Integrity Offload
Tunneling UDP headers and GRE header are not offloaded while the
X710/XXV710/XL710 leaves their checksum field as is.
If a checksum is required, software should provide it as well as the inner
checksum value(s) that are required for the outer checksum.
"""
Fix Tx offload capabilities according to the hardware.
X722 may support such offload by setting I40E_TXD_CTX_QW0_L4T_CS_MASK.
Bugzilla ID: 1406
Fixes: 8cc79a1636cd ("net/i40e: fix forward outer IPv6 VXLAN")
Cc: stable@dpdk.org
Reported-by: Jun Wang <junwang01@cestc.cn>
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
Note: I do not have an X722 NIC. Intel devs, please check both the X710 and
X722 series.
Changes since v1:
- fix inverted check,
---
.mailmap | 1 +
drivers/net/i40e/i40e_ethdev.c | 6 +++++-
drivers/net/i40e/i40e_rxtx.c | 9 +++++++++
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/.mailmap b/.mailmap
index 3843868716..091766eca7 100644
--- a/.mailmap
+++ b/.mailmap
@@ -719,6 +719,7 @@ Junjie Wan <wanjunjie@bytedance.com>
Jun Qiu <jun.qiu@jaguarmicro.com>
Jun W Zhou <junx.w.zhou@intel.com>
Junxiao Shi <git@mail1.yoursunny.com>
+Jun Wang <junwang01@cestc.cn>
Jun Yang <jun.yang@nxp.com>
Junyu Jiang <junyux.jiang@intel.com>
Juraj Linkeš <juraj.linkes@pantheon.tech>
diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 380ce1a720..6535c7c178 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -3862,8 +3862,12 @@ i40e_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
RTE_ETH_TX_OFFLOAD_IPIP_TNL_TSO |
RTE_ETH_TX_OFFLOAD_GENEVE_TNL_TSO |
RTE_ETH_TX_OFFLOAD_MULTI_SEGS |
- RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM |
dev_info->tx_queue_offload_capa;
+ if (hw->mac.type == I40E_MAC_X722) {
+ dev_info->tx_offload_capa |=
+ RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM;
+ }
+
dev_info->dev_capa =
RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP |
RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP;
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5d25ab4d3a..b4f7599cfc 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -295,6 +295,15 @@ i40e_parse_tunneling_params(uint64_t ol_flags,
*/
*cd_tunneling |= (tx_offload.l2_len >> 1) <<
I40E_TXD_CTX_QW0_NATLEN_SHIFT;
+
+ /**
+ * Calculate the tunneling UDP checksum (only supported with X722).
+ * Shall be set only if L4TUNT = 01b and EIPT is not zero
+ */
+ if ((*cd_tunneling & I40E_TXD_CTX_QW0_EXT_IP_MASK) &&
+ (*cd_tunneling & I40E_TXD_CTX_UDP_TUNNELING) &&
+ (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM))
+ *cd_tunneling |= I40E_TXD_CTX_QW0_L4T_CS_MASK;
}
static inline void
--
2.44.0
* [PATCH v3 6/7] net/iavf: remove outer UDP checksum offload for X710 VF
[not found] ` <20240418082023.1767998-1-david.marchand@redhat.com>
` (2 preceding siblings ...)
2024-04-18 8:20 ` [PATCH v3 5/7] net/i40e: fix outer UDP checksum offload for X710 David Marchand
@ 2024-04-18 8:20 ` David Marchand
3 siblings, 0 replies; 30+ messages in thread
From: David Marchand @ 2024-04-18 8:20 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, stable, Jingjing Wu, Qi Zhang, Peng Zhang,
Zhichao Zeng
According to the X710 datasheet, X710 devices do not support outer
checksum offload.
"""
8.4.4.2 Transmit L3 and L4 Integrity Offload
Tunneling UDP headers and GRE header are not offloaded while the
X710/XXV710/XL710 leaves their checksum field as is.
If a checksum is required, software should provide it as well as the inner
checksum value(s) that are required for the outer checksum.
"""
Fix Tx offload capabilities depending on the VF type.
Bugzilla ID: 1406
Fixes: f7c8c36fdeb7 ("net/iavf: enable inner and outer Tx checksum offload")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
doc/guides/nics/features/iavf.ini | 2 +-
drivers/net/iavf/iavf_ethdev.c | 5 ++++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/doc/guides/nics/features/iavf.ini b/doc/guides/nics/features/iavf.ini
index c59115ae15..ce9860e963 100644
--- a/doc/guides/nics/features/iavf.ini
+++ b/doc/guides/nics/features/iavf.ini
@@ -33,7 +33,7 @@ L3 checksum offload = Y
L4 checksum offload = Y
Timestamp offload = Y
Inner L3 checksum = Y
-Inner L4 checksum = Y
+Inner L4 checksum = P
Packet type parsing = Y
Rx descriptor status = Y
Tx descriptor status = Y
diff --git a/drivers/net/iavf/iavf_ethdev.c b/drivers/net/iavf/iavf_ethdev.c
index 245b3cd854..bbf915097e 100644
--- a/drivers/net/iavf/iavf_ethdev.c
+++ b/drivers/net/iavf/iavf_ethdev.c
@@ -1174,7 +1174,6 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
RTE_ETH_TX_OFFLOAD_TCP_CKSUM |
RTE_ETH_TX_OFFLOAD_SCTP_CKSUM |
RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM |
- RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM |
RTE_ETH_TX_OFFLOAD_TCP_TSO |
RTE_ETH_TX_OFFLOAD_VXLAN_TNL_TSO |
RTE_ETH_TX_OFFLOAD_GRE_TNL_TSO |
@@ -1183,6 +1182,10 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
RTE_ETH_TX_OFFLOAD_MULTI_SEGS |
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
+ /* X710 does not support outer udp checksum */
+ if (adapter->hw.mac.type != IAVF_MAC_XL710)
+ dev_info->tx_offload_capa |= RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM;
+
if (vf->vf_res->vf_cap_flags & VIRTCHNL_VF_OFFLOAD_CRC)
dev_info->rx_offload_capa |= RTE_ETH_RX_OFFLOAD_KEEP_CRC;
--
2.44.0
* Re: [PATCH v3 3/7] app/testpmd: fix outer IP checksum offload
2024-04-18 8:20 ` [PATCH v3 3/7] app/testpmd: fix outer IP checksum offload David Marchand
@ 2024-06-11 18:25 ` Ferruh Yigit
0 siblings, 0 replies; 30+ messages in thread
From: Ferruh Yigit @ 2024-06-11 18:25 UTC (permalink / raw)
To: David Marchand, dev
Cc: thomas, stable, Aman Singh, Yuying Zhang, Olivier Matz,
Konstantin Ananyev, Tomasz Kulasek
On 4/18/2024 9:20 AM, David Marchand wrote:
> Resetting the outer IP checksum to 0 is not something mandated by the
> mbuf API and is done by rte_eth_tx_prepare(), or per driver if needed.
>
> Fixes: 4fb7e803eb1a ("ethdev: add Tx preparation")
> Cc: stable@dpdk.org
>
> Signed-off-by: David Marchand <david.marchand@redhat.com>
>
Acked-by: Ferruh Yigit <ferruh.yigit@amd.com>
Thread overview: 30+ messages
[not found] <20240405125039.897933-1-david.marchand@redhat.com>
2024-04-05 12:49 ` [PATCH 3/8] mbuf: fix Tx checksum offload examples David Marchand
2024-04-05 12:49 ` [PATCH 4/8] app/testpmd: fix outer IP checksum offload David Marchand
2024-04-05 12:49 ` [PATCH 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-05 12:49 ` [PATCH 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-05 12:49 ` [PATCH 7/8] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
[not found] ` <20240405144604.906695-1-david.marchand@redhat.com>
2024-04-05 14:45 ` [PATCH v2 3/8] mbuf: fix Tx checksum offload examples David Marchand
2024-04-05 16:20 ` Morten Brørup
2024-04-08 10:12 ` David Marchand
2024-04-09 13:38 ` Konstantin Ananyev
2024-04-09 14:44 ` Morten Brørup
2024-04-10 10:35 ` Konstantin Ananyev
2024-04-10 12:20 ` Morten Brørup
2024-04-12 12:46 ` Konstantin Ananyev
2024-04-12 14:44 ` Morten Brørup
2024-04-12 15:17 ` Konstantin Ananyev
2024-04-12 15:54 ` Morten Brørup
2024-04-16 9:16 ` Konstantin Ananyev
2024-04-16 11:36 ` Konstantin Ananyev
2024-04-15 15:07 ` Ferruh Yigit
2024-04-16 7:14 ` Morten Brørup
2024-04-16 9:26 ` Konstantin Ananyev
2024-04-05 14:45 ` [PATCH v2 4/8] app/testpmd: fix outer IP checksum offload David Marchand
2024-04-05 14:45 ` [PATCH v2 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-05 14:46 ` [PATCH v2 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-05 14:46 ` [PATCH v2 7/8] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
[not found] ` <20240418082023.1767998-1-david.marchand@redhat.com>
2024-04-18 8:20 ` [PATCH v3 3/7] app/testpmd: fix outer IP checksum offload David Marchand
2024-06-11 18:25 ` Ferruh Yigit
2024-04-18 8:20 ` [PATCH v3 4/7] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-18 8:20 ` [PATCH v3 5/7] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-18 8:20 ` [PATCH v3 6/7] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand