DPDK patches and discussions
 help / color / mirror / Atom feed
* [PATCH v01] net/af_packet: add rollover and defrag options
@ 2024-10-29 13:48 Gur Stavi
  2024-10-29 16:09 ` Stephen Hemminger
  2024-10-30  2:41 ` Stephen Hemminger
  0 siblings, 2 replies; 4+ messages in thread
From: Gur Stavi @ 2024-10-29 13:48 UTC (permalink / raw)
  To: Gur Stavi; +Cc: dev, John W. Linville, Stephen Hemminger, Ferruh Yigit

net_af_packet PMD multi "queue" support relies on Linux FANOUT capability.
Linux FANOUT is a SW based load balancer that is similar to HW RSS which is
more common for DPDK PMDs. Instead of multiple HW descriptor queues, AF PACKET
uses multiple sockets.
HW RSS will typically drop a packet if its selected RX queue is empty. However,
Linux FANOUT, as a SW load balancer, can be configured to avoid this packet
drop by rolling over to the next socket.
This rollover functionality was ALWAYS enabled in net_af_packet. It is
surrounded by ifdef, but only to allow compilation on ancient Linux versions
that did not have it.

Since DPDK applications are usually designed for HW based PMDs, this rollover
functionality, which the developers are likely unaware of, could be confusing.

Another option that is part of Linux FANOUT is DEFRAG that instructs Linux to
compose complete IP packet out of fragments before delivering it to the PACKET
socket. Again, this behavior typically does not exist for HW based PMDs and may
confuse users.

This patch adds 2 options to control these features:
rollover=[0|1],defrag=[0|1]
For backward compatibility both features are enabled by default even though most
users will probably want both of them disabled.

Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
---
 doc/guides/nics/af_packet.rst             |  4 ++
 drivers/net/af_packet/rte_eth_af_packet.c | 63 +++++++++++++++++------
 2 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst
index 66b977e1a2..4bc748b50b 100644
--- a/doc/guides/nics/af_packet.rst
+++ b/doc/guides/nics/af_packet.rst
@@ -29,6 +29,8 @@ Some of these, in turn, will be used to configure the PACKET_MMAP settings.
 *   ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: multiple
     of 16B);
 *   ``framecnt`` - PACKET_MMAP frame count (optional, default 512).
+*   ``rollover`` - disable(0)/enable(1) PACKET_FANOUT_ROLLOVER (optional, default 1).
+*   ``defrag`` - disable(0)/enable(1) PACKET_FANOUT_FLAG_DEFRAG (optional, default 1).
 
 Because this implementation is based on PACKET_MMAP, and PACKET_MMAP has its
 own pre-requisites, it should be noted that the inner workings of PACKET_MMAP
@@ -47,6 +49,8 @@ For the full details behind PACKET_MMAP's structures and settings, consider
 reading the `PACKET_MMAP documentation in the Kernel
 <https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt>`_.
 
+For details of ``rollover`` anb ``defrag`` refer to the *packet(7)* man page.
+
 Prerequisites
 -------------
 
diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 68870dd3e2..5ae76d2b2a 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -36,6 +36,8 @@
 #define ETH_AF_PACKET_FRAMESIZE_ARG	"framesz"
 #define ETH_AF_PACKET_FRAMECOUNT_ARG	"framecnt"
 #define ETH_AF_PACKET_QDISC_BYPASS_ARG	"qdisc_bypass"
+#define ETH_AF_PACKET_ROLLOVER		"rollover"
+#define ETH_AF_PACKET_DEFRAG		"defrag"
 
 #define DFLT_FRAME_SIZE		(1 << 11)
 #define DFLT_FRAME_COUNT	(1 << 9)
@@ -91,6 +93,8 @@ static const char *valid_arguments[] = {
 	ETH_AF_PACKET_FRAMESIZE_ARG,
 	ETH_AF_PACKET_FRAMECOUNT_ARG,
 	ETH_AF_PACKET_QDISC_BYPASS_ARG,
+	ETH_AF_PACKET_ROLLOVER,
+	ETH_AF_PACKET_DEFRAG,
 	NULL
 };
 
@@ -676,6 +680,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 	const unsigned int numa_node = dev->device.numa_node;
 	struct rte_eth_dev_data *data = NULL;
 	struct rte_kvargs_pair *pair = NULL;
+	const char *ifname = NULL;
 	struct ifreq ifr;
 	size_t ifnamelen;
 	unsigned k_idx;
@@ -686,6 +691,8 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 	int rc, tpver, discard;
 	int qsockfd = -1;
 	unsigned int i, q, rdsize;
+	unsigned int rollover = 1;
+	unsigned int defrag = 1;
 #if defined(PACKET_FANOUT)
 	int fanout_arg;
 #endif
@@ -693,9 +700,30 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
 		pair = &kvlist->pairs[k_idx];
 		if (strstr(pair->key, ETH_AF_PACKET_IFACE_ARG) != NULL)
-			break;
+			ifname = pair->value;
+		if (strstr(pair->key, ETH_AF_PACKET_ROLLOVER) != NULL) {
+			rollover = atoi(pair->value);
+			if (rollover != 0 && rollover != 1) {
+				PMD_LOG(ERR,
+					"%s: invalid rollover value",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+		if (strstr(pair->key, ETH_AF_PACKET_DEFRAG) != NULL) {
+			defrag = atoi(pair->value);
+			if (defrag != 0 && defrag != 1) {
+				PMD_LOG(ERR,
+					"%s: invalid defrag value",
+				        name);
+				return -1;
+			}
+			continue;
+		}
 	}
-	if (pair == NULL) {
+
+	if (ifname == NULL) {
 		PMD_LOG(ERR,
 			"%s: no interface specified for AF_PACKET ethdev",
 		        name);
@@ -738,21 +766,21 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 	req->tp_frame_size = framesize;
 	req->tp_frame_nr = framecnt;
 
-	ifnamelen = strlen(pair->value);
+	ifnamelen = strlen(ifname);
 	if (ifnamelen < sizeof(ifr.ifr_name)) {
-		memcpy(ifr.ifr_name, pair->value, ifnamelen);
+		memcpy(ifr.ifr_name, ifname, ifnamelen);
 		ifr.ifr_name[ifnamelen] = '\0';
 	} else {
 		PMD_LOG(ERR,
 			"%s: I/F name too long (%s)",
-			name, pair->value);
+			name, ifname);
 		goto free_internals;
 	}
 	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
 		PMD_LOG_ERRNO(ERR, "%s: ioctl failed (SIOCGIFINDEX)", name);
 		goto free_internals;
 	}
-	(*internals)->if_name = strdup(pair->value);
+	(*internals)->if_name = strdup(ifname);
 	if ((*internals)->if_name == NULL)
 		goto free_internals;
 	(*internals)->if_index = ifr.ifr_ifindex;
@@ -770,9 +798,12 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 
 #if defined(PACKET_FANOUT)
 	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
-	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG) << 16;
+	fanout_arg |= PACKET_FANOUT_HASH << 16;
+	if (defrag)
+		fanout_arg |= PACKET_FANOUT_FLAG_DEFRAG << 16;
 #if defined(PACKET_FANOUT_FLAG_ROLLOVER)
-	fanout_arg |= PACKET_FANOUT_FLAG_ROLLOVER << 16;
+	if (rollover)
+		fanout_arg |= PACKET_FANOUT_FLAG_ROLLOVER << 16;
 #endif
 #endif
 
@@ -792,7 +823,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not set PACKET_VERSION on AF_PACKET socket for %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 
@@ -802,7 +833,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not set PACKET_LOSS on AF_PACKET socket for %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 
@@ -813,7 +844,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 			if (rc == -1) {
 				PMD_LOG_ERRNO(ERR,
 					"%s: could not set PACKET_QDISC_BYPASS on AF_PACKET socket for %s",
-					name, pair->value);
+					name, ifname);
 				goto error;
 			}
 #endif
@@ -823,7 +854,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not set PACKET_RX_RING on AF_PACKET socket for %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 
@@ -831,7 +862,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not set PACKET_TX_RING on AF_PACKET "
-				"socket for %s", name, pair->value);
+				"socket for %s", name, ifname);
 			goto error;
 		}
 
@@ -844,7 +875,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rx_queue->map == MAP_FAILED) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: call to mmap failed on AF_PACKET socket for %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 
@@ -881,7 +912,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not bind AF_PACKET socket to %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 
@@ -891,7 +922,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not set PACKET_FANOUT on AF_PACKET socket for %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 #endif

base-commit: 98613d32e3dac58d685f4f236cf8cc9733abaaf3
-- 
2.40.1


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v01] net/af_packet: add rollover and defrag options
  2024-10-29 13:48 [PATCH v01] net/af_packet: add rollover and defrag options Gur Stavi
@ 2024-10-29 16:09 ` Stephen Hemminger
  2024-10-29 18:27   ` Gur Stavi
  2024-10-30  2:41 ` Stephen Hemminger
  1 sibling, 1 reply; 4+ messages in thread
From: Stephen Hemminger @ 2024-10-29 16:09 UTC (permalink / raw)
  To: Gur Stavi; +Cc: dev, John W. Linville, Ferruh Yigit

On Tue, 29 Oct 2024 15:48:05 +0200
Gur Stavi <gur.stavi@huawei.com> wrote:

> net_af_packet PMD multi "queue" support relies on Linux FANOUT capability.
> Linux FANOUT is a SW based load balancer that is similar to HW RSS which is
> more common for DPDK PMDs. Instead of multiple HW descriptor queues, AF PACKET
> uses multiple sockets.
> HW RSS will typically drop a packet if its selected RX queue is empty. However,
> Linux FANOUT, as a SW load balancer, can be configured to avoid this packet
> drop by rolling over to the next socket.
> This rollover functionality was ALWAYS enabled in net_af_packet. It is
> surrounded by ifdef, but only to allow compilation on ancient Linux versions
> that did not have it.
> 
> Since DPDK applications are usually designed for HW based PMDs, this rollover
> functionality, which the developers are likely unaware of, could be confusing.
> 
> Another option that is part of Linux FANOUT is DEFRAG that instructs Linux to
> compose complete IP packet out of fragments before delivering it to the PACKET
> socket. Again, this behavior typically does not exist for HW based PMDs and may
> confuse users.
> 
> This patch adds 2 options to control these features:
> rollover=[0|1],defrag=[0|1]
> For backward compatibility both features are enabled by default even though most
> users will probably want both of them disabled.
> 
> Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
> ---

Makes sense to expose kernel options. But have all combinations been tested?

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: [PATCH v01] net/af_packet: add rollover and defrag options
  2024-10-29 16:09 ` Stephen Hemminger
@ 2024-10-29 18:27   ` Gur Stavi
  0 siblings, 0 replies; 4+ messages in thread
From: Gur Stavi @ 2024-10-29 18:27 UTC (permalink / raw)
  To: 'Stephen Hemminger'; +Cc: dev, John W. Linville, Ferruh Yigit

> On Tue, 29 Oct 2024 15:48:05 +0200
> Gur Stavi <gur.stavi@huawei.com> wrote:
> 
> > net_af_packet PMD multi "queue" support relies on Linux FANOUT
> capability.
> > Linux FANOUT is a SW based load balancer that is similar to HW RSS
> which is
> > more common for DPDK PMDs. Instead of multiple HW descriptor queues, AF
> PACKET
> > uses multiple sockets.
> > HW RSS will typically drop a packet if its selected RX queue is empty.
> However,
> > Linux FANOUT, as a SW load balancer, can be configured to avoid this
> packet
> > drop by rolling over to the next socket.
> > This rollover functionality was ALWAYS enabled in net_af_packet. It is
> > surrounded by ifdef, but only to allow compilation on ancient Linux
> versions
> > that did not have it.
> >
> > Since DPDK applications are usually designed for HW based PMDs, this
> rollover
> > functionality, which the developers are likely unaware of, could be
> confusing.
> >
> > Another option that is part of Linux FANOUT is DEFRAG that instructs
> Linux to
> > compose complete IP packet out of fragments before delivering it to the
> PACKET
> > socket. Again, this behavior typically does not exist for HW based PMDs
> and may
> > confuse users.
> >
> > This patch adds 2 options to control these features:
> > rollover=[0|1],defrag=[0|1]
> > For backward compatibility both features are enabled by default even
> though most
> > users will probably want both of them disabled.
> >
> > Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
> > ---
> 
> Makes sense to expose kernel options. But have all combinations been
> tested?

I tested the command line parsing of the new options.
I tested that the PMD is initialized successfully with all combinations.
I did not test if Linux actually behaves according to the man page. I did
not test traffic at all.



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v01] net/af_packet: add rollover and defrag options
  2024-10-29 13:48 [PATCH v01] net/af_packet: add rollover and defrag options Gur Stavi
  2024-10-29 16:09 ` Stephen Hemminger
@ 2024-10-30  2:41 ` Stephen Hemminger
  1 sibling, 0 replies; 4+ messages in thread
From: Stephen Hemminger @ 2024-10-30  2:41 UTC (permalink / raw)
  To: Gur Stavi; +Cc: dev, John W. Linville, Ferruh Yigit

On Tue, 29 Oct 2024 15:48:05 +0200
Gur Stavi <gur.stavi@huawei.com> wrote:

> +		if (strstr(pair->key, ETH_AF_PACKET_ROLLOVER) != NULL) {
> +			rollover = atoi(pair->value);
> +			if (rollover != 0 && rollover != 1) {
> +				PMD_LOG(ERR,
> +					"%s: invalid rollover value",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}

The problem is that atoi() provides little to no error handling.
Prefer using strtoul() and/or having a common routine for parsing flag values.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-10-30  2:41 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-29 13:48 [PATCH v01] net/af_packet: add rollover and defrag options Gur Stavi
2024-10-29 16:09 ` Stephen Hemminger
2024-10-29 18:27   ` Gur Stavi
2024-10-30  2:41 ` Stephen Hemminger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).