DPDK patches and discussions
* [dpdk-dev] [RFC 0/2] TAP TSO Implementation
@ 2018-03-09 21:10 Ophir Munk
  2018-03-09 21:10 ` [dpdk-dev] [RFC 1/2] net/tap: calculate checksum for multi segs packets Ophir Munk
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Ophir Munk @ 2018-03-09 21:10 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

This RFC suggests a TAP TSO (TCP segmentation offload) implementation in SW.
It uses the DPDK rte_gso library, which is also used by testpmd.
The DPDK rte_gso library segments large TCP payloads (e.g. 64K bytes)
into smaller MTU-sized buffers.
By supporting the TSO offload capability in software, a TAP device can be
used as a fail-safe sub-device and be paired with another PCI device which
supports TSO capability in HW.

This RFC includes 2 commits:
1. Calculation of IP/TCP/UDP checksums for multi-segment packets.
Previously, checksum offload was skipped if the number of packet segments
was greater than 1.
This commit removes this limitation. It is required before supporting TAP
TSO, since the generated small TCP packets may themselves be composed of
more than one segment.
2. Core TAP TSO implementation: calling rte_gso_segment() to segment
large TCP packets.
Still to be added: creation of a small private mbuf pool in TAP, required
by librte_gso. The pool will contain 64 buffers, each 128 bytes long; a
sketch of such a pool creation is shown below.
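
For illustration, a minimal sketch of how such a pool could be created;
the macro and helper names used here are placeholders, not part of this RFC:

#include <rte_mbuf.h>
#include <rte_mempool.h>

#define TAP_GSO_MBUFS_NUM     64   /* 64 mbufs in the private pool */
#define TAP_GSO_MBUF_SEG_SIZE 128  /* 128 bytes of data room per mbuf */

/* Create the small pool librte_gso needs for both its direct (header)
 * and indirect (payload) output mbufs.
 */
static struct rte_mempool *
tap_gso_pool_create(const char *name)
{
	return rte_pktmbuf_pool_create(name,
			TAP_GSO_MBUFS_NUM,
			0,  /* no per-core cache */
			0,  /* no application private area */
			RTE_PKTMBUF_HEADROOM + TAP_GSO_MBUF_SEG_SIZE,
			SOCKET_ID_ANY);
}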

Ophir Munk (2):
  net/tap: calculate checksum for multi segs packets
  net/tap: implement TAP TSO

 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 183 +++++++++++++++++++++++++++++++++---------
 drivers/net/tap/rte_eth_tap.h |   4 +
 3 files changed, 150 insertions(+), 39 deletions(-)

-- 
2.7.4


* [dpdk-dev] [RFC 1/2] net/tap: calculate checksum for multi segs packets
  2018-03-09 21:10 [dpdk-dev] [RFC 0/2] TAP TSO Implementation Ophir Munk
@ 2018-03-09 21:10 ` Ophir Munk
  2018-04-09 22:33   ` [dpdk-dev] [PATCH v1 0/2] TAP TSO Ophir Munk
  2018-06-12 16:31   ` [dpdk-dev] [PATCH v4 0/2] TAP TSO Ophir Munk
  2018-03-09 21:10 ` [dpdk-dev] [RFC 2/2] net/tap: implement TAP TSO Ophir Munk
  2018-04-09 16:38 ` [dpdk-dev] [RFC 0/2] TAP TSO Implementation Ferruh Yigit
  2 siblings, 2 replies; 31+ messages in thread
From: Ophir Munk @ 2018-03-09 21:10 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

In the past TAP implementation, checksum offload calculations (for
IP/UDP/TCP) were skipped in the case of a multi-segment packet.
This commit improves TAP functionality by enabling checksum calculations
for multi-segment packets.
The only restriction now is that the first segment must contain all
headers of layers 2, 3 and 4 (where the layer 4 header size is taken as
that of TCP). A sketch of the resulting writev() layout is shown below.
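
For illustration, a hypothetical helper (error handling stripped, names are
placeholders) showing the iovec layout built for a 2-segment mbuf whose
checksums were patched into a copy of the headers:

#include <sys/uio.h>
#include <linux/if_tun.h>
#include <rte_mbuf.h>

/* Write one 2-segment mbuf to the tap fd: prefix + patched header
 * copy + the rest of the data, without copying any payload.
 */
static ssize_t
tap_writev_sketch(int fd, struct tun_pi *pi,
		  char *hdrs, size_t hdrs_len, struct rte_mbuf *mbuf)
{
	struct iovec iovecs[4];

	iovecs[0].iov_base = pi;      /* TUN/TAP packet-info prefix */
	iovecs[0].iov_len  = sizeof(*pi);
	iovecs[1].iov_base = hdrs;    /* l2+l3+l4 header copy, checksums set */
	iovecs[1].iov_len  = hdrs_len;
	/* rest of the first segment, past the copied headers */
	iovecs[2].iov_base = rte_pktmbuf_mtod(mbuf, char *) + hdrs_len;
	iovecs[2].iov_len  = rte_pktmbuf_data_len(mbuf) - hdrs_len;
	/* second segment, sent as-is */
	iovecs[3].iov_base = rte_pktmbuf_mtod(mbuf->next, void *);
	iovecs[3].iov_len  = rte_pktmbuf_data_len(mbuf->next);
	return writev(fd, iovecs, 4); /* one syscall for the whole frame */
}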

Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/rte_eth_tap.c | 42 ++++++++++++++++++++++++++++++++----------
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index f09db0e..f312084 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -496,6 +496,9 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		char m_copy[mbuf->data_len];
 		int n;
 		int j;
+		int k; /* first index in iovecs for copying segments */
+		uint16_t l234_len; /* length of layers 2,3,4 headers */
+		uint16_t seg_len; /* length of first segment */
 
 		/* stats.errs will be incremented */
 		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
@@ -503,25 +506,44 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 
 		iovecs[0].iov_base = &pi;
 		iovecs[0].iov_len = sizeof(pi);
-		for (j = 1; j <= mbuf->nb_segs; j++) {
-			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
-			iovecs[j].iov_base =
-				rte_pktmbuf_mtod(seg, void *);
-			seg = seg->next;
-		}
+		k = 1;
 		if (txq->csum &&
 		    ((mbuf->ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4) ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM))) {
-			/* Support only packets with all data in the same seg */
-			if (mbuf->nb_segs > 1)
+			/* Only support packets with at least layer 4
+			 * header included in the first segment
+			 */
+			seg_len = rte_pktmbuf_data_len(mbuf);
+			l234_len = mbuf->l2_len + mbuf->l3_len +
+				sizeof(struct tcp_hdr);
+			if (seg_len < l234_len)
 				break;
-			/* To change checksums, work on a copy of data. */
+
+			/* To change checksums, work on a
+			 * copy of l2, l3, l4 headers.
+			 */
 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
-				   rte_pktmbuf_data_len(mbuf));
+					l234_len);
 			tap_tx_offload(m_copy, mbuf->ol_flags,
 				       mbuf->l2_len, mbuf->l3_len);
 			iovecs[1].iov_base = m_copy;
+			iovecs[1].iov_len = l234_len;
+			k++;
+			/* Adjust data pointer beyond l2, l3, l4 headers.
+			 * If this segment becomes empty - skip it
+			 */
+			if (seg_len > l234_len) {
+				rte_pktmbuf_adj(mbuf, l234_len);
+			} else {
+				seg = seg->next;
+				mbuf->nb_segs--;
+			}
+		}
+		for (j = k; j <= mbuf->nb_segs; j++) {
+			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
+			iovecs[j].iov_base = rte_pktmbuf_mtod(seg, void *);
+			seg = seg->next;
 		}
 		/* copy the tx frame data */
 		n = writev(txq->fd, iovecs, mbuf->nb_segs + 1);
-- 
2.7.4


* [dpdk-dev] [RFC 2/2] net/tap: implement TAP TSO
  2018-03-09 21:10 [dpdk-dev] [RFC 0/2] TAP TSO Implementation Ophir Munk
  2018-03-09 21:10 ` [dpdk-dev] [RFC 1/2] net/tap: calculate checksum for multi segs packets Ophir Munk
@ 2018-03-09 21:10 ` Ophir Munk
  2018-04-09 16:38 ` [dpdk-dev] [RFC 0/2] TAP TSO Implementation Ferruh Yigit
  2 siblings, 0 replies; 31+ messages in thread
From: Ophir Munk @ 2018-03-09 21:10 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

This commit implements TCP segmentation offload in TAP.
The DPDK rte_gso library is used to segment large TCP payloads (e.g. 64K
bytes) into smaller MTU-sized buffers.
By supporting the TSO offload capability in software, a TAP device can be
used as a fail-safe sub-device and be paired with another PCI device which
supports TSO capability in HW.

For more details on the librte_gso implementation, please refer to the
DPDK documentation.
The number of newly generated TSO segments is limited to 64; the core call
is condensed in the sketch below.
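
Condensed from the diff below (names as in the patch), the core call is
roughly:

#include <rte_gso.h>

struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS];  /* MAX_GSO_MBUFS == 64 */
int ret;

ret = rte_gso_segment(mbuf_in,        /* large TSO packet to split */
		      &txq->gso_ctx,  /* pools, gso_types, gso_size */
		      gso_mbufs,      /* output array */
		      RTE_DIM(gso_mbufs));
if (ret < 0) {
	/* segmentation failed: stop and account the error */
} else {
	/* ret output mbufs were created: write each one to the
	 * tap fd, then free it
	 */
}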

Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 157 ++++++++++++++++++++++++++++++++----------
 drivers/net/tap/rte_eth_tap.h |   4 ++
 3 files changed, 126 insertions(+), 37 deletions(-)

diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile
index ccc5c5f..3243365 100644
--- a/drivers/net/tap/Makefile
+++ b/drivers/net/tap/Makefile
@@ -24,7 +24,7 @@ CFLAGS += -I.
 CFLAGS += $(WERROR_FLAGS)
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
 LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash
-LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_bus_vdev -lrte_gso
 
 CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES)
 
diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index f312084..4dda100 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -473,40 +473,37 @@ tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
 	}
 }
 
-/* Callback to handle sending packets from the tap interface
- */
-static uint16_t
-pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+static void
+tap_mbuf_pool_create(struct rte_mempool **mp)
 {
-	struct tx_queue *txq = queue;
-	uint16_t num_tx = 0;
-	unsigned long num_tx_bytes = 0;
-	uint32_t max_size;
-	int i;
+	*mp = NULL; /* TODO - create mp */
+}
 
-	if (unlikely(nb_pkts == 0))
-		return 0;
+static inline void
+tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs,
+			struct rte_mbuf **pmbufs,
+			uint16_t *num_packets, unsigned long *num_tx_bytes)
+{
+	int i;
 
-	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
-	for (i = 0; i < nb_pkts; i++) {
-		struct rte_mbuf *mbuf = bufs[num_tx];
-		struct iovec iovecs[mbuf->nb_segs + 1];
+	for (i = 0; i < num_mbufs; i++) {
+		struct rte_mbuf *mbuf = pmbufs[i];
+		struct iovec iovecs[mbuf->nb_segs + 2];
 		struct tun_pi pi = { .flags = 0 };
 		struct rte_mbuf *seg = mbuf;
 		char m_copy[mbuf->data_len];
 		int n;
 		int j;
-		int k; /* first index in iovecs for copying segments */
+		int k; /* current index in iovecs for copying segments */
 		uint16_t l234_len; /* length of layers 2,3,4 headers */
 		uint16_t seg_len; /* length of first segment */
+		uint16_t nb_segs;
 
-		/* stats.errs will be incremented */
-		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
-			break;
-
-		iovecs[0].iov_base = &pi;
-		iovecs[0].iov_len = sizeof(pi);
-		k = 1;
+		k = 0;
+		iovecs[k].iov_base = &pi;
+		iovecs[k].iov_len = sizeof(pi);
+		k++;
+		nb_segs = mbuf->nb_segs;
 		if (txq->csum &&
 		    ((mbuf->ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4) ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM ||
@@ -523,39 +520,99 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			/* To change checksums, work on a
 			 * copy of l2, l3, l4 headers.
 			 */
-			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
-					l234_len);
+			rte_memcpy(m_copy,
+				rte_pktmbuf_mtod(mbuf, void *), l234_len);
 			tap_tx_offload(m_copy, mbuf->ol_flags,
 				       mbuf->l2_len, mbuf->l3_len);
-			iovecs[1].iov_base = m_copy;
-			iovecs[1].iov_len = l234_len;
+			iovecs[k].iov_base = m_copy;
+			iovecs[k].iov_len = l234_len;
 			k++;
+
 			/* Adjust data pointer beyond l2, l3, l4 headers.
 			 * If this segment becomes empty - skip it
 			 */
 			if (seg_len > l234_len) {
-				rte_pktmbuf_adj(mbuf, l234_len);
-			} else {
-				seg = seg->next;
-				mbuf->nb_segs--;
+				iovecs[k].iov_len = seg_len - l234_len;
+				iovecs[k].iov_base =
+					rte_pktmbuf_mtod(seg, char *) +
+						l234_len;
+				k++;
+			} else { /* seg_len == l234_len */
+				nb_segs--;
 			}
+
+			seg = seg->next;
 		}
-		for (j = k; j <= mbuf->nb_segs; j++) {
+		for (j = k; j <= nb_segs; j++) {
 			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
 			iovecs[j].iov_base = rte_pktmbuf_mtod(seg, void *);
 			seg = seg->next;
 		}
 		/* copy the tx frame data */
-		n = writev(txq->fd, iovecs, mbuf->nb_segs + 1);
+		n = writev(txq->fd, iovecs, j);
 		if (n <= 0)
 			break;
+		(*num_packets)++;
+		(*num_tx_bytes) += rte_pktmbuf_pkt_len(mbuf);
+	}
+}
 
+/* Callback to handle sending packets from the tap interface
+ */
+static uint16_t
+pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct tx_queue *txq = queue;
+	uint16_t num_tx = 0;
+	uint16_t num_packets = 0;
+	unsigned long num_tx_bytes = 0;
+	uint32_t max_size;
+	int i;
+	uint64_t tso;
+	int ret;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS];
+	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
+	for (i = 0; i < nb_pkts; i++) {
+		struct rte_mbuf *mbuf_in = bufs[num_tx];
+		struct rte_mbuf **mbuf;
+		uint16_t num_mbufs;
+
+		tso = mbuf_in->ol_flags & PKT_TX_TCP_SEG;
+		if (tso) {
+			struct rte_gso_ctx *gso_ctx = &txq->gso_ctx;
+			/* gso size is calculated without ETHER_CRC_LEN */
+			gso_ctx->gso_size = *txq->mtu + ETHER_HDR_LEN;
+			ret = rte_gso_segment(mbuf_in, /* packet to segment */
+				gso_ctx, /* gso control block */
+				(struct rte_mbuf **)&gso_mbufs, /* out mbufs */
+				RTE_DIM(gso_mbufs)); /* max tso mbufs */
+
+			/* ret contains the number of newly created mbufs */
+			if (ret < 0)
+				break;
+
+			mbuf = gso_mbufs;
+			num_mbufs = ret;
+		} else {
+			/* stats.errs will be incremented */
+			if (rte_pktmbuf_pkt_len(mbuf_in) > max_size)
+				break;
+
+			mbuf = &mbuf_in;
+			num_mbufs = 1;
+		}
+
+		tap_write_mbufs(txq, num_mbufs, mbuf,
+				&num_packets, &num_tx_bytes);
 		num_tx++;
-		num_tx_bytes += mbuf->pkt_len;
-		rte_pktmbuf_free(mbuf);
+		rte_pktmbuf_free(mbuf_in);
 	}
 
-	txq->stats.opackets += num_tx;
+	txq->stats.opackets += num_packets;
 	txq->stats.errs += nb_pkts - num_tx;
 	txq->stats.obytes += num_tx_bytes;
 
@@ -996,11 +1053,35 @@ tap_mac_set(struct rte_eth_dev *dev, struct ether_addr *mac_addr)
 }
 
 static int
+tap_init_gso_ctx(struct tx_queue *tx)
+{
+	uint32_t gso_types;
+
+	/* Create a private mbuf pool with 128-byte mbufs;
+	 * use this pool for both direct and indirect mbufs
+	 */
+	struct rte_mempool *mp;      /* Mempool for TX/GSO packets */
+	tap_mbuf_pool_create(&mp); /* tx->mp or maybe embedded in gso_ctx */
+
+	/* initialize GSO context */
+	gso_types = DEV_TX_OFFLOAD_TCP_TSO | DEV_TX_OFFLOAD_VXLAN_TNL_TSO |
+		DEV_TX_OFFLOAD_GRE_TNL_TSO;
+	tx->gso_ctx.direct_pool = mp;
+	tx->gso_ctx.indirect_pool = mp;
+	tx->gso_ctx.gso_types = gso_types;
+	tx->gso_ctx.gso_size = ETHER_MAX_LEN - ETHER_CRC_LEN;
+	tx->gso_ctx.flag = 0;
+
+	return 0;
+}
+
+static int
 tap_setup_queue(struct rte_eth_dev *dev,
 		struct pmd_internals *internals,
 		uint16_t qid,
 		int is_rx)
 {
+	int ret;
 	int *fd;
 	int *other_fd;
 	const char *dir;
@@ -1048,6 +1129,10 @@ tap_setup_queue(struct rte_eth_dev *dev,
 	tx->mtu = &dev->data->mtu;
 	rx->rxmode = &dev->data->dev_conf.rxmode;
 
+	ret = tap_init_gso_ctx(tx);
+	if (ret)
+		return -1;
+
 	return *fd;
 }
 
diff --git a/drivers/net/tap/rte_eth_tap.h b/drivers/net/tap/rte_eth_tap.h
index 53a506a..65da5f8 100644
--- a/drivers/net/tap/rte_eth_tap.h
+++ b/drivers/net/tap/rte_eth_tap.h
@@ -15,6 +15,7 @@
 
 #include <rte_ethdev_driver.h>
 #include <rte_ether.h>
+#include <rte_gso.h>
 
 #ifdef IFF_MULTI_QUEUE
 #define RTE_PMD_TAP_MAX_QUEUES	TAP_MAX_QUEUES
@@ -22,6 +23,8 @@
 #define RTE_PMD_TAP_MAX_QUEUES	1
 #endif
 
+#define MAX_GSO_MBUFS 64
+
 struct pkt_stats {
 	uint64_t opackets;              /* Number of output packets */
 	uint64_t ipackets;              /* Number of input packets */
@@ -50,6 +53,7 @@ struct tx_queue {
 	uint16_t *mtu;                  /* Pointer to MTU from dev_data */
 	uint16_t csum:1;                /* Enable checksum offloading */
 	struct pkt_stats stats;         /* Stats for this TX queue */
+	struct rte_gso_ctx gso_ctx;     /* GSO context */
 };
 
 struct pmd_internals {
-- 
2.7.4


* Re: [dpdk-dev] [RFC 0/2] TAP TSO Implementation
  2018-03-09 21:10 [dpdk-dev] [RFC 0/2] TAP TSO Implementation Ophir Munk
  2018-03-09 21:10 ` [dpdk-dev] [RFC 1/2] net/tap: calculate checksum for multi segs packets Ophir Munk
  2018-03-09 21:10 ` [dpdk-dev] [RFC 2/2] net/tap: implement TAP TSO Ophir Munk
@ 2018-04-09 16:38 ` Ferruh Yigit
  2018-04-09 22:37   ` Ophir Munk
  2 siblings, 1 reply; 31+ messages in thread
From: Ferruh Yigit @ 2018-04-09 16:38 UTC (permalink / raw)
  To: Ophir Munk, dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern

On 3/9/2018 9:10 PM, Ophir Munk wrote:
> This RFC suggests TAP TSO (TCP segmentation offload) implementation in SW.
> It uses dpdk library rte_gso which is also used by testpmd.
> Dpdk rte_gso library segments large TCP payloads (e.g. 64K bytes)
> into smaller MTU size buffers.
> By supporting TSO offload capability in software a TAP device can be used
> as a failsafe sub device and be paired with another PCI device which
> supports TSO capability in HW.
> 
> This RFC includes 2 commits:
> 1. Calculation of IP/TCP/UDP checksums for multi segments packets.
> Previously checksum offload was skipped if the number of packet segments
> was greater than 1.
> This commit removes this limitation. It is required before supporting TAP TSO
> since the generated small TCP packets may be composed by themselves by more than
> one segment.
> 2. Core TAP TSO implementation: calling rte_gso_segment() segments
> large TCP packets.
> To be added: creation of a small private mbuf pool in TAP required by librte_gso.
> The number of buffers will be 64 - each of 128 bytes length.
> 
> Ophir Munk (2):
>   net/tap: calculate checksum for multi segs packets
>   net/tap: implement TAP TSO

This is an RFC, and a v1 of the patch has not been sent. Is this still valid
for this release, or should we push it into the next one?


* [dpdk-dev] [PATCH v1 0/2] TAP TSO
  2018-03-09 21:10 ` [dpdk-dev] [RFC 1/2] net/tap: calculate checksum for multi segs packets Ophir Munk
@ 2018-04-09 22:33   ` Ophir Munk
  2018-04-09 22:33     ` [dpdk-dev] [PATCH v1 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
  2018-04-09 22:33     ` [dpdk-dev] [PATCH v1 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  2018-06-12 16:31   ` [dpdk-dev] [PATCH v4 0/2] TAP TSO Ophir Munk
  1 sibling, 2 replies; 31+ messages in thread
From: Ophir Munk @ 2018-04-09 22:33 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

This patch implements TAP TSO (TCP segmentation offload) in SW.
It uses the DPDK librte_gso library.
The librte_gso library segments large TCP payloads (e.g. 64K bytes)
into smaller-sized buffers.
By supporting the TSO offload capability in software, a TAP device can be
used as a fail-safe sub-device and be paired with another PCI device which
supports TSO capability in HW.

This patch includes 2 commits:
1. Calculation of IP/TCP/UDP checksums for multi-segment packets.
Previously, checksum offload was skipped if the number of packet segments
was greater than 1.
This commit removes this limitation. It is required before supporting TAP
TSO, since a generated TCP TSO packet may be composed of two segments,
where the first segment includes all headers up to layer 4 with their
calculated checksums (this is how librte_gso builds TCP segments; see the
sketch just below).
2. TAP TSO implementation: calling rte_gso_segment() to segment large TCP
packets.
This commit creates a small private mbuf pool in the TAP PMD, required by
librte_gso. The pool contains 64 buffers, each 128 bytes long.
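
The two-part output packets librte_gso emits can be walked like any other
mbuf chain; a minimal sketch (out_pkt is an illustrative name):

/* Each GSO output packet is typically a 2-segment chain:
 *   seg 0: direct mbuf (from the private pool) holding the rebuilt
 *          l2/l3/l4 headers
 *   seg 1: indirect mbuf referencing the original payload data
 */
struct rte_mbuf *seg;
uint32_t bytes = 0;

for (seg = out_pkt; seg != NULL; seg = seg->next)
	bytes += rte_pktmbuf_data_len(seg);
/* bytes now equals rte_pktmbuf_pkt_len(out_pkt) */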

Ophir Munk (2):
  net/tap: calculate checksums of multi segs packets
  net/tap: support TSO (TCP Segment Offload)

 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 205 ++++++++++++++++++++++++++++++++++--------
 drivers/net/tap/rte_eth_tap.h |   4 +
 mk/rte.app.mk                 |   4 +-
 4 files changed, 174 insertions(+), 41 deletions(-)

-- 
2.7.4


* [dpdk-dev] [PATCH v1 1/2] net/tap: calculate checksums of multi segs packets
  2018-04-09 22:33   ` [dpdk-dev] [PATCH v1 0/2] TAP TSO Ophir Munk
@ 2018-04-09 22:33     ` Ophir Munk
  2018-04-09 22:33     ` [dpdk-dev] [PATCH v1 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  1 sibling, 0 replies; 31+ messages in thread
From: Ophir Munk @ 2018-04-09 22:33 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

Prior to this commit, IP/UDP/TCP checksum offload calculations
were skipped in the case of a multi-segment packet.
This commit enables TAP checksum calculations for multi-segment
packets.
The only restriction is that the first segment must contain all
headers of layers 2, 3 and 4 (where the layer 4 header size equals
the TCP header size).

Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/rte_eth_tap.c | 54 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 40 insertions(+), 14 deletions(-)

diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index 61d6465..df23c4d 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -509,6 +509,10 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		char m_copy[mbuf->data_len];
 		int n;
 		int j;
+		int k; /* first index in iovecs for copying segments */
+		uint16_t l234_len; /* length of layers 2,3,4 headers */
+		uint16_t seg_len; /* length of first segment */
+		uint16_t nb_segs;
 
 		/* stats.errs will be incremented */
 		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
@@ -529,30 +533,52 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		if (j & (0x40 | 0x60))
 			pi.proto = (j == 0x40) ? 0x0008 : 0xdd86;
 
-		iovecs[0].iov_base = &pi;
-		iovecs[0].iov_len = sizeof(pi);
-		for (j = 1; j <= mbuf->nb_segs; j++) {
-			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
-			iovecs[j].iov_base =
-				rte_pktmbuf_mtod(seg, void *);
-			seg = seg->next;
-		}
+		k = 0;
+		iovecs[k].iov_base = &pi;
+		iovecs[k].iov_len = sizeof(pi);
+		k++;
+		nb_segs = mbuf->nb_segs;
 		if (txq->csum &&
 		    ((mbuf->ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4) ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM))) {
-			/* Support only packets with all data in the same seg */
-			if (mbuf->nb_segs > 1)
+			/* Support only packets with at least layer 4
+			 * header included in the first segment
+			 */
+			seg_len = rte_pktmbuf_data_len(mbuf);
+			l234_len = mbuf->l2_len + mbuf->l3_len +
+				sizeof(struct tcp_hdr);
+			if (seg_len < l234_len)
 				break;
-			/* To change checksums, work on a copy of data. */
+
+			/* To change checksums, work on a
+			 * copy of l2, l3, l4 headers.
+			 */
 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
-				   rte_pktmbuf_data_len(mbuf));
+					l234_len);
 			tap_tx_offload(m_copy, mbuf->ol_flags,
 				       mbuf->l2_len, mbuf->l3_len);
-			iovecs[1].iov_base = m_copy;
+			iovecs[k].iov_base = m_copy;
+			iovecs[k].iov_len = l234_len;
+			k++;
+			/* Update next iovecs[] beyond l2, l3, l4 headers */
+			if (seg_len > l234_len) {
+				iovecs[k].iov_len = seg_len - l234_len;
+				iovecs[k].iov_base =
+					rte_pktmbuf_mtod(seg, char *) +
+						l234_len;
+				k++;
+			}
+			nb_segs--;
+			seg = seg->next;
+		}
+		for (j = k; j <= nb_segs; j++) {
+			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
+			iovecs[j].iov_base = rte_pktmbuf_mtod(seg, void *);
+			seg = seg->next;
 		}
 		/* copy the tx frame data */
-		n = writev(txq->fd, iovecs, mbuf->nb_segs + 1);
+		n = writev(txq->fd, iovecs, j);
 		if (n <= 0)
 			break;
 
-- 
2.7.4


* [dpdk-dev] [PATCH v1 2/2] net/tap: support TSO (TCP Segment Offload)
  2018-04-09 22:33   ` [dpdk-dev] [PATCH v1 0/2] TAP TSO Ophir Munk
  2018-04-09 22:33     ` [dpdk-dev] [PATCH v1 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
@ 2018-04-09 22:33     ` Ophir Munk
  2018-04-22 11:30       ` [dpdk-dev] [PATCH v2 0/2] TAP TSO Ophir Munk
  1 sibling, 1 reply; 31+ messages in thread
From: Ophir Munk @ 2018-04-09 22:33 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

This commit implements TCP segmentation offload in TAP.
The librte_gso library is used to segment large TCP payloads (e.g. packets
of 64K bytes) into smaller MTU-sized buffers.
By supporting the TSO offload capability in software, a TAP device can be
used as a fail-safe sub-device and be paired with another PCI device which
supports TSO capability in HW.

For more details on the librte_gso implementation, please refer to the
DPDK documentation.
The number of newly generated TCP TSO segments is limited to 64.

Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 153 +++++++++++++++++++++++++++++++++++-------
 drivers/net/tap/rte_eth_tap.h |   4 ++
 mk/rte.app.mk                 |   4 +-
 4 files changed, 135 insertions(+), 28 deletions(-)

diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile
index ccc5c5f..3243365 100644
--- a/drivers/net/tap/Makefile
+++ b/drivers/net/tap/Makefile
@@ -24,7 +24,7 @@ CFLAGS += -I.
 CFLAGS += $(WERROR_FLAGS)
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
 LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash
-LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_bus_vdev -lrte_gso
 
 CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES)
 
diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index df23c4d..717a2b1 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -17,6 +17,7 @@
 #include <rte_ip.h>
 #include <rte_string_fns.h>
 
+#include <assert.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/socket.h>
@@ -408,7 +409,8 @@ tap_tx_offload_get_port_capa(void)
 	return DEV_TX_OFFLOAD_MULTI_SEGS |
 	       DEV_TX_OFFLOAD_IPV4_CKSUM |
 	       DEV_TX_OFFLOAD_UDP_CKSUM |
-	       DEV_TX_OFFLOAD_TCP_CKSUM;
+	       DEV_TX_OFFLOAD_TCP_CKSUM |
+	       DEV_TX_OFFLOAD_TCP_TSO;
 }
 
 static uint64_t
@@ -417,7 +419,8 @@ tap_tx_offload_get_queue_capa(void)
 	return DEV_TX_OFFLOAD_MULTI_SEGS |
 	       DEV_TX_OFFLOAD_IPV4_CKSUM |
 	       DEV_TX_OFFLOAD_UDP_CKSUM |
-	       DEV_TX_OFFLOAD_TCP_CKSUM;
+	       DEV_TX_OFFLOAD_TCP_CKSUM |
+	       DEV_TX_OFFLOAD_TCP_TSO;
 }
 
 static bool
@@ -486,38 +489,26 @@ tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
 	}
 }
 
-/* Callback to handle sending packets from the tap interface
- */
-static uint16_t
-pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+static inline void
+tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs,
+			struct rte_mbuf **pmbufs,
+			uint16_t *num_packets, unsigned long *num_tx_bytes)
 {
-	struct tx_queue *txq = queue;
-	uint16_t num_tx = 0;
-	unsigned long num_tx_bytes = 0;
-	uint32_t max_size;
 	int i;
 
-	if (unlikely(nb_pkts == 0))
-		return 0;
-
-	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
-	for (i = 0; i < nb_pkts; i++) {
-		struct rte_mbuf *mbuf = bufs[num_tx];
-		struct iovec iovecs[mbuf->nb_segs + 1];
+	for (i = 0; i < num_mbufs; i++) {
+		struct rte_mbuf *mbuf = pmbufs[i];
+		struct iovec iovecs[mbuf->nb_segs + 2];
 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
 		struct rte_mbuf *seg = mbuf;
 		char m_copy[mbuf->data_len];
 		int n;
 		int j;
-		int k; /* first index in iovecs for copying segments */
+		int k; /* current index in iovecs for copying segments */
 		uint16_t l234_len; /* length of layers 2,3,4 headers */
 		uint16_t seg_len; /* length of first segment */
 		uint16_t nb_segs;
 
-		/* stats.errs will be incremented */
-		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
-			break;
-
 		/*
 		 * TUN and TAP are created with IFF_NO_PI disabled.
 		 * For TUN PMD this is mandatory as fields are used by
@@ -581,13 +572,75 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		n = writev(txq->fd, iovecs, j);
 		if (n <= 0)
 			break;
+		(*num_packets)++;
+		(*num_tx_bytes) += rte_pktmbuf_pkt_len(mbuf);
+	}
+}
+
+/* Callback to handle sending packets from the tap interface
+ */
+static uint16_t
+pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct tx_queue *txq = queue;
+	uint16_t num_tx = 0;
+	uint16_t num_packets = 0;
+	unsigned long num_tx_bytes = 0;
+	uint16_t tso_segsz = 0;
+	uint32_t max_size;
+	int i;
+	uint64_t tso;
+	int ret;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS];
+	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
+	for (i = 0; i < nb_pkts; i++) {
+		struct rte_mbuf *mbuf_in = bufs[num_tx];
+		struct rte_mbuf **mbuf;
+		uint16_t num_mbufs;
+
+		tso = mbuf_in->ol_flags & PKT_TX_TCP_SEG;
+		if (tso) {
+			struct rte_gso_ctx *gso_ctx = &txq->gso_ctx;
+			assert(gso_ctx != NULL);
+			/* gso size is calculated without ETHER_CRC_LEN */
+			tso_segsz = mbuf_in->tso_segsz;
+			if (unlikely(tso_segsz == 0) ||
+				tso_segsz > max_size) {
+				txq->stats.errs++;
+				break;
+			}
+			gso_ctx->gso_size = tso_segsz;
+			ret = rte_gso_segment(mbuf_in, /* packet to segment */
+				gso_ctx, /* gso control block */
+				(struct rte_mbuf **)&gso_mbufs, /* out mbufs */
+				RTE_DIM(gso_mbufs)); /* max tso mbufs */
+
+			/* ret contains the number of newly created mbufs */
+			if (ret < 0)
+				break;
 
+			mbuf = gso_mbufs;
+			num_mbufs = ret;
+		} else {
+			/* stats.errs will be incremented */
+			if (rte_pktmbuf_pkt_len(mbuf_in) > max_size)
+				break;
+
+			mbuf = &mbuf_in;
+			num_mbufs = 1;
+		}
+
+		tap_write_mbufs(txq, num_mbufs, mbuf,
+				&num_packets, &num_tx_bytes);
 		num_tx++;
-		num_tx_bytes += mbuf->pkt_len;
-		rte_pktmbuf_free(mbuf);
+		rte_pktmbuf_free(mbuf_in);
 	}
 
-	txq->stats.opackets += num_tx;
+	txq->stats.opackets += num_packets;
 	txq->stats.errs += nb_pkts - num_tx;
 	txq->stats.obytes += num_tx_bytes;
 
@@ -1027,32 +1080,77 @@ tap_mac_set(struct rte_eth_dev *dev, struct ether_addr *mac_addr)
 	}
 }
 
+#define TAP_GSO_MBUFS_NUM 64
+#define TAP_GSO_MBUF_SEG_SIZE 128
+
+static int
+tap_gso_ctx_setup(struct rte_gso_ctx *gso_ctx, struct rte_eth_dev *dev)
+{
+	uint32_t gso_types;
+	char pool_name[64];
+
+	/* Create a private mbuf pool with 128 bytes per mbuf;
+	 * use this pool for both direct and indirect mbufs
+	 */
+
+	struct rte_mempool *mp;      /* Mempool for GSO packets */
+	/* initialize GSO context */
+	gso_types = DEV_TX_OFFLOAD_TCP_TSO | DEV_TX_OFFLOAD_VXLAN_TNL_TSO |
+		DEV_TX_OFFLOAD_GRE_TNL_TSO;
+	snprintf(pool_name, sizeof(pool_name), "mp_%s", dev->device->name);
+	mp = rte_mempool_lookup((const char *)pool_name);
+	if (!mp) {
+		mp = rte_pktmbuf_pool_create(pool_name, TAP_GSO_MBUFS_NUM,
+			0, 0, RTE_PKTMBUF_HEADROOM + TAP_GSO_MBUF_SEG_SIZE,
+			SOCKET_ID_ANY);
+		if (!mp) {
+			struct pmd_internals *pmd = dev->data->dev_private;
+			RTE_LOG(DEBUG, PMD, "%s: failed to create mbuf pool for device %s\n",
+				pmd->name, dev->device->name);
+			return -1;
+		}
+	}
+
+	gso_ctx->direct_pool = mp;
+	gso_ctx->indirect_pool = mp;
+	gso_ctx->gso_types = gso_types;
+	gso_ctx->gso_size = 0; /* gso_size is set in tx_burst() per packet */
+	gso_ctx->flag = 0;
+
+	return 0;
+}
+
 static int
 tap_setup_queue(struct rte_eth_dev *dev,
 		struct pmd_internals *internals,
 		uint16_t qid,
 		int is_rx)
 {
+	int ret;
 	int *fd;
 	int *other_fd;
 	const char *dir;
 	struct pmd_internals *pmd = dev->data->dev_private;
 	struct rx_queue *rx = &internals->rxq[qid];
 	struct tx_queue *tx = &internals->txq[qid];
+	struct rte_gso_ctx *gso_ctx;
 
 	if (is_rx) {
 		fd = &rx->fd;
 		other_fd = &tx->fd;
 		dir = "rx";
+		gso_ctx = NULL;
 	} else {
 		fd = &tx->fd;
 		other_fd = &rx->fd;
 		dir = "tx";
+		gso_ctx = &tx->gso_ctx;
 	}
 	if (*fd != -1) {
 		/* fd for this queue already exists */
 		RTE_LOG(DEBUG, PMD, "%s: fd %d for %s queue qid %d exists\n",
 			pmd->name, *fd, dir, qid);
+		gso_ctx = NULL;
 	} else if (*other_fd != -1) {
 		/* Only other_fd exists. dup it */
 		*fd = dup(*other_fd);
@@ -1079,6 +1177,11 @@ tap_setup_queue(struct rte_eth_dev *dev,
 
 	tx->mtu = &dev->data->mtu;
 	rx->rxmode = &dev->data->dev_conf.rxmode;
+	if (gso_ctx) {
+		ret = tap_gso_ctx_setup(gso_ctx, dev);
+		if (ret)
+			return -1;
+	}
 
 	return *fd;
 }
diff --git a/drivers/net/tap/rte_eth_tap.h b/drivers/net/tap/rte_eth_tap.h
index 53a506a..65da5f8 100644
--- a/drivers/net/tap/rte_eth_tap.h
+++ b/drivers/net/tap/rte_eth_tap.h
@@ -15,6 +15,7 @@
 
 #include <rte_ethdev_driver.h>
 #include <rte_ether.h>
+#include <rte_gso.h>
 
 #ifdef IFF_MULTI_QUEUE
 #define RTE_PMD_TAP_MAX_QUEUES	TAP_MAX_QUEUES
@@ -22,6 +23,8 @@
 #define RTE_PMD_TAP_MAX_QUEUES	1
 #endif
 
+#define MAX_GSO_MBUFS 64
+
 struct pkt_stats {
 	uint64_t opackets;              /* Number of output packets */
 	uint64_t ipackets;              /* Number of input packets */
@@ -50,6 +53,7 @@ struct tx_queue {
 	uint16_t *mtu;                  /* Pointer to MTU from dev_data */
 	uint16_t csum:1;                /* Enable checksum offloading */
 	struct pkt_stats stats;         /* Stats for this TX queue */
+	struct rte_gso_ctx gso_ctx;     /* GSO context */
 };
 
 struct pmd_internals {
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 005803a..62cf545 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -66,8 +66,6 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PORT)           += -lrte_port
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PDUMP)          += -lrte_pdump
 _LDLIBS-$(CONFIG_RTE_LIBRTE_DISTRIBUTOR)    += -lrte_distributor
 _LDLIBS-$(CONFIG_RTE_LIBRTE_IP_FRAG)        += -lrte_ip_frag
-_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
-_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METER)          += -lrte_meter
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LPM)            += -lrte_lpm
 # librte_acl needs --whole-archive because of weak functions
@@ -86,6 +84,8 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_EFD)            += -lrte_efd
 _LDLIBS-y += --whole-archive
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_CFGFILE)        += -lrte_cfgfile
+_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
+_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
 _LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)           += -lrte_hash
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MEMBER)         += -lrte_member
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)          += -lrte_vhost
-- 
2.7.4


* Re: [dpdk-dev] [RFC 0/2] TAP TSO Implementation
  2018-04-09 16:38 ` [dpdk-dev] [RFC 0/2] TAP TSO Implementation Ferruh Yigit
@ 2018-04-09 22:37   ` Ophir Munk
  2018-04-10 14:30     ` Ferruh Yigit
  0 siblings, 1 reply; 31+ messages in thread
From: Ophir Munk @ 2018-04-09 22:37 UTC (permalink / raw)
  To: Ferruh Yigit, dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern

Patch sent for this release.

> -----Original Message-----
> From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
> Sent: Monday, April 09, 2018 7:39 PM
> To: Ophir Munk <ophirmu@mellanox.com>; dev@dpdk.org; Pascal Mazon
> <pascal.mazon@6wind.com>
> Cc: Thomas Monjalon <thomas@monjalon.net>; Olga Shern
> <olgas@mellanox.com>
> Subject: Re: [dpdk-dev] [RFC 0/2] TAP TSO Implementation
> 
> On 3/9/2018 9:10 PM, Ophir Munk wrote:
> > This RFC suggests TAP TSO (TCP segmentation offload) implementation in
> SW.
> > It uses dpdk library rte_gso which is also used by testpmd.
> > Dpdk rte_gso library segments large TCP payloads (e.g. 64K bytes) into
> > smaller MTU size buffers.
> > By supporting TSO offload capability in software a TAP device can be
> > used as a failsafe sub device and be paired with another PCI device
> > which supports TSO capability in HW.
> >
> > This RFC includes 2 commits:
> > 1. Calculation of IP/TCP/UDP checksums for multi segments packets.
> > Previously checksum offload was skipped if the number of packet
> > segments was greater than 1.
> > This commit removes this limitation. It is required before supporting
> > TAP TSO since the generated small TCP packets may be composed by
> > themselves by more than one segment.
> > 2. Core TAP TSO implementation: calling rte_gso_segment() segments
> > large TCP packets.
> > To be added: creation of a small private mbuf pool in TAP required by
> librte_gso.
> > The number of buffers will be 64 - each of 128 bytes length.
> >
> > Ophir Munk (2):
> >   net/tap: calculate checksum for multi segs packets
> >   net/tap: implement TAP TSO
> 
> This is an RFC, and a v1 of the patch has not been sent. Is this still valid
> for this release, or should we push it into the next one?

V1 was sent for this release: 
https://dpdk.org/dev/patchwork/patch/37757/


* Re: [dpdk-dev] [RFC 0/2] TAP TSO Implementation
  2018-04-09 22:37   ` Ophir Munk
@ 2018-04-10 14:30     ` Ferruh Yigit
  2018-04-10 15:31       ` Ophir Munk
  0 siblings, 1 reply; 31+ messages in thread
From: Ferruh Yigit @ 2018-04-10 14:30 UTC (permalink / raw)
  To: Ophir Munk, dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern

On 4/9/2018 11:37 PM, Ophir Munk wrote:
> Patch sent for this release.
> 
>> -----Original Message-----
>> From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
>> Sent: Monday, April 09, 2018 7:39 PM
>> To: Ophir Munk <ophirmu@mellanox.com>; dev@dpdk.org; Pascal Mazon
>> <pascal.mazon@6wind.com>
>> Cc: Thomas Monjalon <thomas@monjalon.net>; Olga Shern
>> <olgas@mellanox.com>
>> Subject: Re: [dpdk-dev] [RFC 0/2] TAP TSO Implementation
>>
>> On 3/9/2018 9:10 PM, Ophir Munk wrote:
>>> This RFC suggests TAP TSO (TCP segmentation offload) implementation in
>> SW.
>>> It uses dpdk library rte_gso which is also used by testpmd.
>>> Dpdk rte_gso library segments large TCP payloads (e.g. 64K bytes) into
>>> smaller MTU size buffers.
>>> By supporting TSO offload capability in software a TAP device can be
>>> used as a failsafe sub device and be paired with another PCI device
>>> which supports TSO capability in HW.
>>>
>>> This RFC includes 2 commits:
>>> 1. Calculation of IP/TCP/UDP checksums for multi segments packets.
>>> Previously checksum offload was skipped if the number of packet
>>> segments was greater than 1.
>>> This commit removes this limitation. It is required before supporting
>>> TAP TSO since the generated small TCP packets may be composed by
>>> themselves by more than one segment.
>>> 2. Core TAP TSO implementation: calling rte_gso_segment() segments
>>> large TCP packets.
>>> To be added: creation of a small private mbuf pool in TAP required by
>> librte_gso.
>>> The number of buffers will be 64 - each of 128 bytes length.
>>>
>>> Ophir Munk (2):
>>>   net/tap: calculate checksum for multi segs packets
>>>   net/tap: implement TAP TSO
>>
>> This is an RFC, and a v1 of the patch has not been sent. Is this still valid
>> for this release, or should we push it into the next one?
> 
> V1 was sent for this release: 
> https://dpdk.org/dev/patchwork/patch/37757/

"was sent" :) that patch is same date with this mail...


* Re: [dpdk-dev] [RFC 0/2] TAP TSO Implementation
  2018-04-10 14:30     ` Ferruh Yigit
@ 2018-04-10 15:31       ` Ophir Munk
  0 siblings, 0 replies; 31+ messages in thread
From: Ophir Munk @ 2018-04-10 15:31 UTC (permalink / raw)
  To: Ferruh Yigit, dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern



> -----Original Message-----
> From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
> Sent: Tuesday, April 10, 2018 5:31 PM
> To: Ophir Munk <ophirmu@mellanox.com>; dev@dpdk.org; Pascal Mazon
> <pascal.mazon@6wind.com>
> Cc: Thomas Monjalon <thomas@monjalon.net>; Olga Shern
> <olgas@mellanox.com>
> Subject: Re: [dpdk-dev] [RFC 0/2] TAP TSO Implementation
> 
> On 4/9/2018 11:37 PM, Ophir Munk wrote:
> > Patch sent for this release.
> >
> >> -----Original Message-----
> >> From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
> >> Sent: Monday, April 09, 2018 7:39 PM
> >> To: Ophir Munk <ophirmu@mellanox.com>; dev@dpdk.org; Pascal Mazon
> >> <pascal.mazon@6wind.com>
> >> Cc: Thomas Monjalon <thomas@monjalon.net>; Olga Shern
> >> <olgas@mellanox.com>
> >> Subject: Re: [dpdk-dev] [RFC 0/2] TAP TSO Implementation
> >>
> >> On 3/9/2018 9:10 PM, Ophir Munk wrote:
> >>> This RFC suggests TAP TSO (TCP segmentation offload) implementation
> >>> in
> >> SW.
> >>> It uses dpdk library rte_gso which is also used by testpmd.
> >>> Dpdk rte_gso library segments large TCP payloads (e.g. 64K bytes)
> >>> into smaller MTU size buffers.
> >>> By supporting TSO offload capability in software a TAP device can be
> >>> used as a failsafe sub device and be paired with another PCI device
> >>> which supports TSO capability in HW.
> >>>
> >>> This RFC includes 2 commits:
> >>> 1. Calculation of IP/TCP/UDP checksums for multi segments packets.
> >>> Previously checksum offload was skipped if the number of packet
> >>> segments was greater than 1.
> >>> This commit removes this limitation. It is required before
> >>> supporting TAP TSO since the generated small TCP packets may be
> >>> composed by themselves by more than one segment.
> >>> 2. Core TAP TSO implementation: calling rte_gso_segment() segments
> >>> large TCP packets.
> >>> To be added: creation of a small private mbuf pool in TAP required
> >>> by
> >> librte_gso.
> >>> The number of buffers will be 64 - each of 128 bytes length.
> >>>
> >>> Ophir Munk (2):
> >>>   net/tap: calculate checksum for multi segs packets
> >>>   net/tap: implement TAP TSO
> >>
> >> This is an RFC, and a v1 of the patch has not been sent. Is this still
> >> valid for this release, or should we push it into the next one?
> >
> > V1 was sent for this release:
> > https://dpdk.org/dev/patchwork/patch/37757/
> 
> "was sent" :) that patch is same date with this mail...

I hope it has an earlier delivery time though... :)



* [dpdk-dev] [PATCH v2 0/2] TAP TSO
  2018-04-09 22:33     ` [dpdk-dev] [PATCH v1 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
@ 2018-04-22 11:30       ` Ophir Munk
  2018-04-22 11:30         ` [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
  2018-04-22 11:30         ` [dpdk-dev] [PATCH v2 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  0 siblings, 2 replies; 31+ messages in thread
From: Ophir Munk @ 2018-04-22 11:30 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

v1:
- Initial release
v2:
- Fix checksum calculation errors
- TCP segment size now refers to the TCP payload size (not including the
  l2,l3,l4 headers)

This patch implements TAP TSO (TCP segmentation offload) in SW.
It uses the DPDK librte_gso library.
The librte_gso library segments large TCP payloads (e.g. 64K bytes)
into smaller-sized buffers.
By supporting the TSO offload capability in software, a TAP device can be
used as a fail-safe sub-device and be paired with another PCI device which
supports TSO capability in HW.

This patch includes 2 commits:
1. Calculation of IP/TCP/UDP checksums for multi-segment packets.
Previously, checksum offload was skipped if the number of packet segments
was greater than 1.
This commit removes this limitation. It is required before supporting TAP
TSO, since a generated TCP TSO packet may be composed of two segments,
where the first segment includes the l2,l3,l4 headers.
2. TAP TSO implementation: calling rte_gso_segment() to segment large TCP
packets.
This commit creates a small private mbuf pool in the TAP PMD, required by
librte_gso. The pool contains 64 buffers, each 128 bytes long.
The TSO segment size refers to the TCP payload size (not including the
l2,l3,l4 headers), so the gso_size handed to librte_gso must add the header
lengths back, as sketched below.
librte_gso supports TCP segmentation over IPv4.
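
The size arithmetic, condensed from patch 2/2 (the example numbers are
illustrative only):

/* tso_segsz carries just the TCP payload size, while librte_gso's
 * gso_size covers headers + payload. E.g. with l2=14, l3=20, l4=20:
 * tso_segsz = 1446  ->  gso_size = 1446 + 54 = 1500.
 */
uint16_t hdrs_len = mbuf_in->l2_len + mbuf_in->l3_len + mbuf_in->l4_len;

gso_ctx->gso_size = mbuf_in->tso_segsz + hdrs_len;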

Ophir Munk (2):
  net/tap: calculate checksums of multi segs packets
  net/tap: support TSO (TCP Segment Offload)

 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 305 ++++++++++++++++++++++++++++++++----------
 drivers/net/tap/rte_eth_tap.h |   4 +
 mk/rte.app.mk                 |   4 +-
 4 files changed, 239 insertions(+), 76 deletions(-)

-- 
2.7.4


* [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets
  2018-04-22 11:30       ` [dpdk-dev] [PATCH v2 0/2] TAP TSO Ophir Munk
@ 2018-04-22 11:30         ` Ophir Munk
  2018-05-07 21:54           ` [dpdk-dev] [PATCH v3 0/2] TAP TSO Ophir Munk
  2018-05-31 13:52           ` [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets Ferruh Yigit
  2018-04-22 11:30         ` [dpdk-dev] [PATCH v2 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  1 sibling, 2 replies; 31+ messages in thread
From: Ophir Munk @ 2018-04-22 11:30 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

Prior to this commit, IP/UDP/TCP checksum offload calculations
were skipped in the case of a multi-segment packet.
This commit enables TAP checksum calculations for multi-segment
packets.
The only restriction is that the first segment must contain the
headers of layers 3 (IP) and 4 (UDP or TCP).
The l4 checksum is now computed in two steps, sketched below: a raw sum is
accumulated across segments and then folded with the pseudo-header sum.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/rte_eth_tap.c | 154 ++++++++++++++++++++++++++++--------------
 1 file changed, 104 insertions(+), 50 deletions(-)

diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index 66e026f..d77a64f 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -436,12 +436,43 @@ tap_txq_are_offloads_valid(struct rte_eth_dev *dev, uint64_t offloads)
 	return true;
 }
 
+/* Finalize l4 checksum calculation */
 static void
-tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
-	       unsigned int l3_len)
+tap_tx_l4_cksum(uint16_t *l4_cksum, uint16_t l4_phdr_cksum,
+		uint32_t l4_raw_cksum)
 {
-	void *l3_hdr = packet + l2_len;
+	if (l4_cksum) {
+		uint32_t cksum;
+
+		cksum = __rte_raw_cksum_reduce(l4_raw_cksum);
+		cksum += l4_phdr_cksum;
+
+		cksum = ((cksum & 0xffff0000) >> 16) + (cksum & 0xffff);
+		cksum = (~cksum) & 0xffff;
+		if (cksum == 0)
+			cksum = 0xffff;
+		*l4_cksum = cksum;
+	}
+}
 
+/* Accumulate L4 raw checksums */
+static void
+tap_tx_l4_add_rcksum(char *l4_data, unsigned int l4_len, uint16_t *l4_cksum,
+			uint32_t *l4_raw_cksum)
+{
+	if (l4_cksum == NULL)
+		return;
+
+	*l4_raw_cksum = __rte_raw_cksum(l4_data, l4_len, *l4_raw_cksum);
+}
+
+/* L3 checksum and L4 pseudo-header checksum offloads */
+static void
+tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
+		unsigned int l3_len, unsigned int l4_len, uint16_t **l4_cksum,
+		uint16_t *l4_phdr_cksum, uint32_t *l4_raw_cksum)
+{
+	void *l3_hdr = packet + l2_len;
 	if (ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4)) {
 		struct ipv4_hdr *iph = l3_hdr;
 		uint16_t cksum;
@@ -451,38 +482,21 @@ tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
 		iph->hdr_checksum = (cksum == 0xffff) ? cksum : ~cksum;
 	}
 	if (ol_flags & PKT_TX_L4_MASK) {
-		uint16_t l4_len;
-		uint32_t cksum;
-		uint16_t *l4_cksum;
 		void *l4_hdr;
 
 		l4_hdr = packet + l2_len + l3_len;
 		if ((ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM)
-			l4_cksum = &((struct udp_hdr *)l4_hdr)->dgram_cksum;
+			*l4_cksum = &((struct udp_hdr *)l4_hdr)->dgram_cksum;
 		else if ((ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM)
-			l4_cksum = &((struct tcp_hdr *)l4_hdr)->cksum;
+			*l4_cksum = &((struct tcp_hdr *)l4_hdr)->cksum;
 		else
 			return;
-		*l4_cksum = 0;
-		if (ol_flags & PKT_TX_IPV4) {
-			struct ipv4_hdr *iph = l3_hdr;
-
-			l4_len = rte_be_to_cpu_16(iph->total_length) - l3_len;
-			cksum = rte_ipv4_phdr_cksum(l3_hdr, 0);
-		} else {
-			struct ipv6_hdr *ip6h = l3_hdr;
-
-			/* payload_len does not include ext headers */
-			l4_len = rte_be_to_cpu_16(ip6h->payload_len) -
-				l3_len + sizeof(struct ipv6_hdr);
-			cksum = rte_ipv6_phdr_cksum(l3_hdr, 0);
-		}
-		cksum += rte_raw_cksum(l4_hdr, l4_len);
-		cksum = ((cksum & 0xffff0000) >> 16) + (cksum & 0xffff);
-		cksum = (~cksum) & 0xffff;
-		if (cksum == 0)
-			cksum = 0xffff;
-		*l4_cksum = cksum;
+		**l4_cksum = 0;
+		if (ol_flags & PKT_TX_IPV4)
+			*l4_phdr_cksum = rte_ipv4_phdr_cksum(l3_hdr, 0);
+		else
+			*l4_phdr_cksum = rte_ipv6_phdr_cksum(l3_hdr, 0);
+		*l4_raw_cksum = __rte_raw_cksum(l4_hdr, l4_len, 0);
 	}
 }
 
@@ -503,17 +517,25 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
 	for (i = 0; i < nb_pkts; i++) {
 		struct rte_mbuf *mbuf = bufs[num_tx];
-		struct iovec iovecs[mbuf->nb_segs + 1];
+		struct iovec iovecs[mbuf->nb_segs + 2];
 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
 		struct rte_mbuf *seg = mbuf;
 		char m_copy[mbuf->data_len];
+		int proto;
 		int n;
 		int j;
+		int k; /* first index in iovecs for copying segments */
+		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
+		uint16_t seg_len; /* length of first segment */
+		uint16_t nb_segs;
+		uint16_t *l4_cksum; /* l4 checksum (pseudo header + payload) */
+		uint32_t l4_raw_cksum = 0; /* TCP/UDP payload raw checksum */
+		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header checksum */
 
 		/* stats.errs will be incremented */
 		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
 			break;
-
+		l4_cksum = NULL;
 		/*
 		 * TUN and TAP are created with IFF_NO_PI disabled.
 		 * For TUN PMD this is mandatory as fields are used by
@@ -525,34 +547,66 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		 * value 0x00 is taken for protocol field.
 		 */
 		char *buff_data = rte_pktmbuf_mtod(seg, void *);
-		j = (*buff_data & 0xf0);
-		pi.proto = (j == 0x40) ? 0x0008 :
-				(j == 0x60) ? 0xdd86 : 0x00;
-
-		iovecs[0].iov_base = &pi;
-		iovecs[0].iov_len = sizeof(pi);
-		for (j = 1; j <= mbuf->nb_segs; j++) {
-			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
-			iovecs[j].iov_base =
-				rte_pktmbuf_mtod(seg, void *);
-			seg = seg->next;
-		}
+		proto = (*buff_data & 0xf0);
+		pi.proto = (proto == 0x40) ? 0x0008 :
+				(proto == 0x60) ? 0xdd86 : 0x00;
+
+		k = 0;
+		iovecs[k].iov_base = &pi;
+		iovecs[k].iov_len = sizeof(pi);
+		k++;
+		nb_segs = mbuf->nb_segs;
 		if (txq->csum &&
 		    ((mbuf->ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4) ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM))) {
-			/* Support only packets with all data in the same seg */
-			if (mbuf->nb_segs > 1)
+			/* Support only packets with at least layer 4
+			 * header included in the first segment
+			 */
+			seg_len = rte_pktmbuf_data_len(mbuf);
+			l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
+			if (seg_len < l234_hlen)
 				break;
-			/* To change checksums, work on a copy of data. */
+
+			/* To change checksums, work on a
+			 * copy of l2, l3, l4 headers.
+			 */
 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
-				   rte_pktmbuf_data_len(mbuf));
-			tap_tx_offload(m_copy, mbuf->ol_flags,
-				       mbuf->l2_len, mbuf->l3_len);
-			iovecs[1].iov_base = m_copy;
+					l234_hlen);
+			tap_tx_l3_cksum(m_copy, mbuf->ol_flags,
+				       mbuf->l2_len, mbuf->l3_len, mbuf->l4_len,
+				       &l4_cksum, &l4_phdr_cksum,
+				       &l4_raw_cksum);
+			iovecs[k].iov_base = m_copy;
+			iovecs[k].iov_len = l234_hlen;
+			k++;
+			/* Update next iovecs[] beyond l2, l3, l4 headers */
+			if (seg_len > l234_hlen) {
+				iovecs[k].iov_len = seg_len - l234_hlen;
+				iovecs[k].iov_base =
+					rte_pktmbuf_mtod(seg, char *) +
+						l234_hlen;
+				tap_tx_l4_add_rcksum(iovecs[k].iov_base,
+					iovecs[k].iov_len, l4_cksum,
+					&l4_raw_cksum);
+				k++;
+				nb_segs++;
+			}
+			seg = seg->next;
 		}
+		for (j = k; j <= nb_segs; j++) {
+			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
+			iovecs[j].iov_base = rte_pktmbuf_mtod(seg, void *);
+			tap_tx_l4_add_rcksum(iovecs[j].iov_base,
+				iovecs[j].iov_len, l4_cksum,
+				&l4_raw_cksum);
+			seg = seg->next;
+		}
+
+		tap_tx_l4_cksum(l4_cksum, l4_phdr_cksum, l4_raw_cksum);
+
 		/* copy the tx frame data */
-		n = writev(txq->fd, iovecs, mbuf->nb_segs + 1);
+		n = writev(txq->fd, iovecs, j);
 		if (n <= 0)
 			break;
 
-- 
2.7.4


* [dpdk-dev] [PATCH v2 2/2] net/tap: support TSO (TCP Segment Offload)
  2018-04-22 11:30       ` [dpdk-dev] [PATCH v2 0/2] TAP TSO Ophir Munk
  2018-04-22 11:30         ` [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
@ 2018-04-22 11:30         ` Ophir Munk
  1 sibling, 0 replies; 31+ messages in thread
From: Ophir Munk @ 2018-04-22 11:30 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

This commit implements TCP segmentation offload in TAP.
The librte_gso library is used to segment large TCP payloads (e.g. packets
of 64K bytes) into smaller MTU-sized buffers.
By supporting the TSO offload capability in software, a TAP device can be
used as a fail-safe sub-device and be paired with another PCI device which
supports TSO capability in HW.

For more details on the librte_gso implementation, please refer to the
DPDK documentation.
The number of newly generated TCP TSO segments is limited to 64.

Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 159 +++++++++++++++++++++++++++++++++++-------
 drivers/net/tap/rte_eth_tap.h |   4 ++
 mk/rte.app.mk                 |   4 +-
 4 files changed, 139 insertions(+), 30 deletions(-)

diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile
index ccc5c5f..3243365 100644
--- a/drivers/net/tap/Makefile
+++ b/drivers/net/tap/Makefile
@@ -24,7 +24,7 @@ CFLAGS += -I.
 CFLAGS += $(WERROR_FLAGS)
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
 LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash
-LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_bus_vdev -lrte_gso
 
 CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES)
 
diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index d77a64f..fe62ab3 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -17,6 +17,7 @@
 #include <rte_ip.h>
 #include <rte_string_fns.h>
 
+#include <assert.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/socket.h>
@@ -408,7 +409,8 @@ tap_tx_offload_get_port_capa(void)
 	return DEV_TX_OFFLOAD_MULTI_SEGS |
 	       DEV_TX_OFFLOAD_IPV4_CKSUM |
 	       DEV_TX_OFFLOAD_UDP_CKSUM |
-	       DEV_TX_OFFLOAD_TCP_CKSUM;
+	       DEV_TX_OFFLOAD_TCP_CKSUM |
+	       DEV_TX_OFFLOAD_TCP_TSO;
 }
 
 static uint64_t
@@ -417,7 +419,8 @@ tap_tx_offload_get_queue_capa(void)
 	return DEV_TX_OFFLOAD_MULTI_SEGS |
 	       DEV_TX_OFFLOAD_IPV4_CKSUM |
 	       DEV_TX_OFFLOAD_UDP_CKSUM |
-	       DEV_TX_OFFLOAD_TCP_CKSUM;
+	       DEV_TX_OFFLOAD_TCP_CKSUM |
+	       DEV_TX_OFFLOAD_TCP_TSO;
 }
 
 static bool
@@ -500,23 +503,15 @@ tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
 	}
 }
 
-/* Callback to handle sending packets from the tap interface
- */
-static uint16_t
-pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+static inline void
+tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs,
+			struct rte_mbuf **pmbufs, uint16_t l234_hlen,
+			uint16_t *num_packets, unsigned long *num_tx_bytes)
 {
-	struct tx_queue *txq = queue;
-	uint16_t num_tx = 0;
-	unsigned long num_tx_bytes = 0;
-	uint32_t max_size;
 	int i;
 
-	if (unlikely(nb_pkts == 0))
-		return 0;
-
-	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
-	for (i = 0; i < nb_pkts; i++) {
-		struct rte_mbuf *mbuf = bufs[num_tx];
+	for (i = 0; i < num_mbufs; i++) {
+		struct rte_mbuf *mbuf = pmbufs[i];
 		struct iovec iovecs[mbuf->nb_segs + 2];
 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
 		struct rte_mbuf *seg = mbuf;
@@ -524,17 +519,13 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		int proto;
 		int n;
 		int j;
-		int k; /* first index in iovecs for copying segments */
-		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
+		int k; /* current index in iovecs for copying segments */
 		uint16_t seg_len; /* length of first segment */
 		uint16_t nb_segs;
 		uint16_t *l4_cksum; /* l4 checksum (pseudo header + payload) */
 		uint32_t l4_raw_cksum = 0; /* TCP/UDP payload raw checksum */
 		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header checksum */
 
-		/* stats.errs will be incremented */
-		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
-			break;
 		l4_cksum = NULL;
 		/*
 		 * TUN and TAP are created with IFF_NO_PI disabled.
@@ -567,9 +558,8 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
 			if (seg_len < l234_hlen)
 				break;
-
-			/* To change checksums, work on a
-			 * copy of l2, l3, l4 headers.
+			/* To change checksums, work on a copy of l2, l3
+			 * headers + l4 pseudo header
 			 */
 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
 					l234_hlen);
@@ -609,13 +599,78 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		n = writev(txq->fd, iovecs, j);
 		if (n <= 0)
 			break;
+		(*num_packets)++;
+		(*num_tx_bytes) += rte_pktmbuf_pkt_len(mbuf);
+	}
+}
+
+/* Callback to handle sending packets from the tap interface
+ */
+static uint16_t
+pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct tx_queue *txq = queue;
+	uint16_t num_tx = 0;
+	uint16_t num_packets = 0;
+	unsigned long num_tx_bytes = 0;
+	uint16_t tso_segsz = 0;
+	uint16_t hdrs_len;
+	uint32_t max_size;
+	int i;
+	uint64_t tso;
+	int ret;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS];
+	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
+	for (i = 0; i < nb_pkts; i++) {
+		struct rte_mbuf *mbuf_in = bufs[num_tx];
+		struct rte_mbuf **mbuf;
+		uint16_t num_mbufs;
+
+		tso = mbuf_in->ol_flags & PKT_TX_TCP_SEG;
+		if (tso) {
+			struct rte_gso_ctx *gso_ctx = &txq->gso_ctx;
+			assert(gso_ctx != NULL);
+			/* gso size is calculated without ETHER_CRC_LEN */
+			hdrs_len = mbuf_in->l2_len + mbuf_in->l3_len +
+					mbuf_in->l4_len;
+			tso_segsz = mbuf_in->tso_segsz + hdrs_len;
+			if (unlikely(tso_segsz == hdrs_len) ||
+				tso_segsz > max_size) {
+				txq->stats.errs++;
+				break;
+			}
+			gso_ctx->gso_size = tso_segsz;
+			ret = rte_gso_segment(mbuf_in, /* packet to segment */
+				gso_ctx, /* gso control block */
+				(struct rte_mbuf **)&gso_mbufs, /* out mbufs */
+				RTE_DIM(gso_mbufs)); /* max tso mbufs */
+
+			/* ret contains the number of newly created mbufs */
+			if (ret < 0)
+				break;
 
+			mbuf = gso_mbufs;
+			num_mbufs = ret;
+		} else {
+			/* stats.errs will be incremented */
+			if (rte_pktmbuf_pkt_len(mbuf_in) > max_size)
+				break;
+
+			mbuf = &mbuf_in;
+			num_mbufs = 1;
+		}
+
+		tap_write_mbufs(txq, num_mbufs, mbuf, hdrs_len,
+				&num_packets, &num_tx_bytes);
 		num_tx++;
-		num_tx_bytes += mbuf->pkt_len;
-		rte_pktmbuf_free(mbuf);
+		rte_pktmbuf_free(mbuf_in);
 	}
 
-	txq->stats.opackets += num_tx;
+	txq->stats.opackets += num_packets;
 	txq->stats.errs += nb_pkts - num_tx;
 	txq->stats.obytes += num_tx_bytes;
 
@@ -1064,32 +1119,77 @@ tap_mac_set(struct rte_eth_dev *dev, struct ether_addr *mac_addr)
 	return 0;
 }
 
+#define TAP_GSO_MBUFS_NUM 64
+#define TAP_GSO_MBUF_SEG_SIZE 128
+
+static int
+tap_gso_ctx_setup(struct rte_gso_ctx *gso_ctx, struct rte_eth_dev *dev)
+{
+	uint32_t gso_types;
+	char pool_name[64];
+
+	/* Create a private mbuf pool with 128 bytes of data room per mbuf;
+	 * use this pool for both direct and indirect mbufs
+	 */
+
+	struct rte_mempool *mp;      /* Mempool for GSO packets */
+	/* initialize GSO context */
+	gso_types = DEV_TX_OFFLOAD_TCP_TSO | DEV_TX_OFFLOAD_VXLAN_TNL_TSO |
+		DEV_TX_OFFLOAD_GRE_TNL_TSO;
+	snprintf(pool_name, sizeof(pool_name), "mp_%s", dev->device->name);
+	mp = rte_mempool_lookup((const char *)pool_name);
+	if (!mp) {
+		mp = rte_pktmbuf_pool_create(pool_name, TAP_GSO_MBUFS_NUM,
+			0, 0, RTE_PKTMBUF_HEADROOM + TAP_GSO_MBUF_SEG_SIZE,
+			SOCKET_ID_ANY);
+		if (!mp) {
+			struct pmd_internals *pmd = dev->data->dev_private;
+			RTE_LOG(DEBUG, PMD, "%s: failed to create mbuf pool for device %s\n",
+				pmd->name, dev->device->name);
+			return -1;
+		}
+	}
+
+	gso_ctx->direct_pool = mp;
+	gso_ctx->indirect_pool = mp;
+	gso_ctx->gso_types = gso_types;
+	gso_ctx->gso_size = 0; /* gso_size is set in tx_burst() per packet */
+	gso_ctx->flag = 0;
+
+	return 0;
+}
+
 static int
 tap_setup_queue(struct rte_eth_dev *dev,
 		struct pmd_internals *internals,
 		uint16_t qid,
 		int is_rx)
 {
+	int ret;
 	int *fd;
 	int *other_fd;
 	const char *dir;
 	struct pmd_internals *pmd = dev->data->dev_private;
 	struct rx_queue *rx = &internals->rxq[qid];
 	struct tx_queue *tx = &internals->txq[qid];
+	struct rte_gso_ctx *gso_ctx;
 
 	if (is_rx) {
 		fd = &rx->fd;
 		other_fd = &tx->fd;
 		dir = "rx";
+		gso_ctx = NULL;
 	} else {
 		fd = &tx->fd;
 		other_fd = &rx->fd;
 		dir = "tx";
+		gso_ctx = &tx->gso_ctx;
 	}
 	if (*fd != -1) {
 		/* fd for this queue already exists */
 		RTE_LOG(DEBUG, PMD, "%s: fd %d for %s queue qid %d exists\n",
 			pmd->name, *fd, dir, qid);
+		gso_ctx = NULL;
 	} else if (*other_fd != -1) {
 		/* Only other_fd exists. dup it */
 		*fd = dup(*other_fd);
@@ -1116,6 +1216,11 @@ tap_setup_queue(struct rte_eth_dev *dev,
 
 	tx->mtu = &dev->data->mtu;
 	rx->rxmode = &dev->data->dev_conf.rxmode;
+	if (gso_ctx) {
+		ret = tap_gso_ctx_setup(gso_ctx, dev);
+		if (ret)
+			return -1;
+	}
 
 	return *fd;
 }
diff --git a/drivers/net/tap/rte_eth_tap.h b/drivers/net/tap/rte_eth_tap.h
index 25b65bf..69f746f 100644
--- a/drivers/net/tap/rte_eth_tap.h
+++ b/drivers/net/tap/rte_eth_tap.h
@@ -15,6 +15,7 @@
 
 #include <rte_ethdev_driver.h>
 #include <rte_ether.h>
+#include <rte_gso.h>
 
 #ifdef IFF_MULTI_QUEUE
 #define RTE_PMD_TAP_MAX_QUEUES	TAP_MAX_QUEUES
@@ -22,6 +23,8 @@
 #define RTE_PMD_TAP_MAX_QUEUES	1
 #endif
 
+#define MAX_GSO_MBUFS 64
+
 struct pkt_stats {
 	uint64_t opackets;              /* Number of output packets */
 	uint64_t ipackets;              /* Number of input packets */
@@ -50,6 +53,7 @@ struct tx_queue {
 	uint16_t *mtu;                  /* Pointer to MTU from dev_data */
 	uint16_t csum:1;                /* Enable checksum offloading */
 	struct pkt_stats stats;         /* Stats for this TX queue */
+	struct rte_gso_ctx gso_ctx;     /* GSO context */
 };
 
 struct pmd_internals {
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 0e18d0f..cd09dc6 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -66,8 +66,6 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PORT)           += -lrte_port
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PDUMP)          += -lrte_pdump
 _LDLIBS-$(CONFIG_RTE_LIBRTE_DISTRIBUTOR)    += -lrte_distributor
 _LDLIBS-$(CONFIG_RTE_LIBRTE_IP_FRAG)        += -lrte_ip_frag
-_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
-_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METER)          += -lrte_meter
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LPM)            += -lrte_lpm
 # librte_acl needs --whole-archive because of weak functions
@@ -85,6 +83,8 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_EFD)            += -lrte_efd
 _LDLIBS-y += --whole-archive
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_CFGFILE)        += -lrte_cfgfile
+_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
+_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
 _LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)           += -lrte_hash
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MEMBER)         += -lrte_member
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)          += -lrte_vhost
-- 
2.7.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v3 0/2] TAP TSO
  2018-04-22 11:30         ` [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
@ 2018-05-07 21:54           ` Ophir Munk
  2018-05-07 21:54             ` [dpdk-dev] [PATCH v3 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
  2018-05-07 21:54             ` [dpdk-dev] [PATCH v3 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  2018-05-31 13:52           ` [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets Ferruh Yigit
  1 sibling, 2 replies; 31+ messages in thread
From: Ophir Munk @ 2018-05-07 21:54 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

v1: 
- Initial release
v2: 
- Fixing cksum errors
- TCP segment size refers to TCP payload size (not including l2,l3,l4 headers)
v3: 
- Bug fixing in case input mbuf is segmented
- Following review comments by Raslan Darawsheh

This patch implements TAP TSO (TCP segmentation offload) in SW.
It uses the dpdk library librte_gso.
The dpdk librte_gso library segments large TCP payloads (e.g. 64K bytes)
into smaller buffers.
By supporting TSO offload capability in software a TAP device can be used
as a failsafe sub device and be paired with another PCI device which
supports TSO capability in HW.

This patch includes 2 commits:
1. Calculation of IP/TCP/UDP checksums for multi segments packets.
Previously checksum offload was skipped if the number of packet segments
was greater than 1.
This commit removes this limitation. It is required before supporting TAP TSO,
since a TCP packet generated by TSO may be composed of two segments, where the
first segment includes the l2,l3,l4 headers.
2. TAP TSO implementation: calling rte_gso_segment() to segment large TCP packets.
This commit creates a small private mbuf pool in the TAP PMD, required by librte_gso.
The number of buffers will be 64, each 128 bytes in length.
The TSO segment size refers to the TCP payload size (not including l2,l3,l4 headers).
librte_gso supports TCP segmentation above IPv4.
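
For readers unfamiliar with librte_gso, a minimal usage sketch follows. This is
illustrative only, not patch code; "mp" and "m" are hypothetical (a pre-created
mempool and an input mbuf with PKT_TX_TCP_SEG and l2/l3/l4 lengths set):

    struct rte_gso_ctx ctx = {
        .direct_pool = mp,      /* hypothetical pre-created mempool */
        .indirect_pool = mp,
        .gso_types = DEV_TX_OFFLOAD_TCP_TSO,
        .gso_size = 1514,       /* l2+l3+l4 headers + payload, no CRC */
        .flag = 0,
    };
    struct rte_mbuf *out[64];
    int nb_out;

    nb_out = rte_gso_segment(m, &ctx, out, RTE_DIM(out));
    if (nb_out < 0)
        return nb_out;          /* segmentation failed */
    /* out[0..nb_out-1] now hold the MTU-sized segments to transmit */
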
Ophir Munk (2):
  net/tap: calculate checksums of multi segs packets
  net/tap: support TSO (TCP Segment Offload)

 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 308 ++++++++++++++++++++++++++++++++----------
 drivers/net/tap/rte_eth_tap.h |   4 +
 mk/rte.app.mk                 |   4 +-
 4 files changed, 243 insertions(+), 75 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v3 1/2] net/tap: calculate checksums of multi segs packets
  2018-05-07 21:54           ` [dpdk-dev] [PATCH v3 0/2] TAP TSO Ophir Munk
@ 2018-05-07 21:54             ` Ophir Munk
  2018-05-07 21:54             ` [dpdk-dev] [PATCH v3 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  1 sibling, 0 replies; 31+ messages in thread
From: Ophir Munk @ 2018-05-07 21:54 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

Prior to this commit, IP/UDP/TCP checksum offload calculations
were skipped in the case of a multi-segment packet.
This commit enables TAP checksum calculations for multi-segment
packets.
The only restriction is that the first segment must contain the
headers of layers 3 (IP) and 4 (UDP or TCP).
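
In outline, the calculation accumulates a raw checksum over the l4 bytes of
every segment and folds it together with the pseudo-header checksum at the end.
A condensed sketch (illustrative, not the patch code; "l4_data()" and "l4_len()"
are hypothetical helpers returning each segment's l4 bytes and their length,
while "mbuf", "l3_hdr" and "l4_cksum" come from the surrounding tx path):

    uint32_t raw = 0;
    uint16_t phdr = rte_ipv4_phdr_cksum(l3_hdr, 0); /* pseudo header */
    struct rte_mbuf *seg;

    /* accumulate the raw checksum over the l4 bytes of every segment */
    for (seg = mbuf; seg != NULL; seg = seg->next)
        raw = __rte_raw_cksum(l4_data(seg), l4_len(seg), raw);

    /* fold to 16 bits, add the pseudo header, complement */
    uint32_t cksum = __rte_raw_cksum_reduce(raw) + phdr;
    cksum = ((cksum & 0xffff0000) >> 16) + (cksum & 0xffff);
    cksum = (~cksum) & 0xffff;
    *l4_cksum = cksum ? cksum : 0xffff; /* 0 is transmitted as 0xffff */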

Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/rte_eth_tap.c | 158 +++++++++++++++++++++++++++++-------------
 1 file changed, 108 insertions(+), 50 deletions(-)

diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index 172a7ba..538acae 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -424,12 +424,43 @@ tap_txq_are_offloads_valid(struct rte_eth_dev *dev, uint64_t offloads)
 	return true;
 }
 
+/* Finalize l4 checksum calculation */
 static void
-tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
-	       unsigned int l3_len)
+tap_tx_l4_cksum(uint16_t *l4_cksum, uint16_t l4_phdr_cksum,
+		uint32_t l4_raw_cksum)
 {
-	void *l3_hdr = packet + l2_len;
+	if (l4_cksum) {
+		uint32_t cksum;
+
+		cksum = __rte_raw_cksum_reduce(l4_raw_cksum);
+		cksum += l4_phdr_cksum;
+
+		cksum = ((cksum & 0xffff0000) >> 16) + (cksum & 0xffff);
+		cksum = (~cksum) & 0xffff;
+		if (cksum == 0)
+			cksum = 0xffff;
+		*l4_cksum = cksum;
+	}
+}
 
+/* Accumulate L4 raw checksums */
+static void
+tap_tx_l4_add_rcksum(char *l4_data, unsigned int l4_len, uint16_t *l4_cksum,
+			uint32_t *l4_raw_cksum)
+{
+	if (l4_cksum == NULL)
+		return;
+
+	*l4_raw_cksum = __rte_raw_cksum(l4_data, l4_len, *l4_raw_cksum);
+}
+
+/* L3 and L4 pseudo headers checksum offloads */
+static void
+tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
+		unsigned int l3_len, unsigned int l4_len, uint16_t **l4_cksum,
+		uint16_t *l4_phdr_cksum, uint32_t *l4_raw_cksum)
+{
+	void *l3_hdr = packet + l2_len;
 	if (ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4)) {
 		struct ipv4_hdr *iph = l3_hdr;
 		uint16_t cksum;
@@ -439,38 +470,21 @@ tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
 		iph->hdr_checksum = (cksum == 0xffff) ? cksum : ~cksum;
 	}
 	if (ol_flags & PKT_TX_L4_MASK) {
-		uint16_t l4_len;
-		uint32_t cksum;
-		uint16_t *l4_cksum;
 		void *l4_hdr;
 
 		l4_hdr = packet + l2_len + l3_len;
 		if ((ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM)
-			l4_cksum = &((struct udp_hdr *)l4_hdr)->dgram_cksum;
+			*l4_cksum = &((struct udp_hdr *)l4_hdr)->dgram_cksum;
 		else if ((ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM)
-			l4_cksum = &((struct tcp_hdr *)l4_hdr)->cksum;
+			*l4_cksum = &((struct tcp_hdr *)l4_hdr)->cksum;
 		else
 			return;
-		*l4_cksum = 0;
-		if (ol_flags & PKT_TX_IPV4) {
-			struct ipv4_hdr *iph = l3_hdr;
-
-			l4_len = rte_be_to_cpu_16(iph->total_length) - l3_len;
-			cksum = rte_ipv4_phdr_cksum(l3_hdr, 0);
-		} else {
-			struct ipv6_hdr *ip6h = l3_hdr;
-
-			/* payload_len does not include ext headers */
-			l4_len = rte_be_to_cpu_16(ip6h->payload_len) -
-				l3_len + sizeof(struct ipv6_hdr);
-			cksum = rte_ipv6_phdr_cksum(l3_hdr, 0);
-		}
-		cksum += rte_raw_cksum(l4_hdr, l4_len);
-		cksum = ((cksum & 0xffff0000) >> 16) + (cksum & 0xffff);
-		cksum = (~cksum) & 0xffff;
-		if (cksum == 0)
-			cksum = 0xffff;
-		*l4_cksum = cksum;
+		**l4_cksum = 0;
+		if (ol_flags & PKT_TX_IPV4)
+			*l4_phdr_cksum = rte_ipv4_phdr_cksum(l3_hdr, 0);
+		else
+			*l4_phdr_cksum = rte_ipv6_phdr_cksum(l3_hdr, 0);
+		*l4_raw_cksum = __rte_raw_cksum(l4_hdr, l4_len, 0);
 	}
 }
 
@@ -491,17 +505,26 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
 	for (i = 0; i < nb_pkts; i++) {
 		struct rte_mbuf *mbuf = bufs[num_tx];
-		struct iovec iovecs[mbuf->nb_segs + 1];
+		struct iovec iovecs[mbuf->nb_segs + 2];
 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
 		struct rte_mbuf *seg = mbuf;
 		char m_copy[mbuf->data_len];
+		int proto;
 		int n;
 		int j;
+		int k; /* first index in iovecs for copying segments */
+		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
+		uint16_t seg_len; /* length of first segment */
+		uint16_t nb_segs;
+		uint16_t *l4_cksum; /* l4 checksum (pseudo header + payload) */
+		uint32_t l4_raw_cksum = 0; /* TCP/UDP payload raw checksum */
+		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header checksum */
+		uint16_t is_cksum = 0; /* in case cksum should be offloaded */
 
 		/* stats.errs will be incremented */
 		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
 			break;
-
+		l4_cksum = NULL;
 		/*
 		 * TUN and TAP are created with IFF_NO_PI disabled.
 		 * For TUN PMD this mandatory as fields are used by
@@ -513,34 +536,69 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		 * value 0x00 is taken for protocol field.
 		 */
 		char *buff_data = rte_pktmbuf_mtod(seg, void *);
-		j = (*buff_data & 0xf0);
-		pi.proto = (j == 0x40) ? 0x0008 :
-				(j == 0x60) ? 0xdd86 : 0x00;
-
-		iovecs[0].iov_base = &pi;
-		iovecs[0].iov_len = sizeof(pi);
-		for (j = 1; j <= mbuf->nb_segs; j++) {
-			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
-			iovecs[j].iov_base =
-				rte_pktmbuf_mtod(seg, void *);
-			seg = seg->next;
-		}
+		proto = (*buff_data & 0xf0);
+		pi.proto = (proto == 0x40) ? 0x0008 :
+				(proto == 0x60) ? 0xdd86 : 0x00;
+
+		k = 0;
+		iovecs[k].iov_base = &pi;
+		iovecs[k].iov_len = sizeof(pi);
+		k++;
+		nb_segs = mbuf->nb_segs;
 		if (txq->csum &&
 		    ((mbuf->ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4) ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM))) {
-			/* Support only packets with all data in the same seg */
-			if (mbuf->nb_segs > 1)
+			is_cksum = 1;
+			/* Support only packets with at least layer 4
+			 * header included in the first segment
+			 */
+			seg_len = rte_pktmbuf_data_len(mbuf);
+			l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
+			if (seg_len < l234_hlen)
 				break;
-			/* To change checksums, work on a copy of data. */
+
+			/* To change checksums, work on a
+			 * copy of l2, l3 l4 headers.
+			 */
 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
-				   rte_pktmbuf_data_len(mbuf));
-			tap_tx_offload(m_copy, mbuf->ol_flags,
-				       mbuf->l2_len, mbuf->l3_len);
-			iovecs[1].iov_base = m_copy;
+					l234_hlen);
+			tap_tx_l3_cksum(m_copy, mbuf->ol_flags,
+				       mbuf->l2_len, mbuf->l3_len, mbuf->l4_len,
+				       &l4_cksum, &l4_phdr_cksum,
+				       &l4_raw_cksum);
+			iovecs[k].iov_base = m_copy;
+			iovecs[k].iov_len = l234_hlen;
+			k++;
+			/* Update next iovecs[] beyond l2, l3, l4 headers */
+			if (seg_len > l234_hlen) {
+				iovecs[k].iov_len = seg_len - l234_hlen;
+				iovecs[k].iov_base =
+					rte_pktmbuf_mtod(seg, char *) +
+						l234_hlen;
+				tap_tx_l4_add_rcksum(iovecs[k].iov_base,
+					iovecs[k].iov_len, l4_cksum,
+					&l4_raw_cksum);
+				k++;
+				nb_segs++;
+			}
+			seg = seg->next;
 		}
+		for (j = k; j <= nb_segs; j++) {
+			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
+			iovecs[j].iov_base = rte_pktmbuf_mtod(seg, void *);
+			if (is_cksum)
+				tap_tx_l4_add_rcksum(iovecs[j].iov_base,
+					iovecs[j].iov_len, l4_cksum,
+					&l4_raw_cksum);
+			seg = seg->next;
+		}
+
+		if (is_cksum)
+			tap_tx_l4_cksum(l4_cksum, l4_phdr_cksum, l4_raw_cksum);
+
 		/* copy the tx frame data */
-		n = writev(txq->fd, iovecs, mbuf->nb_segs + 1);
+		n = writev(txq->fd, iovecs, j);
 		if (n <= 0)
 			break;
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v3 2/2] net/tap: support TSO (TCP Segment Offload)
  2018-05-07 21:54           ` [dpdk-dev] [PATCH v3 0/2] TAP TSO Ophir Munk
  2018-05-07 21:54             ` [dpdk-dev] [PATCH v3 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
@ 2018-05-07 21:54             ` Ophir Munk
  1 sibling, 0 replies; 31+ messages in thread
From: Ophir Munk @ 2018-05-07 21:54 UTC (permalink / raw)
  To: dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

This commit implements TCP segmentation offload in TAP.
The librte_gso library is used to segment large TCP payloads (e.g. packets
of 64K bytes) into smaller MTU-sized buffers.
By supporting TSO offload capability in software a TAP device can be used
as a failsafe sub device and be paired with another PCI device which
supports TSO capability in HW.

For more details on the librte_gso implementation, please refer to the dpdk
documentation.
The number of newly generated TCP TSO segments is limited to 64.
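
One size convention is worth spelling out: mbuf->tso_segsz counts TCP payload
bytes only, whereas librte_gso's gso_size counts headers plus payload. A sketch
of the per-packet conversion this patch performs (variable names follow the
patch):

    uint16_t hdrs_len = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
    uint16_t gso_size = mbuf->tso_segsz + hdrs_len;

    /* reject an empty payload or a segment larger than the MTU */
    if (gso_size == hdrs_len || gso_size > *txq->mtu) {
        txq->stats.errs++;
        return -1;
    }
    gso_ctx->gso_size = gso_size;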

Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 158 +++++++++++++++++++++++++++++++++++-------
 drivers/net/tap/rte_eth_tap.h |   4 ++
 mk/rte.app.mk                 |   4 +-
 4 files changed, 139 insertions(+), 29 deletions(-)

diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile
index ccc5c5f..3243365 100644
--- a/drivers/net/tap/Makefile
+++ b/drivers/net/tap/Makefile
@@ -24,7 +24,7 @@ CFLAGS += -I.
 CFLAGS += $(WERROR_FLAGS)
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
 LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash
-LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_bus_vdev -lrte_gso
 
 CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES)
 
diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index 538acae..c535418 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -17,6 +17,7 @@
 #include <rte_ip.h>
 #include <rte_string_fns.h>
 
+#include <assert.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/socket.h>
@@ -54,6 +55,9 @@
 #define ETH_TAP_CMP_MAC_FMT     "0123456789ABCDEFabcdef"
 #define ETH_TAP_MAC_ARG_FMT     ETH_TAP_MAC_FIXED "|" ETH_TAP_USR_MAC_FMT
 
+#define TAP_GSO_MBUFS_NUM	128
+#define TAP_GSO_MBUF_SEG_SIZE	128
+
 static struct rte_vdev_driver pmd_tap_drv;
 static struct rte_vdev_driver pmd_tun_drv;
 
@@ -405,7 +409,8 @@ tap_tx_offload_get_queue_capa(void)
 	return DEV_TX_OFFLOAD_MULTI_SEGS |
 	       DEV_TX_OFFLOAD_IPV4_CKSUM |
 	       DEV_TX_OFFLOAD_UDP_CKSUM |
-	       DEV_TX_OFFLOAD_TCP_CKSUM;
+	       DEV_TX_OFFLOAD_TCP_CKSUM |
+	       DEV_TX_OFFLOAD_TCP_TSO;
 }
 
 static bool
@@ -488,23 +493,15 @@ tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
 	}
 }
 
-/* Callback to handle sending packets from the tap interface
- */
-static uint16_t
-pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+static inline void
+tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs,
+			struct rte_mbuf **pmbufs, uint16_t l234_hlen,
+			uint16_t *num_packets, unsigned long *num_tx_bytes)
 {
-	struct tx_queue *txq = queue;
-	uint16_t num_tx = 0;
-	unsigned long num_tx_bytes = 0;
-	uint32_t max_size;
 	int i;
 
-	if (unlikely(nb_pkts == 0))
-		return 0;
-
-	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
-	for (i = 0; i < nb_pkts; i++) {
-		struct rte_mbuf *mbuf = bufs[num_tx];
+	for (i = 0; i < num_mbufs; i++) {
+		struct rte_mbuf *mbuf = pmbufs[i];
 		struct iovec iovecs[mbuf->nb_segs + 2];
 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
 		struct rte_mbuf *seg = mbuf;
@@ -512,8 +509,7 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		int proto;
 		int n;
 		int j;
-		int k; /* first index in iovecs for copying segments */
-		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
+		int k; /* current index in iovecs for copying segments */
 		uint16_t seg_len; /* length of first segment */
 		uint16_t nb_segs;
 		uint16_t *l4_cksum; /* l4 checksum (pseudo header + payload) */
@@ -521,9 +517,6 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header checksum */
 		uint16_t is_cksum = 0; /* in case cksum should be offloaded */
 
-		/* stats.errs will be incremented */
-		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
-			break;
 		l4_cksum = NULL;
 		/*
 		 * TUN and TAP are created with IFF_NO_PI disabled.
@@ -557,9 +550,8 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
 			if (seg_len < l234_hlen)
 				break;
-
-			/* To change checksums, work on a
-			 * copy of l2, l3 l4 headers.
+			/* To change checksums, work on a copy of l2, l3
+			 * headers + l4 pseudo header
 			 */
 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
 					l234_hlen);
@@ -601,13 +593,80 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		n = writev(txq->fd, iovecs, j);
 		if (n <= 0)
 			break;
+		(*num_packets)++;
+		(*num_tx_bytes) += rte_pktmbuf_pkt_len(mbuf);
+	}
+}
 
+/* Callback to handle sending packets from the tap interface
+ */
+static uint16_t
+pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct tx_queue *txq = queue;
+	uint16_t num_tx = 0;
+	uint16_t num_packets = 0;
+	unsigned long num_tx_bytes = 0;
+	uint16_t tso_segsz = 0;
+	uint16_t hdrs_len;
+	uint32_t max_size;
+	int i;
+	uint64_t tso;
+	int ret;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS];
+	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
+	for (i = 0; i < nb_pkts; i++) {
+		struct rte_mbuf *mbuf_in = bufs[num_tx];
+		struct rte_mbuf **mbuf;
+		uint16_t num_mbufs;
+
+		tso = mbuf_in->ol_flags & PKT_TX_TCP_SEG;
+		if (tso) {
+			struct rte_gso_ctx *gso_ctx = &txq->gso_ctx;
+			assert(gso_ctx != NULL);
+			/* TCP segmentation implies TCP checksum offload */
+			mbuf_in->ol_flags |= PKT_TX_TCP_CKSUM;
+			/* gso size is calculated without ETHER_CRC_LEN */
+			hdrs_len = mbuf_in->l2_len + mbuf_in->l3_len +
+					mbuf_in->l4_len;
+			tso_segsz = mbuf_in->tso_segsz + hdrs_len;
+			if (unlikely(tso_segsz == hdrs_len) ||
+				tso_segsz > *txq->mtu) {
+				txq->stats.errs++;
+				break;
+			}
+			gso_ctx->gso_size = tso_segsz;
+			ret = rte_gso_segment(mbuf_in, /* packet to segment */
+				gso_ctx, /* gso control block */
+				(struct rte_mbuf **)&gso_mbufs, /* out mbufs */
+				RTE_DIM(gso_mbufs)); /* max tso mbufs */
+
+			/* ret contains the number of newly created mbufs */
+			if (ret < 0)
+				break;
+
+			mbuf = gso_mbufs;
+			num_mbufs = ret;
+		} else {
+			/* stats.errs will be incremented */
+			if (rte_pktmbuf_pkt_len(mbuf_in) > max_size)
+				break;
+
+			mbuf = &mbuf_in;
+			num_mbufs = 1;
+		}
+
+		tap_write_mbufs(txq, num_mbufs, mbuf, hdrs_len,
+				&num_packets, &num_tx_bytes);
 		num_tx++;
-		num_tx_bytes += mbuf->pkt_len;
-		rte_pktmbuf_free(mbuf);
+		rte_pktmbuf_free(mbuf_in);
 	}
 
-	txq->stats.opackets += num_tx;
+	txq->stats.opackets += num_packets;
 	txq->stats.errs += nb_pkts - num_tx;
 	txq->stats.obytes += num_tx_bytes;
 
@@ -1060,31 +1119,73 @@ tap_mac_set(struct rte_eth_dev *dev, struct ether_addr *mac_addr)
 }
 
 static int
+tap_gso_ctx_setup(struct rte_gso_ctx *gso_ctx, struct rte_eth_dev *dev)
+{
+	uint32_t gso_types;
+	char pool_name[64];
+
+	/*
+	 * Create a private mbuf pool with TAP_GSO_MBUF_SEG_SIZE data
+	 * bytes per mbuf; use this pool for both direct and indirect mbufs
+	 */
+
+	struct rte_mempool *mp;      /* Mempool for GSO packets */
+	/* initialize GSO context */
+	gso_types = DEV_TX_OFFLOAD_TCP_TSO;
+	snprintf(pool_name, sizeof(pool_name), "mp_%s", dev->device->name);
+	mp = rte_mempool_lookup((const char *)pool_name);
+	if (!mp) {
+		mp = rte_pktmbuf_pool_create(pool_name, TAP_GSO_MBUFS_NUM,
+			0, 0, RTE_PKTMBUF_HEADROOM + TAP_GSO_MBUF_SEG_SIZE,
+			SOCKET_ID_ANY);
+		if (!mp) {
+			struct pmd_internals *pmd = dev->data->dev_private;
+			RTE_LOG(DEBUG, PMD, "%s: failed to create mbuf pool for device %s\n",
+				pmd->name, dev->device->name);
+			return -1;
+		}
+	}
+
+	gso_ctx->direct_pool = mp;
+	gso_ctx->indirect_pool = mp;
+	gso_ctx->gso_types = gso_types;
+	gso_ctx->gso_size = 0; /* gso_size is set in tx_burst() per packet */
+	gso_ctx->flag = 0;
+
+	return 0;
+}
+
+static int
 tap_setup_queue(struct rte_eth_dev *dev,
 		struct pmd_internals *internals,
 		uint16_t qid,
 		int is_rx)
 {
+	int ret;
 	int *fd;
 	int *other_fd;
 	const char *dir;
 	struct pmd_internals *pmd = dev->data->dev_private;
 	struct rx_queue *rx = &internals->rxq[qid];
 	struct tx_queue *tx = &internals->txq[qid];
+	struct rte_gso_ctx *gso_ctx;
 
 	if (is_rx) {
 		fd = &rx->fd;
 		other_fd = &tx->fd;
 		dir = "rx";
+		gso_ctx = NULL;
 	} else {
 		fd = &tx->fd;
 		other_fd = &rx->fd;
 		dir = "tx";
+		gso_ctx = &tx->gso_ctx;
 	}
 	if (*fd != -1) {
 		/* fd for this queue already exists */
 		TAP_LOG(DEBUG, "%s: fd %d for %s queue qid %d exists",
 			pmd->name, *fd, dir, qid);
+		gso_ctx = NULL;
 	} else if (*other_fd != -1) {
 		/* Only other_fd exists. dup it */
 		*fd = dup(*other_fd);
@@ -1109,6 +1210,11 @@ tap_setup_queue(struct rte_eth_dev *dev,
 
 	tx->mtu = &dev->data->mtu;
 	rx->rxmode = &dev->data->dev_conf.rxmode;
+	if (gso_ctx) {
+		ret = tap_gso_ctx_setup(gso_ctx, dev);
+		if (ret)
+			return -1;
+	}
 
 	return *fd;
 }
diff --git a/drivers/net/tap/rte_eth_tap.h b/drivers/net/tap/rte_eth_tap.h
index 67c9d4b..babe42d 100644
--- a/drivers/net/tap/rte_eth_tap.h
+++ b/drivers/net/tap/rte_eth_tap.h
@@ -15,6 +15,7 @@
 
 #include <rte_ethdev_driver.h>
 #include <rte_ether.h>
+#include <rte_gso.h>
 #include "tap_log.h"
 
 #ifdef IFF_MULTI_QUEUE
@@ -23,6 +24,8 @@
 #define RTE_PMD_TAP_MAX_QUEUES	1
 #endif
 
+#define MAX_GSO_MBUFS 64
+
 struct pkt_stats {
 	uint64_t opackets;              /* Number of output packets */
 	uint64_t ipackets;              /* Number of input packets */
@@ -51,6 +54,7 @@ struct tx_queue {
 	uint16_t *mtu;                  /* Pointer to MTU from dev_data */
 	uint16_t csum:1;                /* Enable checksum offloading */
 	struct pkt_stats stats;         /* Stats for this TX queue */
+	struct rte_gso_ctx gso_ctx;     /* GSO context */
 };
 
 struct pmd_internals {
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 29a2a60..62819a4 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -66,8 +66,6 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PORT)           += -lrte_port
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PDUMP)          += -lrte_pdump
 _LDLIBS-$(CONFIG_RTE_LIBRTE_DISTRIBUTOR)    += -lrte_distributor
 _LDLIBS-$(CONFIG_RTE_LIBRTE_IP_FRAG)        += -lrte_ip_frag
-_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
-_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METER)          += -lrte_meter
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LPM)            += -lrte_lpm
 # librte_acl needs --whole-archive because of weak functions
@@ -85,6 +83,8 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_EFD)            += -lrte_efd
 _LDLIBS-y += --whole-archive
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_CFGFILE)        += -lrte_cfgfile
+_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
+_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
 _LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)           += -lrte_hash
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MEMBER)         += -lrte_member
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)          += -lrte_vhost
-- 
2.7.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets
  2018-04-22 11:30         ` [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
  2018-05-07 21:54           ` [dpdk-dev] [PATCH v3 0/2] TAP TSO Ophir Munk
@ 2018-05-31 13:52           ` Ferruh Yigit
  2018-05-31 13:54             ` Ferruh Yigit
  1 sibling, 1 reply; 31+ messages in thread
From: Ferruh Yigit @ 2018-05-31 13:52 UTC (permalink / raw)
  To: Ophir Munk, dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern

On 4/22/2018 12:30 PM, Ophir Munk wrote:
> Prior to this commit, IP/UDP/TCP checksum offload calculations
> were skipped in the case of a multi-segment packet.
> This commit enables TAP checksum calculations for multi-segment
> packets.
> The only restriction is that the first segment must contain the
> headers of layers 3 (IP) and 4 (UDP or TCP).
> 
> Signed-off-by: Ophir Munk <ophirmu@mellanox.com>

Hi Ophir,

Can you please rebase the patch on top of the latest master? It doesn't apply cleanly.

This is a feature from the previous release; please send updates early so that we
can get this into this release early.

Thanks,
ferruh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets
  2018-05-31 13:52           ` [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets Ferruh Yigit
@ 2018-05-31 13:54             ` Ferruh Yigit
  0 siblings, 0 replies; 31+ messages in thread
From: Ferruh Yigit @ 2018-05-31 13:54 UTC (permalink / raw)
  To: Ophir Munk, dev, Pascal Mazon; +Cc: Thomas Monjalon, Olga Shern

On 5/31/2018 2:52 PM, Ferruh Yigit wrote:
> On 4/22/2018 12:30 PM, Ophir Munk wrote:
>> Prior to this commit, IP/UDP/TCP checksum offload calculations
>> were skipped in the case of a multi-segment packet.
>> This commit enables TAP checksum calculations for multi-segment
>> packets.
>> The only restriction is that the first segment must contain the
>> headers of layers 3 (IP) and 4 (UDP or TCP).
>>
>> Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
> 
> Hi Ophir,
> 
> Can you please rebase the patch on top of the latest master? It doesn't apply cleanly.
> 
> This is a feature from the previous release; please send updates early so that we
> can get this into this release early.

Oops, I replied to v2 instead of v3. But I tested the latest version, v3, and a v4 is needed.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v4 0/2] TAP TSO
  2018-03-09 21:10 ` [dpdk-dev] [RFC 1/2] net/tap: calculate checksum for multi segs packets Ophir Munk
  2018-04-09 22:33   ` [dpdk-dev] [PATCH v1 0/2] TAP TSO Ophir Munk
@ 2018-06-12 16:31   ` Ophir Munk
  2018-06-12 16:31     ` [dpdk-dev] [PATCH v4 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
  2018-06-12 16:31     ` [dpdk-dev] [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  1 sibling, 2 replies; 31+ messages in thread
From: Ophir Munk @ 2018-06-12 16:31 UTC (permalink / raw)
  To: dev, Pascal Mazon, Keith Wiles; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

v1: 
- Initial release

v2: 
- Fixing cksum errors
- TCP segment size refers to TCP payload size (not including l2,l3,l4 headers)

v3 (8 May 2018):
- Bug fixing in case input mbuf is segmented
- Following review comments by Raslan Darawsheh

This patch implements TAP TSO (TCP segmentation offload) in SW.
It uses the dpdk library librte_gso.
The dpdk librte_gso library segments large TCP payloads (e.g. 64K bytes)
into smaller buffers.
By supporting TSO offload capability in software a TAP device can be used
as a failsafe sub device and be paired with another PCI device which
supports TSO capability in HW.

This patch includes 2 commits:
1. Calculation of IP/TCP/UDP checksums for multi segments packets.
Previously checksum offload was skipped if the number of packet segments
was greater than 1.
This commit removes this limitation. It is required before supporting TAP TSO,
since a TCP packet generated by TSO may be composed of two segments, where the
first segment includes the l2,l3,l4 headers.
2. TAP TSO implementation: calling rte_gso_segment() to segment large TCP packets.
This commit creates a small private mbuf pool in the TAP PMD, required by librte_gso.
The number of buffers will be 64, each 128 bytes in length.
The TSO segment size refers to the TCP payload size (not including l2,l3,l4 headers).
librte_gso supports TCP segmentation above IPv4.
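
The private pool is looked up or created once per device, along the lines of
the following sketch (illustrative; names follow the patch):

    char pool_name[64];
    struct rte_mempool *mp;

    snprintf(pool_name, sizeof(pool_name), "mp_%s", dev->device->name);
    mp = rte_mempool_lookup(pool_name);
    if (mp == NULL)
        mp = rte_pktmbuf_pool_create(pool_name,
            TAP_GSO_MBUFS_NUM,                  /* number of mbufs */
            0, 0,                               /* no cache, no private area */
            RTE_PKTMBUF_HEADROOM + TAP_GSO_MBUF_SEG_SIZE,
            SOCKET_ID_ANY);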

The series was marked as suppressed before the 18.05 release in order to include
it in 18.08.

v4 (12 Jun 2018):
Updates following a rebase on top of v18.05


Ophir Munk (2):
  net/tap: calculate checksums of multi segs packets
  net/tap: support TSO (TCP Segment Offload)

 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 309 ++++++++++++++++++++++++++++++++----------
 drivers/net/tap/rte_eth_tap.h |   3 +
 mk/rte.app.mk                 |   4 +-
 4 files changed, 244 insertions(+), 74 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v4 1/2] net/tap: calculate checksums of multi segs packets
  2018-06-12 16:31   ` [dpdk-dev] [PATCH v4 0/2] TAP TSO Ophir Munk
@ 2018-06-12 16:31     ` Ophir Munk
  2018-06-12 17:17       ` Wiles, Keith
  2018-06-12 16:31     ` [dpdk-dev] [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  1 sibling, 1 reply; 31+ messages in thread
From: Ophir Munk @ 2018-06-12 16:31 UTC (permalink / raw)
  To: dev, Pascal Mazon, Keith Wiles; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

Prior to this commit, IP/UDP/TCP checksum offload calculations
were skipped in the case of a multi-segment packet.
This commit enables TAP checksum calculations for multi-segment
packets.
The only restriction is that the first segment must contain the
headers of layers 3 (IP) and 4 (UDP or TCP).
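
The restriction amounts to a single guard before the headers are copied out
for rewriting. A sketch, using the patch's variable names:

    uint16_t seg_len = rte_pktmbuf_data_len(mbuf); /* first segment length */
    uint16_t l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;

    /* the l2 + l3 + l4 headers must all fit in the first segment */
    if (seg_len < l234_hlen)
        break; /* packet is dropped and counted in stats.errs */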

Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/rte_eth_tap.c | 158 +++++++++++++++++++++++++++++-------------
 1 file changed, 110 insertions(+), 48 deletions(-)

diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index df396bf..c19f053 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -415,12 +415,43 @@ tap_tx_offload_get_queue_capa(void)
 	       DEV_TX_OFFLOAD_TCP_CKSUM;
 }
 
+/* Finalize l4 checksum calculation */
 static void
-tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
-	       unsigned int l3_len)
+tap_tx_l4_cksum(uint16_t *l4_cksum, uint16_t l4_phdr_cksum,
+		uint32_t l4_raw_cksum)
 {
-	void *l3_hdr = packet + l2_len;
+	if (l4_cksum) {
+		uint32_t cksum;
+
+		cksum = __rte_raw_cksum_reduce(l4_raw_cksum);
+		cksum += l4_phdr_cksum;
+
+		cksum = ((cksum & 0xffff0000) >> 16) + (cksum & 0xffff);
+		cksum = (~cksum) & 0xffff;
+		if (cksum == 0)
+			cksum = 0xffff;
+		*l4_cksum = cksum;
+	}
+}
 
+/* Accumulate L4 raw checksums */
+static void
+tap_tx_l4_add_rcksum(char *l4_data, unsigned int l4_len, uint16_t *l4_cksum,
+			uint32_t *l4_raw_cksum)
+{
+	if (l4_cksum == NULL)
+		return;
+
+	*l4_raw_cksum = __rte_raw_cksum(l4_data, l4_len, *l4_raw_cksum);
+}
+
+/* L3 and L4 pseudo headers checksum offloads */
+static void
+tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
+		unsigned int l3_len, unsigned int l4_len, uint16_t **l4_cksum,
+		uint16_t *l4_phdr_cksum, uint32_t *l4_raw_cksum)
+{
+	void *l3_hdr = packet + l2_len;
 	if (ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4)) {
 		struct ipv4_hdr *iph = l3_hdr;
 		uint16_t cksum;
@@ -430,38 +461,21 @@ tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
 		iph->hdr_checksum = (cksum == 0xffff) ? cksum : ~cksum;
 	}
 	if (ol_flags & PKT_TX_L4_MASK) {
-		uint16_t l4_len;
-		uint32_t cksum;
-		uint16_t *l4_cksum;
 		void *l4_hdr;
 
 		l4_hdr = packet + l2_len + l3_len;
 		if ((ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM)
-			l4_cksum = &((struct udp_hdr *)l4_hdr)->dgram_cksum;
+			*l4_cksum = &((struct udp_hdr *)l4_hdr)->dgram_cksum;
 		else if ((ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM)
-			l4_cksum = &((struct tcp_hdr *)l4_hdr)->cksum;
+			*l4_cksum = &((struct tcp_hdr *)l4_hdr)->cksum;
 		else
 			return;
-		*l4_cksum = 0;
-		if (ol_flags & PKT_TX_IPV4) {
-			struct ipv4_hdr *iph = l3_hdr;
-
-			l4_len = rte_be_to_cpu_16(iph->total_length) - l3_len;
-			cksum = rte_ipv4_phdr_cksum(l3_hdr, 0);
-		} else {
-			struct ipv6_hdr *ip6h = l3_hdr;
-
-			/* payload_len does not include ext headers */
-			l4_len = rte_be_to_cpu_16(ip6h->payload_len) -
-				l3_len + sizeof(struct ipv6_hdr);
-			cksum = rte_ipv6_phdr_cksum(l3_hdr, 0);
-		}
-		cksum += rte_raw_cksum(l4_hdr, l4_len);
-		cksum = ((cksum & 0xffff0000) >> 16) + (cksum & 0xffff);
-		cksum = (~cksum) & 0xffff;
-		if (cksum == 0)
-			cksum = 0xffff;
-		*l4_cksum = cksum;
+		**l4_cksum = 0;
+		if (ol_flags & PKT_TX_IPV4)
+			*l4_phdr_cksum = rte_ipv4_phdr_cksum(l3_hdr, 0);
+		else
+			*l4_phdr_cksum = rte_ipv6_phdr_cksum(l3_hdr, 0);
+		*l4_raw_cksum = __rte_raw_cksum(l4_hdr, l4_len, 0);
 	}
 }
 
@@ -482,17 +496,27 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
 	for (i = 0; i < nb_pkts; i++) {
 		struct rte_mbuf *mbuf = bufs[num_tx];
-		struct iovec iovecs[mbuf->nb_segs + 1];
+		struct iovec iovecs[mbuf->nb_segs + 2];
 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
 		struct rte_mbuf *seg = mbuf;
 		char m_copy[mbuf->data_len];
+		int proto;
 		int n;
 		int j;
+		int k; /* first index in iovecs for copying segments */
+		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
+		uint16_t seg_len; /* length of first segment */
+		uint16_t nb_segs;
+		uint16_t *l4_cksum; /* l4 checksum (pseudo header + payload) */
+		uint32_t l4_raw_cksum = 0; /* TCP/UDP payload raw checksum */
+		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header checksum */
+		uint16_t is_cksum = 0; /* in case cksum should be offloaded */
 
 		/* stats.errs will be incremented */
 		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
 			break;
 
+		l4_cksum = NULL;
 		if (txq->type == ETH_TUNTAP_TYPE_TUN) {
 			/*
 			 * TUN and TAP are created with IFF_NO_PI disabled.
@@ -505,35 +529,73 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			 * is 4 or 6, then protocol field is updated.
 			 */
 			char *buff_data = rte_pktmbuf_mtod(seg, void *);
-			j = (*buff_data & 0xf0);
-			pi.proto = (j == 0x40) ? rte_cpu_to_be_16(ETHER_TYPE_IPv4) :
-				(j == 0x60) ? rte_cpu_to_be_16(ETHER_TYPE_IPv6) : 0x00;
+			proto = (*buff_data & 0xf0);
+			pi.proto = (proto == 0x40) ?
+				rte_cpu_to_be_16(ETHER_TYPE_IPv4) :
+				((proto == 0x60) ?
+					rte_cpu_to_be_16(ETHER_TYPE_IPv6) :
+					0x00);
 		}
 
-		iovecs[0].iov_base = &pi;
-		iovecs[0].iov_len = sizeof(pi);
-		for (j = 1; j <= mbuf->nb_segs; j++) {
-			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
-			iovecs[j].iov_base =
-				rte_pktmbuf_mtod(seg, void *);
-			seg = seg->next;
-		}
+		k = 0;
+		iovecs[k].iov_base = &pi;
+		iovecs[k].iov_len = sizeof(pi);
+		k++;
+		nb_segs = mbuf->nb_segs;
 		if (txq->csum &&
 		    ((mbuf->ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4) ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM))) {
-			/* Support only packets with all data in the same seg */
-			if (mbuf->nb_segs > 1)
+			is_cksum = 1;
+			/* Support only packets with at least layer 4
+			 * header included in the first segment
+			 */
+			seg_len = rte_pktmbuf_data_len(mbuf);
+			l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
+			if (seg_len < l234_hlen)
 				break;
-			/* To change checksums, work on a copy of data. */
+
+			/* To change checksums, work on a
+			 * copy of l2, l3 l4 headers.
+			 */
 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
-				   rte_pktmbuf_data_len(mbuf));
-			tap_tx_offload(m_copy, mbuf->ol_flags,
-				       mbuf->l2_len, mbuf->l3_len);
-			iovecs[1].iov_base = m_copy;
+					l234_hlen);
+			tap_tx_l3_cksum(m_copy, mbuf->ol_flags,
+				       mbuf->l2_len, mbuf->l3_len, mbuf->l4_len,
+				       &l4_cksum, &l4_phdr_cksum,
+				       &l4_raw_cksum);
+			iovecs[k].iov_base = m_copy;
+			iovecs[k].iov_len = l234_hlen;
+			k++;
+			/* Update next iovecs[] beyond l2, l3, l4 headers */
+			if (seg_len > l234_hlen) {
+				iovecs[k].iov_len = seg_len - l234_hlen;
+				iovecs[k].iov_base =
+					rte_pktmbuf_mtod(seg, char *) +
+						l234_hlen;
+				tap_tx_l4_add_rcksum(iovecs[k].iov_base,
+					iovecs[k].iov_len, l4_cksum,
+					&l4_raw_cksum);
+				k++;
+				nb_segs++;
+			}
+			seg = seg->next;
 		}
+		for (j = k; j <= nb_segs; j++) {
+			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
+			iovecs[j].iov_base = rte_pktmbuf_mtod(seg, void *);
+			if (is_cksum)
+				tap_tx_l4_add_rcksum(iovecs[j].iov_base,
+					iovecs[j].iov_len, l4_cksum,
+					&l4_raw_cksum);
+			seg = seg->next;
+		}
+
+		if (is_cksum)
+			tap_tx_l4_cksum(l4_cksum, l4_phdr_cksum, l4_raw_cksum);
+
 		/* copy the tx frame data */
-		n = writev(txq->fd, iovecs, mbuf->nb_segs + 1);
+		n = writev(txq->fd, iovecs, j);
 		if (n <= 0)
 			break;
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload)
  2018-06-12 16:31   ` [dpdk-dev] [PATCH v4 0/2] TAP TSO Ophir Munk
  2018-06-12 16:31     ` [dpdk-dev] [PATCH v4 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
@ 2018-06-12 16:31     ` Ophir Munk
  2018-06-12 17:22       ` Wiles, Keith
                         ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Ophir Munk @ 2018-06-12 16:31 UTC (permalink / raw)
  To: dev, Pascal Mazon, Keith Wiles; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

This commit implements TCP segmentation offload in TAP.
The librte_gso library is used to segment large TCP payloads (e.g. packets
of 64K bytes) into smaller MTU-sized buffers.
By supporting TSO offload capability in software a TAP device can be used
as a failsafe sub device and be paired with another PCI device which
supports TSO capability in HW.

For more details on the librte_gso implementation, please refer to the dpdk
documentation.
The number of newly generated TCP TSO segments is limited to 64.
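
The 64-segment cap is enforced simply by the size of the output array handed
to the library, as in this sketch (names follow the patch):

    struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS]; /* MAX_GSO_MBUFS == 64 */
    int ret;

    ret = rte_gso_segment(mbuf_in, gso_ctx, gso_mbufs,
                          RTE_DIM(gso_mbufs));
    if (ret < 0)
        break; /* rte_gso_segment could not segment the packet */
    /* ret is the number of newly created mbufs to write out */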

Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 159 +++++++++++++++++++++++++++++++++++-------
 drivers/net/tap/rte_eth_tap.h |   3 +
 mk/rte.app.mk                 |   4 +-
 4 files changed, 138 insertions(+), 30 deletions(-)

diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile
index ccc5c5f..3243365 100644
--- a/drivers/net/tap/Makefile
+++ b/drivers/net/tap/Makefile
@@ -24,7 +24,7 @@ CFLAGS += -I.
 CFLAGS += $(WERROR_FLAGS)
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
 LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash
-LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_bus_vdev -lrte_gso
 
 CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES)
 
diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index c19f053..62b931f 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -17,6 +17,7 @@
 #include <rte_ip.h>
 #include <rte_string_fns.h>
 
+#include <assert.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/socket.h>
@@ -55,6 +56,9 @@
 #define ETH_TAP_CMP_MAC_FMT     "0123456789ABCDEFabcdef"
 #define ETH_TAP_MAC_ARG_FMT     ETH_TAP_MAC_FIXED "|" ETH_TAP_USR_MAC_FMT
 
+#define TAP_GSO_MBUFS_NUM	128
+#define TAP_GSO_MBUF_SEG_SIZE	128
+
 static struct rte_vdev_driver pmd_tap_drv;
 static struct rte_vdev_driver pmd_tun_drv;
 
@@ -412,7 +416,8 @@ tap_tx_offload_get_queue_capa(void)
 	return DEV_TX_OFFLOAD_MULTI_SEGS |
 	       DEV_TX_OFFLOAD_IPV4_CKSUM |
 	       DEV_TX_OFFLOAD_UDP_CKSUM |
-	       DEV_TX_OFFLOAD_TCP_CKSUM;
+	       DEV_TX_OFFLOAD_TCP_CKSUM |
+	       DEV_TX_OFFLOAD_TCP_TSO;
 }
 
 /* Finalize l4 checksum calculation */
@@ -479,23 +484,15 @@ tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
 	}
 }
 
-/* Callback to handle sending packets from the tap interface
- */
-static uint16_t
-pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+static inline void
+tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs,
+			struct rte_mbuf **pmbufs, uint16_t l234_hlen,
+			uint16_t *num_packets, unsigned long *num_tx_bytes)
 {
-	struct tx_queue *txq = queue;
-	uint16_t num_tx = 0;
-	unsigned long num_tx_bytes = 0;
-	uint32_t max_size;
 	int i;
 
-	if (unlikely(nb_pkts == 0))
-		return 0;
-
-	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
-	for (i = 0; i < nb_pkts; i++) {
-		struct rte_mbuf *mbuf = bufs[num_tx];
+	for (i = 0; i < num_mbufs; i++) {
+		struct rte_mbuf *mbuf = pmbufs[i];
 		struct iovec iovecs[mbuf->nb_segs + 2];
 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
 		struct rte_mbuf *seg = mbuf;
@@ -503,8 +500,7 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		int proto;
 		int n;
 		int j;
-		int k; /* first index in iovecs for copying segments */
-		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
+		int k; /* current index in iovecs for copying segments */
 		uint16_t seg_len; /* length of first segment */
 		uint16_t nb_segs;
 		uint16_t *l4_cksum; /* l4 checksum (pseudo header + payload) */
@@ -512,10 +508,6 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header checksum */
 		uint16_t is_cksum = 0; /* in case cksum should be offloaded */
 
-		/* stats.errs will be incremented */
-		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
-			break;
-
 		l4_cksum = NULL;
 		if (txq->type == ETH_TUNTAP_TYPE_TUN) {
 			/*
@@ -554,9 +546,8 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
 			if (seg_len < l234_hlen)
 				break;
-
-			/* To change checksums, work on a
-			 * copy of l2, l3 l4 headers.
+			/* To change checksums, work on a copy of l2, l3
+			 * headers + l4 pseudo header
 			 */
 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
 					l234_hlen);
@@ -598,13 +589,80 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		n = writev(txq->fd, iovecs, j);
 		if (n <= 0)
 			break;
+		(*num_packets)++;
+		(*num_tx_bytes) += rte_pktmbuf_pkt_len(mbuf);
+	}
+}
+
+/* Callback to handle sending packets from the tap interface
+ */
+static uint16_t
+pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct tx_queue *txq = queue;
+	uint16_t num_tx = 0;
+	uint16_t num_packets = 0;
+	unsigned long num_tx_bytes = 0;
+	uint16_t tso_segsz = 0;
+	uint16_t hdrs_len;
+	uint32_t max_size;
+	int i;
+	uint64_t tso;
+	int ret;
 
+	if (unlikely(nb_pkts == 0))
+		return 0;
+
+	struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS];
+	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
+	for (i = 0; i < nb_pkts; i++) {
+		struct rte_mbuf *mbuf_in = bufs[num_tx];
+		struct rte_mbuf **mbuf;
+		uint16_t num_mbufs;
+
+		tso = mbuf_in->ol_flags & PKT_TX_TCP_SEG;
+		if (tso) {
+			struct rte_gso_ctx *gso_ctx = &txq->gso_ctx;
+			assert(gso_ctx != NULL);
+			/* TCP segmentation implies TCP checksum offload */
+			mbuf_in->ol_flags |= PKT_TX_TCP_CKSUM;
+			/* gso size is calculated without ETHER_CRC_LEN */
+			hdrs_len = mbuf_in->l2_len + mbuf_in->l3_len +
+					mbuf_in->l4_len;
+			tso_segsz = mbuf_in->tso_segsz + hdrs_len;
+			if (unlikely(tso_segsz == hdrs_len) ||
+				tso_segsz > *txq->mtu) {
+				txq->stats.errs++;
+				break;
+			}
+			gso_ctx->gso_size = tso_segsz;
+			ret = rte_gso_segment(mbuf_in, /* packet to segment */
+				gso_ctx, /* gso control block */
+				(struct rte_mbuf **)&gso_mbufs, /* out mbufs */
+				RTE_DIM(gso_mbufs)); /* max tso mbufs */
+
+			/* ret contains the number of newly created mbufs */
+			if (ret < 0)
+				break;
+
+			mbuf = gso_mbufs;
+			num_mbufs = ret;
+		} else {
+			/* stats.errs will be incremented */
+			if (rte_pktmbuf_pkt_len(mbuf_in) > max_size)
+				break;
+
+			mbuf = &mbuf_in;
+			num_mbufs = 1;
+		}
+
+		tap_write_mbufs(txq, num_mbufs, mbuf, hdrs_len,
+				&num_packets, &num_tx_bytes);
 		num_tx++;
-		num_tx_bytes += mbuf->pkt_len;
-		rte_pktmbuf_free(mbuf);
+		rte_pktmbuf_free(mbuf_in);
 	}
 
-	txq->stats.opackets += num_tx;
+	txq->stats.opackets += num_packets;
 	txq->stats.errs += nb_pkts - num_tx;
 	txq->stats.obytes += num_tx_bytes;
 
@@ -1066,31 +1124,73 @@ tap_mac_set(struct rte_eth_dev *dev, struct ether_addr *mac_addr)
 }
 
 static int
+tap_gso_ctx_setup(struct rte_gso_ctx *gso_ctx, struct rte_eth_dev *dev)
+{
+	uint32_t gso_types;
+	char pool_name[64];
+
+	/*
+	 * Create a private mbuf pool with TAP_GSO_MBUF_SEG_SIZE data
+	 * bytes per mbuf; use this pool for both direct and indirect mbufs
+	 */
+
+	struct rte_mempool *mp;      /* Mempool for GSO packets */
+	/* initialize GSO context */
+	gso_types = DEV_TX_OFFLOAD_TCP_TSO;
+	snprintf(pool_name, sizeof(pool_name), "mp_%s", dev->device->name);
+	mp = rte_mempool_lookup((const char *)pool_name);
+	if (!mp) {
+		mp = rte_pktmbuf_pool_create(pool_name, TAP_GSO_MBUFS_NUM,
+			0, 0, RTE_PKTMBUF_HEADROOM + TAP_GSO_MBUF_SEG_SIZE,
+			SOCKET_ID_ANY);
+		if (!mp) {
+			struct pmd_internals *pmd = dev->data->dev_private;
+			RTE_LOG(DEBUG, PMD, "%s: failed to create mbuf pool for device %s\n",
+				pmd->name, dev->device->name);
+			return -1;
+		}
+	}
+
+	gso_ctx->direct_pool = mp;
+	gso_ctx->indirect_pool = mp;
+	gso_ctx->gso_types = gso_types;
+	gso_ctx->gso_size = 0; /* gso_size is set in tx_burst() per packet */
+	gso_ctx->flag = 0;
+
+	return 0;
+}
+
+static int
 tap_setup_queue(struct rte_eth_dev *dev,
 		struct pmd_internals *internals,
 		uint16_t qid,
 		int is_rx)
 {
+	int ret;
 	int *fd;
 	int *other_fd;
 	const char *dir;
 	struct pmd_internals *pmd = dev->data->dev_private;
 	struct rx_queue *rx = &internals->rxq[qid];
 	struct tx_queue *tx = &internals->txq[qid];
+	struct rte_gso_ctx *gso_ctx;
 
 	if (is_rx) {
 		fd = &rx->fd;
 		other_fd = &tx->fd;
 		dir = "rx";
+		gso_ctx = NULL;
 	} else {
 		fd = &tx->fd;
 		other_fd = &rx->fd;
 		dir = "tx";
+		gso_ctx = &tx->gso_ctx;
 	}
 	if (*fd != -1) {
 		/* fd for this queue already exists */
 		TAP_LOG(DEBUG, "%s: fd %d for %s queue qid %d exists",
 			pmd->name, *fd, dir, qid);
+		gso_ctx = NULL;
 	} else if (*other_fd != -1) {
 		/* Only other_fd exists. dup it */
 		*fd = dup(*other_fd);
@@ -1115,6 +1215,11 @@ tap_setup_queue(struct rte_eth_dev *dev,
 
 	tx->mtu = &dev->data->mtu;
 	rx->rxmode = &dev->data->dev_conf.rxmode;
+	if (gso_ctx) {
+		ret = tap_gso_ctx_setup(gso_ctx, dev);
+		if (ret)
+			return -1;
+	}
 
 	tx->type = pmd->type;
 
diff --git a/drivers/net/tap/rte_eth_tap.h b/drivers/net/tap/rte_eth_tap.h
index 7b21d0d..44e2773 100644
--- a/drivers/net/tap/rte_eth_tap.h
+++ b/drivers/net/tap/rte_eth_tap.h
@@ -15,6 +15,7 @@
 
 #include <rte_ethdev_driver.h>
 #include <rte_ether.h>
+#include <rte_gso.h>
 #include "tap_log.h"
 
 #ifdef IFF_MULTI_QUEUE
@@ -22,6 +23,7 @@
 #else
 #define RTE_PMD_TAP_MAX_QUEUES	1
 #endif
+#define MAX_GSO_MBUFS 64
 
 enum rte_tuntap_type {
 	ETH_TUNTAP_TYPE_UNKNOWN,
@@ -59,6 +61,7 @@ struct tx_queue {
 	uint16_t *mtu;                  /* Pointer to MTU from dev_data */
 	uint16_t csum:1;                /* Enable checksum offloading */
 	struct pkt_stats stats;         /* Stats for this TX queue */
+	struct rte_gso_ctx gso_ctx;     /* GSO context */
 };
 
 struct pmd_internals {
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 1e32c83..e2ee879 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -38,8 +38,6 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PORT)           += -lrte_port
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PDUMP)          += -lrte_pdump
 _LDLIBS-$(CONFIG_RTE_LIBRTE_DISTRIBUTOR)    += -lrte_distributor
 _LDLIBS-$(CONFIG_RTE_LIBRTE_IP_FRAG)        += -lrte_ip_frag
-_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
-_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METER)          += -lrte_meter
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LPM)            += -lrte_lpm
 # librte_acl needs --whole-archive because of weak functions
@@ -61,6 +59,8 @@ endif
 _LDLIBS-y += --whole-archive
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_CFGFILE)        += -lrte_cfgfile
+_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
+_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
 _LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)           += -lrte_hash
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MEMBER)         += -lrte_member
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)          += -lrte_vhost
-- 
2.7.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] net/tap: calculate checksums of multi segs packets
  2018-06-12 16:31     ` [dpdk-dev] [PATCH v4 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
@ 2018-06-12 17:17       ` Wiles, Keith
  0 siblings, 0 replies; 31+ messages in thread
From: Wiles, Keith @ 2018-06-12 17:17 UTC (permalink / raw)
  To: Ophir Munk; +Cc: dev, Pascal Mazon, Thomas Monjalon, Olga Shern

I have noticed a few formatting problems. We can review the code logic in a meeting.

> On Jun 12, 2018, at 11:31 AM, Ophir Munk <ophirmu@mellanox.com> wrote:
> 
> Prior to this commit, IP/UDP/TCP checksum offload calculations
> were skipped in the case of a multi-segment packet.
> This commit enables TAP checksum calculations for multi-segment
> packets.
> The only restriction is that the first segment must contain the
> headers of layers 3 (IP) and 4 (UDP or TCP).
> 
> Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
> Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
> ---
> drivers/net/tap/rte_eth_tap.c | 158 +++++++++++++++++++++++++++++-------------
> 1 file changed, 110 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
> index df396bf..c19f053 100644
> --- a/drivers/net/tap/rte_eth_tap.c
> +++ b/drivers/net/tap/rte_eth_tap.c
> @@ -415,12 +415,43 @@ tap_tx_offload_get_queue_capa(void)
> 	       DEV_TX_OFFLOAD_TCP_CKSUM;
> }
> 
> +/* Finalize l4 checksum calculation */
> static void
> -tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
> -	       unsigned int l3_len)
> +tap_tx_l4_cksum(uint16_t *l4_cksum, uint16_t l4_phdr_cksum,
> +		uint32_t l4_raw_cksum)
> {
> -	void *l3_hdr = packet + l2_len;
> +	if (l4_cksum) {
> +		uint32_t cksum;
> +
> +		cksum = __rte_raw_cksum_reduce(l4_raw_cksum);
> +		cksum += l4_phdr_cksum;
> +
> +		cksum = ((cksum & 0xffff0000) >> 16) + (cksum & 0xffff);
> +		cksum = (~cksum) & 0xffff;
> +		if (cksum == 0)
> +			cksum = 0xffff;
> +		*l4_cksum = cksum;
> +	}
> +}
> 
> +/* Accumulate L4 raw checksums */
> +static void
> +tap_tx_l4_add_rcksum(char *l4_data, unsigned int l4_len, uint16_t *l4_cksum,
> +			uint32_t *l4_raw_cksum)
> +{
> +	if (l4_cksum == NULL)
> +		return;
> +
> +	*l4_raw_cksum = __rte_raw_cksum(l4_data, l4_len, *l4_raw_cksum);
> +}
> +
> +/* L3 and L4 pseudo headers checksum offloads */
> +static void
> +tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
> +		unsigned int l3_len, unsigned int l4_len, uint16_t **l4_cksum,
> +		uint16_t *l4_phdr_cksum, uint32_t *l4_raw_cksum)
> +{
> +	void *l3_hdr = packet + l2_len;

Needs a blank line here.

> 	if (ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4)) {
> 		struct ipv4_hdr *iph = l3_hdr;
> 		uint16_t cksum;
> @@ -430,38 +461,21 @@ tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
> 		iph->hdr_checksum = (cksum == 0xffff) ? cksum : ~cksum;
> 	}
> 	if (ol_flags & PKT_TX_L4_MASK) {
> -		uint16_t l4_len;
> -		uint32_t cksum;
> -		uint16_t *l4_cksum;
> 		void *l4_hdr;
> 
> 		l4_hdr = packet + l2_len + l3_len;
> 		if ((ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM)
> -			l4_cksum = &((struct udp_hdr *)l4_hdr)->dgram_cksum;
> +			*l4_cksum = &((struct udp_hdr *)l4_hdr)->dgram_cksum;
> 		else if ((ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM)
> -			l4_cksum = &((struct tcp_hdr *)l4_hdr)->cksum;
> +			*l4_cksum = &((struct tcp_hdr *)l4_hdr)->cksum;
> 		else
> 			return;
> -		*l4_cksum = 0;
> -		if (ol_flags & PKT_TX_IPV4) {
> -			struct ipv4_hdr *iph = l3_hdr;
> -
> -			l4_len = rte_be_to_cpu_16(iph->total_length) - l3_len;
> -			cksum = rte_ipv4_phdr_cksum(l3_hdr, 0);
> -		} else {
> -			struct ipv6_hdr *ip6h = l3_hdr;
> -
> -			/* payload_len does not include ext headers */
> -			l4_len = rte_be_to_cpu_16(ip6h->payload_len) -
> -				l3_len + sizeof(struct ipv6_hdr);
> -			cksum = rte_ipv6_phdr_cksum(l3_hdr, 0);
> -		}
> -		cksum += rte_raw_cksum(l4_hdr, l4_len);
> -		cksum = ((cksum & 0xffff0000) >> 16) + (cksum & 0xffff);
> -		cksum = (~cksum) & 0xffff;
> -		if (cksum == 0)
> -			cksum = 0xffff;
> -		*l4_cksum = cksum;
> +		**l4_cksum = 0;
> +		if (ol_flags & PKT_TX_IPV4)
> +			*l4_phdr_cksum = rte_ipv4_phdr_cksum(l3_hdr, 0);
> +		else
> +			*l4_phdr_cksum = rte_ipv6_phdr_cksum(l3_hdr, 0);
> +		*l4_raw_cksum = __rte_raw_cksum(l4_hdr, l4_len, 0);
> 	}
> }
> 
> @@ -482,17 +496,27 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> 	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
> 	for (i = 0; i < nb_pkts; i++) {
> 		struct rte_mbuf *mbuf = bufs[num_tx];
> -		struct iovec iovecs[mbuf->nb_segs + 1];
> +		struct iovec iovecs[mbuf->nb_segs + 2];
> 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
> 		struct rte_mbuf *seg = mbuf;
> 		char m_copy[mbuf->data_len];
> +		int proto;
> 		int n;
> 		int j;
> +		int k; /* first index in iovecs for copying segments */
> +		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
> +		uint16_t seg_len; /* length of first segment */
> +		uint16_t nb_segs;
> +		uint16_t *l4_cksum; /* l4 checksum (pseudo header + payload) */
> +		uint32_t l4_raw_cksum = 0; /* TCP/UDP payload raw checksum */
> +		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header checksum */
> +		uint16_t is_cksum = 0; /* in case cksum should be offloaded */
> 
> 		/* stats.errs will be incremented */
> 		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
> 			break;
> 
> +		l4_cksum = NULL;
> 		if (txq->type == ETH_TUNTAP_TYPE_TUN) {
> 			/*
> 			 * TUN and TAP are created with IFF_NO_PI disabled.
> @@ -505,35 +529,73 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> 			 * is 4 or 6, then protocol field is updated.
> 			 */
> 			char *buff_data = rte_pktmbuf_mtod(seg, void *);
> -			j = (*buff_data & 0xf0);
> -			pi.proto = (j == 0x40) ? rte_cpu_to_be_16(ETHER_TYPE_IPv4) :
> -				(j == 0x60) ? rte_cpu_to_be_16(ETHER_TYPE_IPv6) : 0x00;
> +			proto = (*buff_data & 0xf0);
> +			pi.proto = (proto == 0x40) ?
> +				rte_cpu_to_be_16(ETHER_TYPE_IPv4) :
> +				((proto == 0x60) ?
> +					rte_cpu_to_be_16(ETHER_TYPE_IPv6) :
> +					0x00);
> 		}
> 
> -		iovecs[0].iov_base = &pi;
> -		iovecs[0].iov_len = sizeof(pi);
> -		for (j = 1; j <= mbuf->nb_segs; j++) {
> -			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
> -			iovecs[j].iov_base =
> -				rte_pktmbuf_mtod(seg, void *);
> -			seg = seg->next;
> -		}
> +		k = 0;
> +		iovecs[k].iov_base = &pi;
> +		iovecs[k].iov_len = sizeof(pi);
> +		k++;

Blank lines are always good.

> +		nb_segs = mbuf->nb_segs;
> 		if (txq->csum &&
> 		    ((mbuf->ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4) ||
> 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM ||
> 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM))) {
> -			/* Support only packets with all data in the same seg */
> -			if (mbuf->nb_segs > 1)
> +			is_cksum = 1;

Blank line would be nice.

> +			/* Support only packets with at least layer 4
> +			 * header included in the first segment
> +			 */
> +			seg_len = rte_pktmbuf_data_len(mbuf);
> +			l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
> +			if (seg_len < l234_hlen)
> 				break;
> -			/* To change checksums, work on a copy of data. */
> +
> +			/* To change checksums, work on a
> +			 * copy of l2, l3 l4 headers.
> +			 */
> 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
> -				   rte_pktmbuf_data_len(mbuf));
> -			tap_tx_offload(m_copy, mbuf->ol_flags,
> -				       mbuf->l2_len, mbuf->l3_len);
> -			iovecs[1].iov_base = m_copy;
> +					l234_hlen);
> +			tap_tx_l3_cksum(m_copy, mbuf->ol_flags,
> +				       mbuf->l2_len, mbuf->l3_len, mbuf->l4_len,
> +				       &l4_cksum, &l4_phdr_cksum,
> +				       &l4_raw_cksum);
> +			iovecs[k].iov_base = m_copy;
> +			iovecs[k].iov_len = l234_hlen;
> +			k++;

Some blank lines would make this a lot more readable, like one here.

> +			/* Update next iovecs[] beyond l2, l3, l4 headers */
> +			if (seg_len > l234_hlen) {
> +				iovecs[k].iov_len = seg_len - l234_hlen;
> +				iovecs[k].iov_base =
> +					rte_pktmbuf_mtod(seg, char *) +
> +						l234_hlen;
> +				tap_tx_l4_add_rcksum(iovecs[k].iov_base,
> +					iovecs[k].iov_len, l4_cksum,
> +					&l4_raw_cksum);
> +				k++;
> +				nb_segs++;
> +			}
> +			seg = seg->next;
> 		}

Another place a blank line would be nice.

> +		for (j = k; j <= nb_segs; j++) {
> +			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
> +			iovecs[j].iov_base = rte_pktmbuf_mtod(seg, void *);
> +			if (is_cksum)
> +				tap_tx_l4_add_rcksum(iovecs[j].iov_base,
> +					iovecs[j].iov_len, l4_cksum,
> +					&l4_raw_cksum);
> +			seg = seg->next;
> +		}
> +
> +		if (is_cksum)
> +			tap_tx_l4_cksum(l4_cksum, l4_phdr_cksum, l4_raw_cksum);
> +
> 		/* copy the tx frame data */
> -		n = writev(txq->fd, iovecs, mbuf->nb_segs + 1);
> +		n = writev(txq->fd, iovecs, j);
> 		if (n <= 0)
> 			break;
> 
> -- 
> 2.7.4
> 

Regards,
Keith

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload)
  2018-06-12 16:31     ` [dpdk-dev] [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
@ 2018-06-12 17:22       ` Wiles, Keith
  2018-06-13 16:04       ` Wiles, Keith
  2018-06-23 23:17       ` [dpdk-dev] [PATCH v5 0/2] TAP TSO Ophir Munk
  2 siblings, 0 replies; 31+ messages in thread
From: Wiles, Keith @ 2018-06-12 17:22 UTC (permalink / raw)
  To: Ophir Munk; +Cc: dev, Pascal Mazon, Thomas Monjalon, Olga Shern



> On Jun 12, 2018, at 11:31 AM, Ophir Munk <ophirmu@mellanox.com> wrote:
> 
> This commit implements TCP segmentation offload in TAP.
> librte_gso library is used to segment large TCP payloads (e.g. packets
> of 64K bytes size) into smaller MTU size buffers.
> By supporting TSO offload capability in software a TAP device can be used
> as a failsafe sub device and be paired with another PCI device which
> supports TSO capability in HW.
> 
> For more details on librte_gso implementation please refer to dpdk
> documentation.
> The number of newly generated TCP TSO segments is limited to 64.
> 
> Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
> Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
> ---
> drivers/net/tap/Makefile      |   2 +-
> drivers/net/tap/rte_eth_tap.c | 159 +++++++++++++++++++++++++++++++++++-------
> drivers/net/tap/rte_eth_tap.h |   3 +
> mk/rte.app.mk                 |   4 +-
> 4 files changed, 138 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile
> index ccc5c5f..3243365 100644
> --- a/drivers/net/tap/Makefile
> +++ b/drivers/net/tap/Makefile
> @@ -24,7 +24,7 @@ CFLAGS += -I.
> CFLAGS += $(WERROR_FLAGS)
> LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
> LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash
> -LDLIBS += -lrte_bus_vdev
> +LDLIBS += -lrte_bus_vdev -lrte_gso
> 
> CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES)
> 
> diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
> index c19f053..62b931f 100644
> --- a/drivers/net/tap/rte_eth_tap.c
> +++ b/drivers/net/tap/rte_eth_tap.c
> @@ -17,6 +17,7 @@
> #include <rte_ip.h>
> #include <rte_string_fns.h>
> 
> +#include <assert.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/socket.h>
> @@ -55,6 +56,9 @@
> #define ETH_TAP_CMP_MAC_FMT     "0123456789ABCDEFabcdef"
> #define ETH_TAP_MAC_ARG_FMT     ETH_TAP_MAC_FIXED "|" ETH_TAP_USR_MAC_FMT
> 
> +#define TAP_GSO_MBUFS_NUM	128
> +#define TAP_GSO_MBUF_SEG_SIZE	128
> +
> static struct rte_vdev_driver pmd_tap_drv;
> static struct rte_vdev_driver pmd_tun_drv;
> 
> @@ -412,7 +416,8 @@ tap_tx_offload_get_queue_capa(void)
> 	return DEV_TX_OFFLOAD_MULTI_SEGS |
> 	       DEV_TX_OFFLOAD_IPV4_CKSUM |
> 	       DEV_TX_OFFLOAD_UDP_CKSUM |
> -	       DEV_TX_OFFLOAD_TCP_CKSUM;
> +	       DEV_TX_OFFLOAD_TCP_CKSUM |
> +	       DEV_TX_OFFLOAD_TCP_TSO;
> }
> 
> /* Finalize l4 checksum calculation */
> @@ -479,23 +484,15 @@ tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
> 	}
> }
> 
> -/* Callback to handle sending packets from the tap interface
> - */
> -static uint16_t
> -pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +static inline void
> +tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs,
> +			struct rte_mbuf **pmbufs, uint16_t l234_hlen,
> +			uint16_t *num_packets, unsigned long *num_tx_bytes)
> {
> -	struct tx_queue *txq = queue;
> -	uint16_t num_tx = 0;
> -	unsigned long num_tx_bytes = 0;
> -	uint32_t max_size;
> 	int i;
> 
> -	if (unlikely(nb_pkts == 0))
> -		return 0;
> -
> -	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
> -	for (i = 0; i < nb_pkts; i++) {
> -		struct rte_mbuf *mbuf = bufs[num_tx];
> +	for (i = 0; i < num_mbufs; i++) {
> +		struct rte_mbuf *mbuf = pmbufs[i];
> 		struct iovec iovecs[mbuf->nb_segs + 2];
> 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
> 		struct rte_mbuf *seg = mbuf;
> @@ -503,8 +500,7 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> 		int proto;
> 		int n;
> 		int j;
> -		int k; /* first index in iovecs for copying segments */
> -		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
> +		int k; /* current index in iovecs for copying segments */
> 		uint16_t seg_len; /* length of first segment */
> 		uint16_t nb_segs;
> 		uint16_t *l4_cksum; /* l4 checksum (pseudo header + payload) */
> @@ -512,10 +508,6 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> 		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header checksum */
> 		uint16_t is_cksum = 0; /* in case cksum should be offloaded */
> 
> -		/* stats.errs will be incremented */
> -		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
> -			break;
> -
> 		l4_cksum = NULL;
> 		if (txq->type == ETH_TUNTAP_TYPE_TUN) {
> 			/*
> @@ -554,9 +546,8 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> 			l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
> 			if (seg_len < l234_hlen)
> 				break;
> -
> -			/* To change checksums, work on a
> -			 * copy of l2, l3 l4 headers.
> +			/* To change checksums, work on a copy of l2, l3
> +			 * headers + l4 pseudo header
> 			 */
> 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
> 					l234_hlen);
> @@ -598,13 +589,80 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> 		n = writev(txq->fd, iovecs, j);
> 		if (n <= 0)
> 			break;
> +		(*num_packets)++;
> +		(*num_tx_bytes) += rte_pktmbuf_pkt_len(mbuf);
> +	}
> +}
> +
> +/* Callback to handle sending packets from the tap interface
> + */
> +static uint16_t
> +pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	struct tx_queue *txq = queue;
> +	uint16_t num_tx = 0;
> +	uint16_t num_packets = 0;
> +	unsigned long num_tx_bytes = 0;
> +	uint16_t tso_segsz = 0;
> +	uint16_t hdrs_len;
> +	uint32_t max_size;
> +	int i;
> +	uint64_t tso;
> +	int ret;
> 
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS];
> +	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
> +	for (i = 0; i < nb_pkts; i++) {
> +		struct rte_mbuf *mbuf_in = bufs[num_tx];
> +		struct rte_mbuf **mbuf;
> +		uint16_t num_mbufs;
> +
> +		tso = mbuf_in->ol_flags & PKT_TX_TCP_SEG;
> +		if (tso) {
> +			struct rte_gso_ctx *gso_ctx = &txq->gso_ctx;

Blank line here after declare.
> +			assert(gso_ctx != NULL);
> +			/* TCP segmentation implies TCP checksum offload */
> +			mbuf_in->ol_flags |= PKT_TX_TCP_CKSUM;
> +			/* gso size is calculated without ETHER_CRC_LEN */
> +			hdrs_len = mbuf_in->l2_len + mbuf_in->l3_len +
> +					mbuf_in->l4_len;
> +			tso_segsz = mbuf_in->tso_segsz + hdrs_len;
> +			if (unlikely(tso_segsz == hdrs_len) ||
> +				tso_segsz > *txq->mtu) {
> +				txq->stats.errs++;
> +				break;
> +			}
> +			gso_ctx->gso_size = tso_segsz;
> +			ret = rte_gso_segment(mbuf_in, /* packet to segment */
> +				gso_ctx, /* gso control block */
> +				(struct rte_mbuf **)&gso_mbufs, /* out mbufs */
> +				RTE_DIM(gso_mbufs)); /* max tso mbufs */
> +
> +			/* ret contains the number of new created mbufs */
> +			if (ret < 0)
> +				break;
> +
> +			mbuf = gso_mbufs;
> +			num_mbufs = ret;
> +		} else {
> +			/* stats.errs will be incremented */
> +			if (rte_pktmbuf_pkt_len(mbuf_in) > max_size)
> +				break;
> +
> +			mbuf = &mbuf_in;
> +			num_mbufs = 1;
> +		}
> +
> +		tap_write_mbufs(txq, num_mbufs, mbuf, hdrs_len,
> +				&num_packets, &num_tx_bytes);
> 		num_tx++;
> -		num_tx_bytes += mbuf->pkt_len;
> -		rte_pktmbuf_free(mbuf);
> +		rte_pktmbuf_free(mbuf_in);
> 	}
> 
> -	txq->stats.opackets += num_tx;
> +	txq->stats.opackets += num_packets;
> 	txq->stats.errs += nb_pkts - num_tx;
> 	txq->stats.obytes += num_tx_bytes;
> 
> @@ -1066,31 +1124,73 @@ tap_mac_set(struct rte_eth_dev *dev, struct ether_addr *mac_addr)
> }
> 
> static int
> +tap_gso_ctx_setup(struct rte_gso_ctx *gso_ctx, struct rte_eth_dev *dev)
> +{
> +	uint32_t gso_types;
> +	char pool_name[64];
> +
> +	/*
> +	 * Create private mbuf pool with TAP_GSO_MBUF_SEG_SIZE bytes
> +	 * size per mbuf use this pool for both direct and indirect mbufs
> +	 */
> +
> +	struct rte_mempool *mp;      /* Mempool for GSO packets */

Blank line after declare.
> +	/* initialize GSO context */
> +	gso_types = DEV_TX_OFFLOAD_TCP_TSO;
> +	snprintf(pool_name, sizeof(pool_name), "mp_%s", dev->device->name);
> +	mp = rte_mempool_lookup((const char *)pool_name);
> +	if (!mp) {
> +		mp = rte_pktmbuf_pool_create(pool_name, TAP_GSO_MBUFS_NUM,
> +			0, 0, RTE_PKTMBUF_HEADROOM + TAP_GSO_MBUF_SEG_SIZE,
> +			SOCKET_ID_ANY);
> +		if (!mp) {
> +			struct pmd_internals *pmd = dev->data->dev_private;
> +			RTE_LOG(DEBUG, PMD, "%s: failed to create mbuf pool for device %s\n",
> +				pmd->name, dev->device->name);
> +			return -1;
> +		}
> +	}
> +
> +	gso_ctx->direct_pool = mp;
> +	gso_ctx->indirect_pool = mp;
> +	gso_ctx->gso_types = gso_types;
> +	gso_ctx->gso_size = 0; /* gso_size is set in tx_burst() per packet */
> +	gso_ctx->flag = 0;
> +
> +	return 0;
> +}
> +
> +static int
> tap_setup_queue(struct rte_eth_dev *dev,
> 		struct pmd_internals *internals,
> 		uint16_t qid,
> 		int is_rx)
> {
> +	int ret;
> 	int *fd;
> 	int *other_fd;
> 	const char *dir;
> 	struct pmd_internals *pmd = dev->data->dev_private;
> 	struct rx_queue *rx = &internals->rxq[qid];
> 	struct tx_queue *tx = &internals->txq[qid];
> +	struct rte_gso_ctx *gso_ctx;
> 
> 	if (is_rx) {
> 		fd = &rx->fd;
> 		other_fd = &tx->fd;
> 		dir = "rx";
> +		gso_ctx = NULL;
> 	} else {
> 		fd = &tx->fd;
> 		other_fd = &rx->fd;
> 		dir = "tx";
> +		gso_ctx = &tx->gso_ctx;
> 	}
> 	if (*fd != -1) {
> 		/* fd for this queue already exists */
> 		TAP_LOG(DEBUG, "%s: fd %d for %s queue qid %d exists",
> 			pmd->name, *fd, dir, qid);
> +		gso_ctx = NULL;
> 	} else if (*other_fd != -1) {
> 		/* Only other_fd exists. dup it */
> 		*fd = dup(*other_fd);
> @@ -1115,6 +1215,11 @@ tap_setup_queue(struct rte_eth_dev *dev,
> 
> 	tx->mtu = &dev->data->mtu;
> 	rx->rxmode = &dev->data->dev_conf.rxmode;
> +	if (gso_ctx) {
> +		ret = tap_gso_ctx_setup(gso_ctx, dev);
> +		if (ret)
> +			return -1;
> +	}
> 
> 	tx->type = pmd->type;
> 
> diff --git a/drivers/net/tap/rte_eth_tap.h b/drivers/net/tap/rte_eth_tap.h
> index 7b21d0d..44e2773 100644
> --- a/drivers/net/tap/rte_eth_tap.h
> +++ b/drivers/net/tap/rte_eth_tap.h
> @@ -15,6 +15,7 @@
> 
> #include <rte_ethdev_driver.h>
> #include <rte_ether.h>
> +#include <rte_gso.h>
> #include "tap_log.h"
> 
> #ifdef IFF_MULTI_QUEUE
> @@ -22,6 +23,7 @@
> #else
> #define RTE_PMD_TAP_MAX_QUEUES	1
> #endif
> +#define MAX_GSO_MBUFS 64
> 
> enum rte_tuntap_type {
> 	ETH_TUNTAP_TYPE_UNKNOWN,
> @@ -59,6 +61,7 @@ struct tx_queue {
> 	uint16_t *mtu;                  /* Pointer to MTU from dev_data */
> 	uint16_t csum:1;                /* Enable checksum offloading */
> 	struct pkt_stats stats;         /* Stats for this TX queue */
> +	struct rte_gso_ctx gso_ctx;     /* GSO context */
> };
> 
> struct pmd_internals {
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 1e32c83..e2ee879 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -38,8 +38,6 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PORT)           += -lrte_port
> _LDLIBS-$(CONFIG_RTE_LIBRTE_PDUMP)          += -lrte_pdump
> _LDLIBS-$(CONFIG_RTE_LIBRTE_DISTRIBUTOR)    += -lrte_distributor
> _LDLIBS-$(CONFIG_RTE_LIBRTE_IP_FRAG)        += -lrte_ip_frag
> -_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
> -_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
> _LDLIBS-$(CONFIG_RTE_LIBRTE_METER)          += -lrte_meter
> _LDLIBS-$(CONFIG_RTE_LIBRTE_LPM)            += -lrte_lpm
> # librte_acl needs --whole-archive because of weak functions
> @@ -61,6 +59,8 @@ endif
> _LDLIBS-y += --whole-archive
> 
> _LDLIBS-$(CONFIG_RTE_LIBRTE_CFGFILE)        += -lrte_cfgfile
> +_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
> +_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
> _LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)           += -lrte_hash
> _LDLIBS-$(CONFIG_RTE_LIBRTE_MEMBER)         += -lrte_member
> _LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)          += -lrte_vhost
> -- 
> 2.7.4
> 

Regards,
Keith

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload)
  2018-06-12 16:31     ` [dpdk-dev] [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  2018-06-12 17:22       ` Wiles, Keith
@ 2018-06-13 16:04       ` Wiles, Keith
  2018-06-14  7:59         ` Ophir Munk
  2018-06-23 23:17       ` [dpdk-dev] [PATCH v5 0/2] TAP TSO Ophir Munk
  2 siblings, 1 reply; 31+ messages in thread
From: Wiles, Keith @ 2018-06-13 16:04 UTC (permalink / raw)
  To: Ophir Munk; +Cc: DPDK, Pascal Mazon, Thomas Monjalon, Olga Shern



> On Jun 12, 2018, at 11:31 AM, Ophir Munk <ophirmu@mellanox.com> wrote:
> 
> This commit implements TCP segmentation offload in TAP.
> librte_gso library is used to segment large TCP payloads (e.g. packets
> of 64K bytes size) into smaller MTU size buffers.
> By supporting TSO offload capability in software a TAP device can be used
> as a failsafe sub device and be paired with another PCI device which
> supports TSO capability in HW.
> 
> For more details on librte_gso implementation please refer to dpdk
> documentation.
> The number of newly generated TCP TSO segments is limited to 64.
> 
> Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
> Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
> ---
> drivers/net/tap/Makefile      |   2 +-
> drivers/net/tap/rte_eth_tap.c | 159 +++++++++++++++++++++++++++++++++++-------
> drivers/net/tap/rte_eth_tap.h |   3 +
> mk/rte.app.mk                 |   4 +-
> 4 files changed, 138 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile
> index ccc5c5f..3243365 100644
> --- a/drivers/net/tap/Makefile
> +++ b/drivers/net/tap/Makefile
> @@ -24,7 +24,7 @@ CFLAGS += -I.
> CFLAGS += $(WERROR_FLAGS)
> LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
> LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash
> -LDLIBS += -lrte_bus_vdev
> +LDLIBS += -lrte_bus_vdev -lrte_gso
> 
> CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES)
> 
> diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
> index c19f053..62b931f 100644
> --- a/drivers/net/tap/rte_eth_tap.c
> +++ b/drivers/net/tap/rte_eth_tap.c
> @@ -17,6 +17,7 @@
> #include <rte_ip.h>
> #include <rte_string_fns.h>
> 
> +#include <assert.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/socket.h>
> @@ -55,6 +56,9 @@
> #define ETH_TAP_CMP_MAC_FMT     "0123456789ABCDEFabcdef"
> #define ETH_TAP_MAC_ARG_FMT     ETH_TAP_MAC_FIXED "|" ETH_TAP_USR_MAC_FMT
> 
> +#define TAP_GSO_MBUFS_NUM	128
> +#define TAP_GSO_MBUF_SEG_SIZE	128
> +
> static struct rte_vdev_driver pmd_tap_drv;
> static struct rte_vdev_driver pmd_tun_drv;
> 
> @@ -412,7 +416,8 @@ tap_tx_offload_get_queue_capa(void)
> 	return DEV_TX_OFFLOAD_MULTI_SEGS |
> 	       DEV_TX_OFFLOAD_IPV4_CKSUM |
> 	       DEV_TX_OFFLOAD_UDP_CKSUM |
> -	       DEV_TX_OFFLOAD_TCP_CKSUM;
> +	       DEV_TX_OFFLOAD_TCP_CKSUM |
> +	       DEV_TX_OFFLOAD_TCP_TSO;
> }
> 
> /* Finalize l4 checksum calculation */
> @@ -479,23 +484,15 @@ tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
> 	}
> }
> 
> -/* Callback to handle sending packets from the tap interface
> - */
> -static uint16_t
> -pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +static inline void
> +tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs,
> +			struct rte_mbuf **pmbufs, uint16_t l234_hlen,
> +			uint16_t *num_packets, unsigned long *num_tx_bytes)
> {
> -	struct tx_queue *txq = queue;
> -	uint16_t num_tx = 0;
> -	unsigned long num_tx_bytes = 0;
> -	uint32_t max_size;
> 	int i;
> 
> -	if (unlikely(nb_pkts == 0))
> -		return 0;
> -
> -	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
> -	for (i = 0; i < nb_pkts; i++) {
> -		struct rte_mbuf *mbuf = bufs[num_tx];
> +	for (i = 0; i < num_mbufs; i++) {
> +		struct rte_mbuf *mbuf = pmbufs[i];
> 		struct iovec iovecs[mbuf->nb_segs + 2];
> 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
> 		struct rte_mbuf *seg = mbuf;
> @@ -503,8 +500,7 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> 		int proto;
> 		int n;
> 		int j;
> -		int k; /* first index in iovecs for copying segments */
> -		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
> +		int k; /* current index in iovecs for copying segments */
> 		uint16_t seg_len; /* length of first segment */
> 		uint16_t nb_segs;
> 		uint16_t *l4_cksum; /* l4 checksum (pseudo header + payload) */
> @@ -512,10 +508,6 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> 		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header checksum */
> 		uint16_t is_cksum = 0; /* in case cksum should be offloaded */
> 
> -		/* stats.errs will be incremented */
> -		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
> -			break;
> -
> 		l4_cksum = NULL;
> 		if (txq->type == ETH_TUNTAP_TYPE_TUN) {
> 			/*
> @@ -554,9 +546,8 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> 			l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
> 			if (seg_len < l234_hlen)
> 				break;
> -
> -			/* To change checksums, work on a
> -			 * copy of l2, l3 l4 headers.

Adding back in the blank line above would be nice for readability.
> +			/* To change checksums, work on a copy of l2, l3
> +			 * headers + l4 pseudo header
> 			 */
> 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
> 					l234_hlen);
> @@ -598,13 +589,80 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> 		n = writev(txq->fd, iovecs, j);
> 		if (n <= 0)
> 			break;
> +		(*num_packets)++;
> +		(*num_tx_bytes) += rte_pktmbuf_pkt_len(mbuf);
> +	}
> +}
> +
> +/* Callback to handle sending packets from the tap interface
> + */
> +static uint16_t
> +pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	struct tx_queue *txq = queue;
> +	uint16_t num_tx = 0;
> +	uint16_t num_packets = 0;
> +	unsigned long num_tx_bytes = 0;
> +	uint16_t tso_segsz = 0;
> +	uint16_t hdrs_len;
> +	uint32_t max_size;
> +	int i;
> +	uint64_t tso;
> +	int ret;
> 
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS];
> +	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
> +	for (i = 0; i < nb_pkts; i++) {
> +		struct rte_mbuf *mbuf_in = bufs[num_tx];
> +		struct rte_mbuf **mbuf;
> +		uint16_t num_mbufs;
> +
> +		tso = mbuf_in->ol_flags & PKT_TX_TCP_SEG;
> +		if (tso) {
> +			struct rte_gso_ctx *gso_ctx = &txq->gso_ctx;

Missing blank line here, this one is not optional.
> +			assert(gso_ctx != NULL);

Blank line would be nice.
> +			/* TCP segmentation implies TCP checksum offload */
> +			mbuf_in->ol_flags |= PKT_TX_TCP_CKSUM;

Blank line would be nice.
> +			/* gso size is calculated without ETHER_CRC_LEN */
> +			hdrs_len = mbuf_in->l2_len + mbuf_in->l3_len +
> +					mbuf_in->l4_len;
> +			tso_segsz = mbuf_in->tso_segsz + hdrs_len;
> +			if (unlikely(tso_segsz == hdrs_len) ||
> +				tso_segsz > *txq->mtu) {
> +				txq->stats.errs++;
> +				break;
> +			}
> +			gso_ctx->gso_size = tso_segsz;
> +			ret = rte_gso_segment(mbuf_in, /* packet to segment */
> +				gso_ctx, /* gso control block */
> +				(struct rte_mbuf **)&gso_mbufs, /* out mbufs */
> +				RTE_DIM(gso_mbufs)); /* max tso mbufs */
> +
> +			/* ret contains the number of new created mbufs */
> +			if (ret < 0)
> +				break;
> +
> +			mbuf = gso_mbufs;
> +			num_mbufs = ret;
> +		} else {
> +			/* stats.errs will be incremented */
> +			if (rte_pktmbuf_pkt_len(mbuf_in) > max_size)
> +				break;
> +
> +			mbuf = &mbuf_in;
> +			num_mbufs = 1;
> +		}
> +
> +		tap_write_mbufs(txq, num_mbufs, mbuf, hdrs_len,
> +				&num_packets, &num_tx_bytes);
> 		num_tx++;
> -		num_tx_bytes += mbuf->pkt_len;
> -		rte_pktmbuf_free(mbuf);
> +		rte_pktmbuf_free(mbuf_in);
> 	}
> 
> -	txq->stats.opackets += num_tx;
> +	txq->stats.opackets += num_packets;
> 	txq->stats.errs += nb_pkts - num_tx;
> 	txq->stats.obytes += num_tx_bytes;
> 
> @@ -1066,31 +1124,73 @@ tap_mac_set(struct rte_eth_dev *dev, struct ether_addr *mac_addr)
> }
> 
> static int
> +tap_gso_ctx_setup(struct rte_gso_ctx *gso_ctx, struct rte_eth_dev *dev)
> +{
> +	uint32_t gso_types;
> +	char pool_name[64];
> +
> +	/*
> +	 * Create private mbuf pool with TAP_GSO_MBUF_SEG_SIZE bytes
> +	 * size per mbuf use this pool for both direct and indirect mbufs
> +	 */
> +
> +	struct rte_mempool *mp;      /* Mempool for GSO packets */
> +	/* initialize GSO context */
> +	gso_types = DEV_TX_OFFLOAD_TCP_TSO;
> +	snprintf(pool_name, sizeof(pool_name), "mp_%s", dev->device->name);
> +	mp = rte_mempool_lookup((const char *)pool_name);
> +	if (!mp) {
> +		mp = rte_pktmbuf_pool_create(pool_name, TAP_GSO_MBUFS_NUM,
> +			0, 0, RTE_PKTMBUF_HEADROOM + TAP_GSO_MBUF_SEG_SIZE,
> +			SOCKET_ID_ANY);

You have set up the mempool with no cache size, which means you have to take a lock for each allocation. This could be changed to have a small cache per lcore, say 8, but the total number of mbufs needs to be large enough to not allow starvation for any lcore:  total_mbufs = (max_num_ports * cache_size) + some_extra mbufs;
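
For illustration, a sketch of that shape (the cache depth of 8 and the slack of
64 are example values only, and TAP_GSO_MBUF_CACHE / TAP_GSO_MBUFS_TOTAL are
hypothetical names, not part of the patch):

	/* Hypothetical sizing: a per-lcore cache plus some slack mbufs */
	#define TAP_GSO_MBUF_CACHE	8
	#define TAP_GSO_MBUFS_TOTAL \
		(RTE_MAX_ETHPORTS * TAP_GSO_MBUF_CACHE + 64 /* extra */)

	mp = rte_pktmbuf_pool_create(pool_name, TAP_GSO_MBUFS_TOTAL,
		TAP_GSO_MBUF_CACHE, /* per-lcore cache, avoids the pool lock */
		0, RTE_PKTMBUF_HEADROOM + TAP_GSO_MBUF_SEG_SIZE,
		SOCKET_ID_ANY);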

> +		if (!mp) {
> +			struct pmd_internals *pmd = dev->data->dev_private;
> +			RTE_LOG(DEBUG, PMD, "%s: failed to create mbuf pool for device %s\n",
> +				pmd->name, dev->device->name);
> +			return -1;
> +		}
> +	}
> +
> +	gso_ctx->direct_pool = mp;
> +	gso_ctx->indirect_pool = mp;
> +	gso_ctx->gso_types = gso_types;
> +	gso_ctx->gso_size = 0; /* gso_size is set in tx_burst() per packet */
> +	gso_ctx->flag = 0;
> +
> +	return 0;
> +}
> +
> +static int
> tap_setup_queue(struct rte_eth_dev *dev,
> 		struct pmd_internals *internals,
> 		uint16_t qid,
> 		int is_rx)
> {
> +	int ret;
> 	int *fd;
> 	int *other_fd;
> 	const char *dir;
> 	struct pmd_internals *pmd = dev->data->dev_private;
> 	struct rx_queue *rx = &internals->rxq[qid];
> 	struct tx_queue *tx = &internals->txq[qid];
> +	struct rte_gso_ctx *gso_ctx;
> 
> 	if (is_rx) {
> 		fd = &rx->fd;
> 		other_fd = &tx->fd;
> 		dir = "rx";
> +		gso_ctx = NULL;
> 	} else {
> 		fd = &tx->fd;
> 		other_fd = &rx->fd;
> 		dir = "tx";
> +		gso_ctx = &tx->gso_ctx;
> 	}
> 	if (*fd != -1) {
> 		/* fd for this queue already exists */
> 		TAP_LOG(DEBUG, "%s: fd %d for %s queue qid %d exists",
> 			pmd->name, *fd, dir, qid);
> +		gso_ctx = NULL;
> 	} else if (*other_fd != -1) {
> 		/* Only other_fd exists. dup it */
> 		*fd = dup(*other_fd);
> @@ -1115,6 +1215,11 @@ tap_setup_queue(struct rte_eth_dev *dev,
> 
> 	tx->mtu = &dev->data->mtu;
> 	rx->rxmode = &dev->data->dev_conf.rxmode;
> +	if (gso_ctx) {
> +		ret = tap_gso_ctx_setup(gso_ctx, dev);
> +		if (ret)
> +			return -1;
> +	}
> 
> 	tx->type = pmd->type;
> 
> diff --git a/drivers/net/tap/rte_eth_tap.h b/drivers/net/tap/rte_eth_tap.h
> index 7b21d0d..44e2773 100644
> --- a/drivers/net/tap/rte_eth_tap.h
> +++ b/drivers/net/tap/rte_eth_tap.h
> @@ -15,6 +15,7 @@
> 
> #include <rte_ethdev_driver.h>
> #include <rte_ether.h>
> +#include <rte_gso.h>
> #include "tap_log.h"
> 
> #ifdef IFF_MULTI_QUEUE
> @@ -22,6 +23,7 @@
> #else
> #define RTE_PMD_TAP_MAX_QUEUES	1
> #endif
> +#define MAX_GSO_MBUFS 64
> 
> enum rte_tuntap_type {
> 	ETH_TUNTAP_TYPE_UNKNOWN,
> @@ -59,6 +61,7 @@ struct tx_queue {
> 	uint16_t *mtu;                  /* Pointer to MTU from dev_data */
> 	uint16_t csum:1;                /* Enable checksum offloading */
> 	struct pkt_stats stats;         /* Stats for this TX queue */
> +	struct rte_gso_ctx gso_ctx;     /* GSO context */
> };
> 
> struct pmd_internals {
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 1e32c83..e2ee879 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -38,8 +38,6 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PORT)           += -lrte_port
> _LDLIBS-$(CONFIG_RTE_LIBRTE_PDUMP)          += -lrte_pdump
> _LDLIBS-$(CONFIG_RTE_LIBRTE_DISTRIBUTOR)    += -lrte_distributor
> _LDLIBS-$(CONFIG_RTE_LIBRTE_IP_FRAG)        += -lrte_ip_frag
> -_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
> -_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
> _LDLIBS-$(CONFIG_RTE_LIBRTE_METER)          += -lrte_meter
> _LDLIBS-$(CONFIG_RTE_LIBRTE_LPM)            += -lrte_lpm
> # librte_acl needs --whole-archive because of weak functions
> @@ -61,6 +59,8 @@ endif
> _LDLIBS-y += --whole-archive
> 
> _LDLIBS-$(CONFIG_RTE_LIBRTE_CFGFILE)        += -lrte_cfgfile
> +_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
> +_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
> _LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)           += -lrte_hash
> _LDLIBS-$(CONFIG_RTE_LIBRTE_MEMBER)         += -lrte_member
> _LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)          += -lrte_vhost
> -- 
> 2.7.4
> 

Regards,
Keith

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload)
  2018-06-13 16:04       ` Wiles, Keith
@ 2018-06-14  7:59         ` Ophir Munk
  2018-06-14 12:58           ` Wiles, Keith
  0 siblings, 1 reply; 31+ messages in thread
From: Ophir Munk @ 2018-06-14  7:59 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: DPDK, Pascal Mazon, Thomas Monjalon, Olga Shern



> -----Original Message-----
> From: Wiles, Keith [mailto:keith.wiles@intel.com]
> Sent: Wednesday, June 13, 2018 7:04 PM
> To: Ophir Munk <ophirmu@mellanox.com>
> Cc: DPDK <dev@dpdk.org>; Pascal Mazon <pascal.mazon@6wind.com>;
> Thomas Monjalon <thomas@monjalon.net>; Olga Shern
> <olgas@mellanox.com>
> Subject: Re: [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload)
> 
> 
> 
> > On Jun 12, 2018, at 11:31 AM, Ophir Munk <ophirmu@mellanox.com>
> wrote:
> >
> > This commit implements TCP segmentation offload in TAP.
> > librte_gso library is used to segment large TCP payloads (e.g. packets
> > of 64K bytes size) into smaller MTU size buffers.
> > By supporting TSO offload capability in software a TAP device can be
> > used as a failsafe sub device and be paired with another PCI device
> > which supports TSO capability in HW.
> >
> > For more details on librte_gso implementation please refer to dpdk
> > documentation.
> > The number of newly generated TCP TSO segments is limited to 64.
> >
> > Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
> > Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
> > ---
> > drivers/net/tap/Makefile      |   2 +-
> > drivers/net/tap/rte_eth_tap.c | 159
> +++++++++++++++++++++++++++++++++++-------
> > drivers/net/tap/rte_eth_tap.h |   3 +
> > mk/rte.app.mk                 |   4 +-
> > 4 files changed, 138 insertions(+), 30 deletions(-)
> >
> > diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile index
> > ccc5c5f..3243365 100644
> > --- a/drivers/net/tap/Makefile
> > +++ b/drivers/net/tap/Makefile
> > @@ -24,7 +24,7 @@ CFLAGS += -I.
> > CFLAGS += $(WERROR_FLAGS)
> > LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring LDLIBS +=
> > -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash -LDLIBS +=
> > -lrte_bus_vdev
> > +LDLIBS += -lrte_bus_vdev -lrte_gso
> >
> > CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES)
> >
> > diff --git a/drivers/net/tap/rte_eth_tap.c
> > b/drivers/net/tap/rte_eth_tap.c index c19f053..62b931f 100644
> > --- a/drivers/net/tap/rte_eth_tap.c
> > +++ b/drivers/net/tap/rte_eth_tap.c
> > @@ -17,6 +17,7 @@
> > #include <rte_ip.h>
> > #include <rte_string_fns.h>
> >
> > +#include <assert.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <sys/socket.h>
> > @@ -55,6 +56,9 @@
> > #define ETH_TAP_CMP_MAC_FMT     "0123456789ABCDEFabcdef"
> > #define ETH_TAP_MAC_ARG_FMT     ETH_TAP_MAC_FIXED "|"
> ETH_TAP_USR_MAC_FMT
> >
> > +#define TAP_GSO_MBUFS_NUM	128
> > +#define TAP_GSO_MBUF_SEG_SIZE	128
> > +
> > static struct rte_vdev_driver pmd_tap_drv; static struct
> > rte_vdev_driver pmd_tun_drv;
> >
> > @@ -412,7 +416,8 @@ tap_tx_offload_get_queue_capa(void)
> > 	return DEV_TX_OFFLOAD_MULTI_SEGS |
> > 	       DEV_TX_OFFLOAD_IPV4_CKSUM |
> > 	       DEV_TX_OFFLOAD_UDP_CKSUM |
> > -	       DEV_TX_OFFLOAD_TCP_CKSUM;
> > +	       DEV_TX_OFFLOAD_TCP_CKSUM |
> > +	       DEV_TX_OFFLOAD_TCP_TSO;
> > }
> >
> > /* Finalize l4 checksum calculation */ @@ -479,23 +484,15 @@
> > tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
> > 	}
> > }
> >
> > -/* Callback to handle sending packets from the tap interface
> > - */
> > -static uint16_t
> > -pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > +static inline void
> > +tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs,
> > +			struct rte_mbuf **pmbufs, uint16_t l234_hlen,
> > +			uint16_t *num_packets, unsigned long
> *num_tx_bytes)
> > {
> > -	struct tx_queue *txq = queue;
> > -	uint16_t num_tx = 0;
> > -	unsigned long num_tx_bytes = 0;
> > -	uint32_t max_size;
> > 	int i;
> >
> > -	if (unlikely(nb_pkts == 0))
> > -		return 0;
> > -
> > -	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
> > -	for (i = 0; i < nb_pkts; i++) {
> > -		struct rte_mbuf *mbuf = bufs[num_tx];
> > +	for (i = 0; i < num_mbufs; i++) {
> > +		struct rte_mbuf *mbuf = pmbufs[i];
> > 		struct iovec iovecs[mbuf->nb_segs + 2];
> > 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
> > 		struct rte_mbuf *seg = mbuf;
> > @@ -503,8 +500,7 @@ pmd_tx_burst(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> > 		int proto;
> > 		int n;
> > 		int j;
> > -		int k; /* first index in iovecs for copying segments */
> > -		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
> > +		int k; /* current index in iovecs for copying segments */
> > 		uint16_t seg_len; /* length of first segment */
> > 		uint16_t nb_segs;
> > 		uint16_t *l4_cksum; /* l4 checksum (pseudo header +
> payload) */ @@
> > -512,10 +508,6 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs,
> uint16_t nb_pkts)
> > 		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header
> checksum */
> > 		uint16_t is_cksum = 0; /* in case cksum should be offloaded
> */
> >
> > -		/* stats.errs will be incremented */
> > -		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
> > -			break;
> > -
> > 		l4_cksum = NULL;
> > 		if (txq->type == ETH_TUNTAP_TYPE_TUN) {
> > 			/*
> > @@ -554,9 +546,8 @@ pmd_tx_burst(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> > 			l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf-
> >l4_len;
> > 			if (seg_len < l234_hlen)
> > 				break;
> > -
> > -			/* To change checksums, work on a
> > -			 * copy of l2, l3 l4 headers.
> 
> Adding back in the blank line above would be nice for readability.
> > +			/* To change checksums, work on a copy of l2, l3
> > +			 * headers + l4 pseudo header
> > 			 */
> > 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void
> *),
> > 					l234_hlen);
> > @@ -598,13 +589,80 @@ pmd_tx_burst(void *queue, struct rte_mbuf
> **bufs, uint16_t nb_pkts)
> > 		n = writev(txq->fd, iovecs, j);
> > 		if (n <= 0)
> > 			break;
> > +		(*num_packets)++;
> > +		(*num_tx_bytes) += rte_pktmbuf_pkt_len(mbuf);
> > +	}
> > +}
> > +
> > +/* Callback to handle sending packets from the tap interface  */
> > +static uint16_t pmd_tx_burst(void *queue, struct rte_mbuf **bufs,
> > +uint16_t nb_pkts) {
> > +	struct tx_queue *txq = queue;
> > +	uint16_t num_tx = 0;
> > +	uint16_t num_packets = 0;
> > +	unsigned long num_tx_bytes = 0;
> > +	uint16_t tso_segsz = 0;
> > +	uint16_t hdrs_len;
> > +	uint32_t max_size;
> > +	int i;
> > +	uint64_t tso;
> > +	int ret;
> >
> > +	if (unlikely(nb_pkts == 0))
> > +		return 0;
> > +
> > +	struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS];
> > +	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
> > +	for (i = 0; i < nb_pkts; i++) {
> > +		struct rte_mbuf *mbuf_in = bufs[num_tx];
> > +		struct rte_mbuf **mbuf;
> > +		uint16_t num_mbufs;
> > +
> > +		tso = mbuf_in->ol_flags & PKT_TX_TCP_SEG;
> > +		if (tso) {
> > +			struct rte_gso_ctx *gso_ctx = &txq->gso_ctx;
> 
> Missing blank line here, this one is not optional.
> > +			assert(gso_ctx != NULL);
> 
> Blank line would be nice.
> > +			/* TCP segmentation implies TCP checksum offload
> */
> > +			mbuf_in->ol_flags |= PKT_TX_TCP_CKSUM;
> 
> Blank line would be nice.
> > +			/* gso size is calculated without ETHER_CRC_LEN */
> > +			hdrs_len = mbuf_in->l2_len + mbuf_in->l3_len +
> > +					mbuf_in->l4_len;
> > +			tso_segsz = mbuf_in->tso_segsz + hdrs_len;
> > +			if (unlikely(tso_segsz == hdrs_len) ||
> > +				tso_segsz > *txq->mtu) {
> > +				txq->stats.errs++;
> > +				break;
> > +			}
> > +			gso_ctx->gso_size = tso_segsz;
> > +			ret = rte_gso_segment(mbuf_in, /* packet to
> segment */
> > +				gso_ctx, /* gso control block */
> > +				(struct rte_mbuf **)&gso_mbufs, /* out
> mbufs */
> > +				RTE_DIM(gso_mbufs)); /* max tso mbufs */
> > +
> > +			/* ret contains the number of new created mbufs */
> > +			if (ret < 0)
> > +				break;
> > +
> > +			mbuf = gso_mbufs;
> > +			num_mbufs = ret;
> > +		} else {
> > +			/* stats.errs will be incremented */
> > +			if (rte_pktmbuf_pkt_len(mbuf_in) > max_size)
> > +				break;
> > +
> > +			mbuf = &mbuf_in;
> > +			num_mbufs = 1;
> > +		}
> > +
> > +		tap_write_mbufs(txq, num_mbufs, mbuf, hdrs_len,
> > +				&num_packets, &num_tx_bytes);
> > 		num_tx++;
> > -		num_tx_bytes += mbuf->pkt_len;
> > -		rte_pktmbuf_free(mbuf);
> > +		rte_pktmbuf_free(mbuf_in);
> > 	}
> >
> > -	txq->stats.opackets += num_tx;
> > +	txq->stats.opackets += num_packets;
> > 	txq->stats.errs += nb_pkts - num_tx;
> > 	txq->stats.obytes += num_tx_bytes;
> >
> > @@ -1066,31 +1124,73 @@ tap_mac_set(struct rte_eth_dev *dev, struct
> > ether_addr *mac_addr) }
> >
> > static int
> > +tap_gso_ctx_setup(struct rte_gso_ctx *gso_ctx, struct rte_eth_dev
> > +*dev) {
> > +	uint32_t gso_types;
> > +	char pool_name[64];
> > +
> > +	/*
> > +	 * Create private mbuf pool with TAP_GSO_MBUF_SEG_SIZE bytes
> > +	 * size per mbuf use this pool for both direct and indirect mbufs
> > +	 */
> > +
> > +	struct rte_mempool *mp;      /* Mempool for GSO packets */
> > +	/* initialize GSO context */
> > +	gso_types = DEV_TX_OFFLOAD_TCP_TSO;
> > +	snprintf(pool_name, sizeof(pool_name), "mp_%s", dev->device-
> >name);
> > +	mp = rte_mempool_lookup((const char *)pool_name);
> > +	if (!mp) {
> > +		mp = rte_pktmbuf_pool_create(pool_name,
> TAP_GSO_MBUFS_NUM,
> > +			0, 0, RTE_PKTMBUF_HEADROOM +
> TAP_GSO_MBUF_SEG_SIZE,
> > +			SOCKET_ID_ANY);
> 
> You have set up the mempool with no cache size, which means you have to
> take a lock for each allocation. This could be changed to have a small cache per
> lcore, say 8, but the total number of mbufs needs to be large enough to not
> allow starvation for any lcore:  total_mbufs = (max_num_ports * cache_size) +
> some_extra mbufs;
> 

I will set cache_size to 4.
The total_mbufs will be mbufs_per_core (128) * cache_size (4), where max_num_ports
is already taken into consideration in mbufs_per_core. For example, for a TCP packet of 1024 bytes and a TSO max segment size of 256 bytes, GSO will allocate 5 mbufs (one direct and four indirect) regardless of the number of ports.
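
Concretely, the sizing would then look like this (a sketch of the intended
defines, using the values above):

	#define TAP_GSO_MBUFS_PER_CORE	128
	#define TAP_GSO_MBUF_SEG_SIZE	128
	#define TAP_GSO_MBUF_CACHE_SIZE	4
	#define TAP_GSO_MBUFS_NUM \
		(TAP_GSO_MBUFS_PER_CORE * TAP_GSO_MBUF_CACHE_SIZE)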

> > +		if (!mp) {
> > +			struct pmd_internals *pmd = dev->data-
> >dev_private;
> > +			RTE_LOG(DEBUG, PMD, "%s: failed to create mbuf
> pool for device %s\n",
> > +				pmd->name, dev->device->name);
> > +			return -1;
> > +		}
> > +	}
> > +
> > +	gso_ctx->direct_pool = mp;
> > +	gso_ctx->indirect_pool = mp;
> > +	gso_ctx->gso_types = gso_types;
> > +	gso_ctx->gso_size = 0; /* gso_size is set in tx_burst() per packet */
> > +	gso_ctx->flag = 0;
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > tap_setup_queue(struct rte_eth_dev *dev,
> > 		struct pmd_internals *internals,
> > 		uint16_t qid,
> > 		int is_rx)
> > {
> > +	int ret;
> > 	int *fd;
> > 	int *other_fd;
> > 	const char *dir;
> > 	struct pmd_internals *pmd = dev->data->dev_private;
> > 	struct rx_queue *rx = &internals->rxq[qid];
> > 	struct tx_queue *tx = &internals->txq[qid];
> > +	struct rte_gso_ctx *gso_ctx;
> >
> > 	if (is_rx) {
> > 		fd = &rx->fd;
> > 		other_fd = &tx->fd;
> > 		dir = "rx";
> > +		gso_ctx = NULL;
> > 	} else {
> > 		fd = &tx->fd;
> > 		other_fd = &rx->fd;
> > 		dir = "tx";
> > +		gso_ctx = &tx->gso_ctx;
> > 	}
> > 	if (*fd != -1) {
> > 		/* fd for this queue already exists */
> > 		TAP_LOG(DEBUG, "%s: fd %d for %s queue qid %d exists",
> > 			pmd->name, *fd, dir, qid);
> > +		gso_ctx = NULL;
> > 	} else if (*other_fd != -1) {
> > 		/* Only other_fd exists. dup it */
> > 		*fd = dup(*other_fd);
> > @@ -1115,6 +1215,11 @@ tap_setup_queue(struct rte_eth_dev *dev,
> >
> > 	tx->mtu = &dev->data->mtu;
> > 	rx->rxmode = &dev->data->dev_conf.rxmode;
> > +	if (gso_ctx) {
> > +		ret = tap_gso_ctx_setup(gso_ctx, dev);
> > +		if (ret)
> > +			return -1;
> > +	}
> >
> > 	tx->type = pmd->type;
> >
> > diff --git a/drivers/net/tap/rte_eth_tap.h
> > b/drivers/net/tap/rte_eth_tap.h index 7b21d0d..44e2773 100644
> > --- a/drivers/net/tap/rte_eth_tap.h
> > +++ b/drivers/net/tap/rte_eth_tap.h
> > @@ -15,6 +15,7 @@
> >
> > #include <rte_ethdev_driver.h>
> > #include <rte_ether.h>
> > +#include <rte_gso.h>
> > #include "tap_log.h"
> >
> > #ifdef IFF_MULTI_QUEUE
> > @@ -22,6 +23,7 @@
> > #else
> > #define RTE_PMD_TAP_MAX_QUEUES	1
> > #endif
> > +#define MAX_GSO_MBUFS 64
> >
> > enum rte_tuntap_type {
> > 	ETH_TUNTAP_TYPE_UNKNOWN,
> > @@ -59,6 +61,7 @@ struct tx_queue {
> > 	uint16_t *mtu;                  /* Pointer to MTU from dev_data */
> > 	uint16_t csum:1;                /* Enable checksum offloading */
> > 	struct pkt_stats stats;         /* Stats for this TX queue */
> > +	struct rte_gso_ctx gso_ctx;     /* GSO context */
> > };
> >
> > struct pmd_internals {
> > diff --git a/mk/rte.app.mk b/mk/rte.app.mk index 1e32c83..e2ee879
> > 100644
> > --- a/mk/rte.app.mk
> > +++ b/mk/rte.app.mk
> > @@ -38,8 +38,6 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PORT)           += -
> lrte_port
> > _LDLIBS-$(CONFIG_RTE_LIBRTE_PDUMP)          += -lrte_pdump
> > _LDLIBS-$(CONFIG_RTE_LIBRTE_DISTRIBUTOR)    += -lrte_distributor
> > _LDLIBS-$(CONFIG_RTE_LIBRTE_IP_FRAG)        += -lrte_ip_frag
> > -_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
> > -_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
> > _LDLIBS-$(CONFIG_RTE_LIBRTE_METER)          += -lrte_meter
> > _LDLIBS-$(CONFIG_RTE_LIBRTE_LPM)            += -lrte_lpm
> > # librte_acl needs --whole-archive because of weak functions @@ -61,6
> > +59,8 @@ endif _LDLIBS-y += --whole-archive
> >
> > _LDLIBS-$(CONFIG_RTE_LIBRTE_CFGFILE)        += -lrte_cfgfile
> > +_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
> > +_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
> > _LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)           += -lrte_hash
> > _LDLIBS-$(CONFIG_RTE_LIBRTE_MEMBER)         += -lrte_member
> > _LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)          += -lrte_vhost
> > --
> > 2.7.4
> >
> 
> Regards,
> Keith

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload)
  2018-06-14  7:59         ` Ophir Munk
@ 2018-06-14 12:58           ` Wiles, Keith
  0 siblings, 0 replies; 31+ messages in thread
From: Wiles, Keith @ 2018-06-14 12:58 UTC (permalink / raw)
  To: Ophir Munk; +Cc: DPDK, Pascal Mazon, Thomas Monjalon, Olga Shern



> On Jun 14, 2018, at 2:59 AM, Ophir Munk <ophirmu@mellanox.com> wrote:
> 
> 
> 
>> -----Original Message-----
>> From: Wiles, Keith [mailto:keith.wiles@intel.com]
>> Sent: Wednesday, June 13, 2018 7:04 PM
>> To: Ophir Munk <ophirmu@mellanox.com>
>> Cc: DPDK <dev@dpdk.org>; Pascal Mazon <pascal.mazon@6wind.com>;
>> Thomas Monjalon <thomas@monjalon.net>; Olga Shern
>> <olgas@mellanox.com>
>> Subject: Re: [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload)
>> 
>> 
>> 
>>> On Jun 12, 2018, at 11:31 AM, Ophir Munk <ophirmu@mellanox.com>
>> wrote:
>>> 
>>> This commit implements TCP segmentation offload in TAP.
>>> librte_gso library is used to segment large TCP payloads (e.g. packets
>>> of 64K bytes size) into smaller MTU size buffers.
>>> By supporting TSO offload capability in software a TAP device can be
>>> used as a failsafe sub device and be paired with another PCI device
>>> which supports TSO capability in HW.
>>> 
>>> For more details on librte_gso implementation please refer to dpdk
>>> documentation.
>>> The number of newly generated TCP TSO segments is limited to 64.
>>> 
>>> Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
>>> Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
>>> ---
>>> drivers/net/tap/Makefile      |   2 +-
>>> drivers/net/tap/rte_eth_tap.c | 159
>> +++++++++++++++++++++++++++++++++++-------
>>> drivers/net/tap/rte_eth_tap.h |   3 +
>>> mk/rte.app.mk                 |   4 +-
>>> 4 files changed, 138 insertions(+), 30 deletions(-)
>> 
>> You have set up the mempool with no cache size, which means you have to
>> take a lock for each allocation. This could be changed to have a small cache per
>> lcore, say 8, but the total number of mbufs needs to be large enough to not
>> allow starvation for any lcore:  total_mbufs = (max_num_ports * cache_size) +
>> some_extra mbufs;
>> 
> 
> I will set cache_size to 4.
> The total_mbufs will be mbufs_per_core (128) * cache_size (4), where max_num_ports
> is already taken into consideration in mbufs_per_core. For example, for a TCP packet of 1024 bytes and a TSO max segment size of 256 bytes, GSO will allocate 5 mbufs (one direct and four indirect) regardless of the number of ports.

Sounds good, thanks.


Regards,
Keith

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v5 0/2] TAP TSO
  2018-06-12 16:31     ` [dpdk-dev] [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  2018-06-12 17:22       ` Wiles, Keith
  2018-06-13 16:04       ` Wiles, Keith
@ 2018-06-23 23:17       ` Ophir Munk
  2018-06-23 23:17         ` [dpdk-dev] [PATCH v5 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
  2018-06-23 23:17         ` [dpdk-dev] [PATCH v5 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  2 siblings, 2 replies; 31+ messages in thread
From: Ophir Munk @ 2018-06-23 23:17 UTC (permalink / raw)
  To: dev, Keith Wiles; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

v1: 
- Initial release

v2: 
- Fixing cksum errors
- TCP segment size refers to TCP payload size (not including l2,l3,l4 headers)

v3 (8 May 2018):
- Bug fix in case the input mbuf is segmented
- Following review comments by Raslan Darawsheh

This patch implements TAP TSO (TCP segmentation offload) in SW.
It uses the DPDK library librte_gso.
The DPDK librte_gso library segments large TCP payloads (e.g. 64K bytes)
into smaller-sized buffers.
By supporting TSO offload capability in software, a TAP device can be used
as a failsafe sub-device and be paired with another PCI device which
supports TSO capability in HW.

This patch includes 2 commits:
1. Calculation of IP/TCP/UDP checksums for multi-segment packets.
Previously checksum offload was skipped if the number of packet segments
was greater than 1.
This commit removes this limitation. It is required before supporting TAP TSO
since a generated TCP TSO packet may be composed of two segments, where the
first segment includes the l2, l3, l4 headers.
2. TAP TSO implementation: calling rte_gso_segment() to segment large TCP
packets (see the sketch below).
This commit creates a small private mbuf pool in the TAP PMD, required by
librte_gso. The number of buffers will be 64, each of 128 bytes length.
The TSO segment size refers to the TCP payload size (not including the
l2, l3, l4 headers). librte_gso supports TCP segmentation on top of IPv4.
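
As a sketch, the per-packet TX flow added by commit 2 is roughly the following
(simplified from the patch; error handling and the non-TSO path are omitted):

	if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
		/* TCP segmentation implies TCP checksum offload */
		mbuf->ol_flags |= PKT_TX_TCP_CKSUM;
		/* gso_size counts l2/l3/l4 headers, but not ETHER_CRC_LEN */
		gso_ctx->gso_size = mbuf->tso_segsz +
			mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
		/* on success, returns the number of newly created mbufs */
		ret = rte_gso_segment(mbuf, gso_ctx, gso_mbufs,
				      RTE_DIM(gso_mbufs));
	}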

The series was marked as suppressed before the 18.05 release in order to include
it in 18.08.

v4 (12 Jun 2018):
Updates following a rebase on top of v18.05

v5:
- Follow review comments
  https://patches.dpdk.org/patch/41011/
  https://patches.dpdk.org/patch/41013/
- Free GSO mbufs after they have been written
- Move some local variable declarations to an inner scope

Ophir Munk (2):
  net/tap: calculate checksums of multi segs packets
  net/tap: support TSO (TCP Segment Offload)

 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 329 +++++++++++++++++++++++++++++++++---------
 drivers/net/tap/rte_eth_tap.h |   3 +
 mk/rte.app.mk                 |   4 +-
 4 files changed, 265 insertions(+), 73 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v5 1/2] net/tap: calculate checksums of multi segs packets
  2018-06-23 23:17       ` [dpdk-dev] [PATCH v5 0/2] TAP TSO Ophir Munk
@ 2018-06-23 23:17         ` Ophir Munk
  2018-06-24 13:45           ` Wiles, Keith
  2018-06-23 23:17         ` [dpdk-dev] [PATCH v5 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
  1 sibling, 1 reply; 31+ messages in thread
From: Ophir Munk @ 2018-06-23 23:17 UTC (permalink / raw)
  To: dev, Keith Wiles; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

Prior to this commit, IP/UDP/TCP checksum offload calculations
were skipped in case of a multi-segment packet.
This commit enables TAP checksum calculations for multi-segment
packets.
The only restriction is that the first segment must contain the
headers of layers 3 (IP) and 4 (UDP or TCP).
Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/rte_eth_tap.c | 161 ++++++++++++++++++++++++++++++------------
 1 file changed, 114 insertions(+), 47 deletions(-)

diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index df396bf..8903646 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -415,9 +415,41 @@ tap_tx_offload_get_queue_capa(void)
 	       DEV_TX_OFFLOAD_TCP_CKSUM;
 }
 
+/* Finalize l4 checksum calculation */
 static void
-tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
-	       unsigned int l3_len)
+tap_tx_l4_cksum(uint16_t *l4_cksum, uint16_t l4_phdr_cksum,
+		uint32_t l4_raw_cksum)
+{
+	if (l4_cksum) {
+		uint32_t cksum;
+
+		cksum = __rte_raw_cksum_reduce(l4_raw_cksum);
+		cksum += l4_phdr_cksum;
+
+		cksum = ((cksum & 0xffff0000) >> 16) + (cksum & 0xffff);
+		cksum = (~cksum) & 0xffff;
+		if (cksum == 0)
+			cksum = 0xffff;
+		*l4_cksum = cksum;
+	}
+}
+
+/* Accumulate L4 raw checksums */
+static void
+tap_tx_l4_add_rcksum(char *l4_data, unsigned int l4_len, uint16_t *l4_cksum,
+			uint32_t *l4_raw_cksum)
+{
+	if (l4_cksum == NULL)
+		return;
+
+	*l4_raw_cksum = __rte_raw_cksum(l4_data, l4_len, *l4_raw_cksum);
+}
+
+/* L3 and L4 pseudo headers checksum offloads */
+static void
+tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
+		unsigned int l3_len, unsigned int l4_len, uint16_t **l4_cksum,
+		uint16_t *l4_phdr_cksum, uint32_t *l4_raw_cksum)
 {
 	void *l3_hdr = packet + l2_len;
 
@@ -430,38 +462,21 @@ tap_tx_offload(char *packet, uint64_t ol_flags, unsigned int l2_len,
 		iph->hdr_checksum = (cksum == 0xffff) ? cksum : ~cksum;
 	}
 	if (ol_flags & PKT_TX_L4_MASK) {
-		uint16_t l4_len;
-		uint32_t cksum;
-		uint16_t *l4_cksum;
 		void *l4_hdr;
 
 		l4_hdr = packet + l2_len + l3_len;
 		if ((ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM)
-			l4_cksum = &((struct udp_hdr *)l4_hdr)->dgram_cksum;
+			*l4_cksum = &((struct udp_hdr *)l4_hdr)->dgram_cksum;
 		else if ((ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM)
-			l4_cksum = &((struct tcp_hdr *)l4_hdr)->cksum;
+			*l4_cksum = &((struct tcp_hdr *)l4_hdr)->cksum;
 		else
 			return;
-		*l4_cksum = 0;
-		if (ol_flags & PKT_TX_IPV4) {
-			struct ipv4_hdr *iph = l3_hdr;
-
-			l4_len = rte_be_to_cpu_16(iph->total_length) - l3_len;
-			cksum = rte_ipv4_phdr_cksum(l3_hdr, 0);
-		} else {
-			struct ipv6_hdr *ip6h = l3_hdr;
-
-			/* payload_len does not include ext headers */
-			l4_len = rte_be_to_cpu_16(ip6h->payload_len) -
-				l3_len + sizeof(struct ipv6_hdr);
-			cksum = rte_ipv6_phdr_cksum(l3_hdr, 0);
-		}
-		cksum += rte_raw_cksum(l4_hdr, l4_len);
-		cksum = ((cksum & 0xffff0000) >> 16) + (cksum & 0xffff);
-		cksum = (~cksum) & 0xffff;
-		if (cksum == 0)
-			cksum = 0xffff;
-		*l4_cksum = cksum;
+		**l4_cksum = 0;
+		if (ol_flags & PKT_TX_IPV4)
+			*l4_phdr_cksum = rte_ipv4_phdr_cksum(l3_hdr, 0);
+		else
+			*l4_phdr_cksum = rte_ipv6_phdr_cksum(l3_hdr, 0);
+		*l4_raw_cksum = __rte_raw_cksum(l4_hdr, l4_len, 0);
 	}
 }
 
@@ -482,17 +497,27 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
 	for (i = 0; i < nb_pkts; i++) {
 		struct rte_mbuf *mbuf = bufs[num_tx];
-		struct iovec iovecs[mbuf->nb_segs + 1];
+		struct iovec iovecs[mbuf->nb_segs + 2];
 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
 		struct rte_mbuf *seg = mbuf;
 		char m_copy[mbuf->data_len];
+		int proto;
 		int n;
 		int j;
+		int k; /* first index in iovecs for copying segments */
+		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
+		uint16_t seg_len; /* length of first segment */
+		uint16_t nb_segs;
+		uint16_t *l4_cksum; /* l4 checksum (pseudo header + payload) */
+		uint32_t l4_raw_cksum = 0; /* TCP/UDP payload raw checksum */
+		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header checksum */
+		uint16_t is_cksum = 0; /* in case cksum should be offloaded */
 
 		/* stats.errs will be incremented */
 		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
 			break;
 
+		l4_cksum = NULL;
 		if (txq->type == ETH_TUNTAP_TYPE_TUN) {
 			/*
 			 * TUN and TAP are created with IFF_NO_PI disabled.
@@ -505,35 +530,77 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			 * is 4 or 6, then protocol field is updated.
 			 */
 			char *buff_data = rte_pktmbuf_mtod(seg, void *);
-			j = (*buff_data & 0xf0);
-			pi.proto = (j == 0x40) ? rte_cpu_to_be_16(ETHER_TYPE_IPv4) :
-				(j == 0x60) ? rte_cpu_to_be_16(ETHER_TYPE_IPv6) : 0x00;
+			proto = (*buff_data & 0xf0);
+			pi.proto = (proto == 0x40) ?
+				rte_cpu_to_be_16(ETHER_TYPE_IPv4) :
+				((proto == 0x60) ?
+					rte_cpu_to_be_16(ETHER_TYPE_IPv6) :
+					0x00);
 		}
 
-		iovecs[0].iov_base = &pi;
-		iovecs[0].iov_len = sizeof(pi);
-		for (j = 1; j <= mbuf->nb_segs; j++) {
-			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
-			iovecs[j].iov_base =
-				rte_pktmbuf_mtod(seg, void *);
-			seg = seg->next;
-		}
+		k = 0;
+		iovecs[k].iov_base = &pi;
+		iovecs[k].iov_len = sizeof(pi);
+		k++;
+
+		nb_segs = mbuf->nb_segs;
 		if (txq->csum &&
 		    ((mbuf->ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_IPV4) ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM ||
 		     (mbuf->ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM))) {
-			/* Support only packets with all data in the same seg */
-			if (mbuf->nb_segs > 1)
+			is_cksum = 1;
+
+			/* Support only packets with at least layer 4
+			 * header included in the first segment
+			 */
+			seg_len = rte_pktmbuf_data_len(mbuf);
+			l234_hlen = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
+			if (seg_len < l234_hlen)
 				break;
-			/* To change checksums, work on a copy of data. */
+
+			/* To change checksums, work on a
+			 * copy of l2, l3 l4 headers.
+			 */
 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
-				   rte_pktmbuf_data_len(mbuf));
-			tap_tx_offload(m_copy, mbuf->ol_flags,
-				       mbuf->l2_len, mbuf->l3_len);
-			iovecs[1].iov_base = m_copy;
+					l234_hlen);
+			tap_tx_l3_cksum(m_copy, mbuf->ol_flags,
+				       mbuf->l2_len, mbuf->l3_len, mbuf->l4_len,
+				       &l4_cksum, &l4_phdr_cksum,
+				       &l4_raw_cksum);
+			iovecs[k].iov_base = m_copy;
+			iovecs[k].iov_len = l234_hlen;
+			k++;
+
+			/* Update next iovecs[] beyond l2, l3, l4 headers */
+			if (seg_len > l234_hlen) {
+				iovecs[k].iov_len = seg_len - l234_hlen;
+				iovecs[k].iov_base =
+					rte_pktmbuf_mtod(seg, char *) +
+						l234_hlen;
+				tap_tx_l4_add_rcksum(iovecs[k].iov_base,
+					iovecs[k].iov_len, l4_cksum,
+					&l4_raw_cksum);
+				k++;
+				nb_segs++;
+			}
+			seg = seg->next;
+		}
+
+		for (j = k; j <= nb_segs; j++) {
+			iovecs[j].iov_len = rte_pktmbuf_data_len(seg);
+			iovecs[j].iov_base = rte_pktmbuf_mtod(seg, void *);
+			if (is_cksum)
+				tap_tx_l4_add_rcksum(iovecs[j].iov_base,
+					iovecs[j].iov_len, l4_cksum,
+					&l4_raw_cksum);
+			seg = seg->next;
 		}
+
+		if (is_cksum)
+			tap_tx_l4_cksum(l4_cksum, l4_phdr_cksum, l4_raw_cksum);
+
 		/* copy the tx frame data */
-		n = writev(txq->fd, iovecs, mbuf->nb_segs + 1);
+		n = writev(txq->fd, iovecs, j);
 		if (n <= 0)
 			break;
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v5 2/2] net/tap: support TSO (TCP Segment Offload)
  2018-06-23 23:17       ` [dpdk-dev] [PATCH v5 0/2] TAP TSO Ophir Munk
  2018-06-23 23:17         ` [dpdk-dev] [PATCH v5 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
@ 2018-06-23 23:17         ` Ophir Munk
  1 sibling, 0 replies; 31+ messages in thread
From: Ophir Munk @ 2018-06-23 23:17 UTC (permalink / raw)
  To: dev, Keith Wiles; +Cc: Thomas Monjalon, Olga Shern, Ophir Munk

This commit implements TCP segmentation offload in TAP.
The librte_gso library is used to segment large TCP payloads (e.g. 64KB
packets) into smaller MTU-sized buffers.
By supporting the TSO offload capability in software, a TAP device can be
used as a fail-safe sub-device and be paired with another PCI device which
supports TSO capability in HW.

For more details on the librte_gso implementation please refer to the
DPDK documentation.
The number of newly generated TCP TSO segments is limited to 64.
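
In outline, the segmentation path described above reduces to the sketch
below (a hedged illustration, not the patch itself; tap_tso_sketch is a
made-up name, while gso_ctx, MAX_GSO_MBUFS and the field usage follow the
diff; the caller writes the output mbufs to the tap fd and then frees them,
as well as the original input mbuf, exactly as pmd_tx_burst() does):

#include <rte_gso.h>
#include <rte_mbuf.h>

#define MAX_GSO_MBUFS 64	/* as defined by this patch */

/* Segment one large TCP mbuf with librte_gso; returns the number of
 * output mbufs placed in gso_mbufs, or a negative value on failure. */
static int
tap_tso_sketch(struct rte_gso_ctx *gso_ctx, struct rte_mbuf *mbuf_in,
	       struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS])
{
	uint16_t hdrs_len = mbuf_in->l2_len + mbuf_in->l3_len +
			    mbuf_in->l4_len;

	/* TCP segmentation implies TCP checksum offload */
	mbuf_in->ol_flags |= PKT_TX_TCP_CKSUM;
	/* gso_size counts headers plus payload, without ETHER_CRC_LEN */
	gso_ctx->gso_size = mbuf_in->tso_segsz + hdrs_len;
	return rte_gso_segment(mbuf_in, gso_ctx, gso_mbufs, MAX_GSO_MBUFS);
}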

Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
---
 drivers/net/tap/Makefile      |   2 +-
 drivers/net/tap/rte_eth_tap.c | 174 +++++++++++++++++++++++++++++++++++-------
 drivers/net/tap/rte_eth_tap.h |   3 +
 mk/rte.app.mk                 |   4 +-
 4 files changed, 154 insertions(+), 29 deletions(-)

diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile
index ccc5c5f..3243365 100644
--- a/drivers/net/tap/Makefile
+++ b/drivers/net/tap/Makefile
@@ -24,7 +24,7 @@ CFLAGS += -I.
 CFLAGS += $(WERROR_FLAGS)
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
 LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash
-LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_bus_vdev -lrte_gso
 
 CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES)
 
diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c
index 8903646..d137f58 100644
--- a/drivers/net/tap/rte_eth_tap.c
+++ b/drivers/net/tap/rte_eth_tap.c
@@ -17,6 +17,7 @@
 #include <rte_ip.h>
 #include <rte_string_fns.h>
 
+#include <assert.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/socket.h>
@@ -55,6 +56,12 @@
 #define ETH_TAP_CMP_MAC_FMT     "0123456789ABCDEFabcdef"
 #define ETH_TAP_MAC_ARG_FMT     ETH_TAP_MAC_FIXED "|" ETH_TAP_USR_MAC_FMT
 
+#define TAP_GSO_MBUFS_PER_CORE	128
+#define TAP_GSO_MBUF_SEG_SIZE	128
+#define TAP_GSO_MBUF_CACHE_SIZE	4
+#define TAP_GSO_MBUFS_NUM \
+	(TAP_GSO_MBUFS_PER_CORE * TAP_GSO_MBUF_CACHE_SIZE)
+
 static struct rte_vdev_driver pmd_tap_drv;
 static struct rte_vdev_driver pmd_tun_drv;
 
@@ -412,7 +419,8 @@ tap_tx_offload_get_queue_capa(void)
 	return DEV_TX_OFFLOAD_MULTI_SEGS |
 	       DEV_TX_OFFLOAD_IPV4_CKSUM |
 	       DEV_TX_OFFLOAD_UDP_CKSUM |
-	       DEV_TX_OFFLOAD_TCP_CKSUM;
+	       DEV_TX_OFFLOAD_TCP_CKSUM |
+	       DEV_TX_OFFLOAD_TCP_TSO;
 }
 
 /* Finalize l4 checksum calculation */
@@ -480,23 +488,16 @@ tap_tx_l3_cksum(char *packet, uint64_t ol_flags, unsigned int l2_len,
 	}
 }
 
-/* Callback to handle sending packets from the tap interface
- */
-static uint16_t
-pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+static inline void
+tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs,
+			struct rte_mbuf **pmbufs,
+			uint16_t *num_packets, unsigned long *num_tx_bytes)
 {
-	struct tx_queue *txq = queue;
-	uint16_t num_tx = 0;
-	unsigned long num_tx_bytes = 0;
-	uint32_t max_size;
 	int i;
+	uint16_t l234_hlen;
 
-	if (unlikely(nb_pkts == 0))
-		return 0;
-
-	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
-	for (i = 0; i < nb_pkts; i++) {
-		struct rte_mbuf *mbuf = bufs[num_tx];
+	for (i = 0; i < num_mbufs; i++) {
+		struct rte_mbuf *mbuf = pmbufs[i];
 		struct iovec iovecs[mbuf->nb_segs + 2];
 		struct tun_pi pi = { .flags = 0, .proto = 0x00 };
 		struct rte_mbuf *seg = mbuf;
@@ -504,8 +505,7 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		int proto;
 		int n;
 		int j;
-		int k; /* first index in iovecs for copying segments */
-		uint16_t l234_hlen; /* length of layers 2,3,4 headers */
+		int k; /* current index in iovecs for copying segments */
 		uint16_t seg_len; /* length of first segment */
 		uint16_t nb_segs;
 		uint16_t *l4_cksum; /* l4 checksum (pseudo header + payload) */
@@ -513,10 +513,6 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		uint16_t l4_phdr_cksum = 0; /* TCP/UDP pseudo header checksum */
 		uint16_t is_cksum = 0; /* set when cksum offload is requested */
 
-		/* stats.errs will be incremented */
-		if (rte_pktmbuf_pkt_len(mbuf) > max_size)
-			break;
-
 		l4_cksum = NULL;
 		if (txq->type == ETH_TUNTAP_TYPE_TUN) {
 			/*
@@ -558,8 +554,8 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			if (seg_len < l234_hlen)
 				break;
 
-		/* To change checksums, work on a
-		 * copy of l2, l3, l4 headers.
+		/* To change checksums, work on a copy of l2, l3
+		 * headers + l4 pseudo header
 			 */
 			rte_memcpy(m_copy, rte_pktmbuf_mtod(mbuf, void *),
 					l234_hlen);
@@ -603,13 +599,90 @@ pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		n = writev(txq->fd, iovecs, j);
 		if (n <= 0)
 			break;
+		(*num_packets)++;
+		(*num_tx_bytes) += rte_pktmbuf_pkt_len(mbuf);
+	}
+}
+
+/* Callback to handle sending packets from the tap interface
+ */
+static uint16_t
+pmd_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct tx_queue *txq = queue;
+	uint16_t num_tx = 0;
+	uint16_t num_packets = 0;
+	unsigned long num_tx_bytes = 0;
+	uint32_t max_size;
+	int i;
+
+	if (unlikely(nb_pkts == 0))
+		return 0;
 
+	struct rte_mbuf *gso_mbufs[MAX_GSO_MBUFS];
+	max_size = *txq->mtu + (ETHER_HDR_LEN + ETHER_CRC_LEN + 4);
+	for (i = 0; i < nb_pkts; i++) {
+		struct rte_mbuf *mbuf_in = bufs[num_tx];
+		struct rte_mbuf **mbuf;
+		uint16_t num_mbufs = 0;
+		uint16_t tso_segsz = 0;
+		int ret;
+		uint16_t hdrs_len;
+		int j;
+		uint64_t tso;
+
+		tso = mbuf_in->ol_flags & PKT_TX_TCP_SEG;
+		if (tso) {
+			struct rte_gso_ctx *gso_ctx = &txq->gso_ctx;
+
+			assert(gso_ctx != NULL);
+
+			/* TCP segmentation implies TCP checksum offload */
+			mbuf_in->ol_flags |= PKT_TX_TCP_CKSUM;
+
+			/* gso size is calculated without ETHER_CRC_LEN */
+			hdrs_len = mbuf_in->l2_len + mbuf_in->l3_len +
+					mbuf_in->l4_len;
+			tso_segsz = mbuf_in->tso_segsz + hdrs_len;
+			if (unlikely(tso_segsz == hdrs_len) ||
+				tso_segsz > *txq->mtu) {
+				txq->stats.errs++;
+				break;
+			}
+			gso_ctx->gso_size = tso_segsz;
+			ret = rte_gso_segment(mbuf_in, /* packet to segment */
+				gso_ctx, /* gso control block */
+				(struct rte_mbuf **)&gso_mbufs, /* out mbufs */
+				RTE_DIM(gso_mbufs)); /* max tso mbufs */
+
+			/* ret contains the number of newly created mbufs */
+			if (ret < 0)
+				break;
+
+			mbuf = gso_mbufs;
+			num_mbufs = ret;
+		} else {
+			/* stats.errs will be incremented */
+			if (rte_pktmbuf_pkt_len(mbuf_in) > max_size)
+				break;
+
+			/* ret 0 indicates no new mbufs were created */
+			ret = 0;
+			mbuf = &mbuf_in;
+			num_mbufs = 1;
+		}
+
+		tap_write_mbufs(txq, num_mbufs, mbuf,
+				&num_packets, &num_tx_bytes);
 		num_tx++;
-		num_tx_bytes += mbuf->pkt_len;
-		rte_pktmbuf_free(mbuf);
+		/* free original mbuf */
+		rte_pktmbuf_free(mbuf_in);
+		/* free tso mbufs */
+		for (j = 0; j < ret; j++)
+			rte_pktmbuf_free(mbuf[j]);
 	}
 
-	txq->stats.opackets += num_tx;
+	txq->stats.opackets += num_packets;
 	txq->stats.errs += nb_pkts - num_tx;
 	txq->stats.obytes += num_tx_bytes;
 
@@ -1071,31 +1144,75 @@ tap_mac_set(struct rte_eth_dev *dev, struct ether_addr *mac_addr)
 }
 
 static int
+tap_gso_ctx_setup(struct rte_gso_ctx *gso_ctx, struct rte_eth_dev *dev)
+{
+	uint32_t gso_types;
+	char pool_name[64];
+
+	/*
+	 * Create a private mbuf pool with TAP_GSO_MBUF_SEG_SIZE bytes
+	 * per mbuf; use this pool for both direct and indirect mbufs
+	 */
+
+	struct rte_mempool *mp;      /* Mempool for GSO packets */
+
+	/* initialize GSO context */
+	gso_types = DEV_TX_OFFLOAD_TCP_TSO;
+	snprintf(pool_name, sizeof(pool_name), "mp_%s", dev->device->name);
+	mp = rte_mempool_lookup((const char *)pool_name);
+	if (!mp) {
+		mp = rte_pktmbuf_pool_create(pool_name, TAP_GSO_MBUFS_NUM,
+			TAP_GSO_MBUF_CACHE_SIZE, 0,
+			RTE_PKTMBUF_HEADROOM + TAP_GSO_MBUF_SEG_SIZE,
+			SOCKET_ID_ANY);
+		if (!mp) {
+			struct pmd_internals *pmd = dev->data->dev_private;
+			RTE_LOG(DEBUG, PMD, "%s: failed to create mbuf pool for device %s\n",
+				pmd->name, dev->device->name);
+			return -1;
+		}
+	}
+
+	gso_ctx->direct_pool = mp;
+	gso_ctx->indirect_pool = mp;
+	gso_ctx->gso_types = gso_types;
+	gso_ctx->gso_size = 0; /* gso_size is set in tx_burst() per packet */
+	gso_ctx->flag = 0;
+
+	return 0;
+}
+
+static int
 tap_setup_queue(struct rte_eth_dev *dev,
 		struct pmd_internals *internals,
 		uint16_t qid,
 		int is_rx)
 {
+	int ret;
 	int *fd;
 	int *other_fd;
 	const char *dir;
 	struct pmd_internals *pmd = dev->data->dev_private;
 	struct rx_queue *rx = &internals->rxq[qid];
 	struct tx_queue *tx = &internals->txq[qid];
+	struct rte_gso_ctx *gso_ctx;
 
 	if (is_rx) {
 		fd = &rx->fd;
 		other_fd = &tx->fd;
 		dir = "rx";
+		gso_ctx = NULL;
 	} else {
 		fd = &tx->fd;
 		other_fd = &rx->fd;
 		dir = "tx";
+		gso_ctx = &tx->gso_ctx;
 	}
 	if (*fd != -1) {
 		/* fd for this queue already exists */
 		TAP_LOG(DEBUG, "%s: fd %d for %s queue qid %d exists",
 			pmd->name, *fd, dir, qid);
+		gso_ctx = NULL;
 	} else if (*other_fd != -1) {
 		/* Only other_fd exists. dup it */
 		*fd = dup(*other_fd);
@@ -1120,6 +1237,11 @@ tap_setup_queue(struct rte_eth_dev *dev,
 
 	tx->mtu = &dev->data->mtu;
 	rx->rxmode = &dev->data->dev_conf.rxmode;
+	if (gso_ctx) {
+		ret = tap_gso_ctx_setup(gso_ctx, dev);
+		if (ret)
+			return -1;
+	}
 
 	tx->type = pmd->type;
 
diff --git a/drivers/net/tap/rte_eth_tap.h b/drivers/net/tap/rte_eth_tap.h
index 7b21d0d..44e2773 100644
--- a/drivers/net/tap/rte_eth_tap.h
+++ b/drivers/net/tap/rte_eth_tap.h
@@ -15,6 +15,7 @@
 
 #include <rte_ethdev_driver.h>
 #include <rte_ether.h>
+#include <rte_gso.h>
 #include "tap_log.h"
 
 #ifdef IFF_MULTI_QUEUE
@@ -22,6 +23,7 @@
 #else
 #define RTE_PMD_TAP_MAX_QUEUES	1
 #endif
+#define MAX_GSO_MBUFS 64
 
 enum rte_tuntap_type {
 	ETH_TUNTAP_TYPE_UNKNOWN,
@@ -59,6 +61,7 @@ struct tx_queue {
 	uint16_t *mtu;                  /* Pointer to MTU from dev_data */
 	uint16_t csum:1;                /* Enable checksum offloading */
 	struct pkt_stats stats;         /* Stats for this TX queue */
+	struct rte_gso_ctx gso_ctx;     /* GSO context */
 };
 
 struct pmd_internals {
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 1e32c83..e2ee879 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -38,8 +38,6 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PORT)           += -lrte_port
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PDUMP)          += -lrte_pdump
 _LDLIBS-$(CONFIG_RTE_LIBRTE_DISTRIBUTOR)    += -lrte_distributor
 _LDLIBS-$(CONFIG_RTE_LIBRTE_IP_FRAG)        += -lrte_ip_frag
-_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
-_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METER)          += -lrte_meter
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LPM)            += -lrte_lpm
 # librte_acl needs --whole-archive because of weak functions
@@ -61,6 +59,8 @@ endif
 _LDLIBS-y += --whole-archive
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_CFGFILE)        += -lrte_cfgfile
+_LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
+_LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
 _LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)           += -lrte_hash
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MEMBER)         += -lrte_member
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)          += -lrte_vhost
-- 
2.7.4

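For completeness, this is what a caller has to set on an mbuf for the PMD
to take the TSO path added by this series (a hedged sketch, assuming
DEV_TX_OFFLOAD_TCP_TSO was enabled in dev_conf.txmode.offloads at configure
time; prepare_tso_mbuf is a made-up helper and the header sizes assume plain
IPv4/TCP with no options):

#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_tcp.h>
#include <rte_mbuf.h>

/* Mark an mbuf so the TAP PMD segments it via librte_gso on transmit. */
static void
prepare_tso_mbuf(struct rte_mbuf *m, uint16_t mss)
{
	m->l2_len = sizeof(struct ether_hdr);
	m->l3_len = sizeof(struct ipv4_hdr);
	m->l4_len = sizeof(struct tcp_hdr); /* no TCP options assumed */
	m->tso_segsz = mss;                 /* payload bytes per output segment */
	m->ol_flags |= PKT_TX_TCP_SEG | PKT_TX_IPV4 | PKT_TX_IP_CKSUM;
}
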
^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v5 1/2] net/tap: calculate checksums of multi segs packets
  2018-06-23 23:17         ` [dpdk-dev] [PATCH v5 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
@ 2018-06-24 13:45           ` Wiles, Keith
  2018-06-27 13:11             ` Ferruh Yigit
  0 siblings, 1 reply; 31+ messages in thread
From: Wiles, Keith @ 2018-06-24 13:45 UTC (permalink / raw)
  To: Ophir Munk; +Cc: dev, Thomas Monjalon, Olga Shern



> On Jun 23, 2018, at 6:17 PM, Ophir Munk <ophirmu@mellanox.com> wrote:
> 
> Prior to this commit IP/UDP/TCP checksum offload calculations
> were skipped in case of a multi segments packet.
> This commit enables TAP checksum calculations for multi segments
> packets.
> The only restriction is that the first segment must contain
> headers of layers 3 (IP) and 4 (UDP or TCP)
> 
> Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
> Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
> ---

Looks good. Still could have had a few more blank lines IMHO :-)

Acked-by: Keith Wiles <keith.wiles@intel.com> for Series

Regards,
Keith


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v5 1/2] net/tap: calculate checksums of multi segs packets
  2018-06-24 13:45           ` Wiles, Keith
@ 2018-06-27 13:11             ` Ferruh Yigit
  0 siblings, 0 replies; 31+ messages in thread
From: Ferruh Yigit @ 2018-06-27 13:11 UTC (permalink / raw)
  To: Wiles, Keith, Ophir Munk; +Cc: dev, Thomas Monjalon, Olga Shern

On 6/24/2018 2:45 PM, Wiles, Keith wrote:
> 
> 
>> On Jun 23, 2018, at 6:17 PM, Ophir Munk <ophirmu@mellanox.com> wrote:
>>
>> Prior to this commit IP/UDP/TCP checksum offload calculations
>> were skipped in case of a multi segments packet.
>> This commit enables TAP checksum calculations for multi segments
>> packets.
>> The only restriction is that the first segment must contain
>> headers of layers 3 (IP) and 4 (UDP or TCP)
>>
>> Reviewed-by: Raslan Darawsheh <rasland@mellanox.com>
>> Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
>> ---
> 
> Looks good. Still could have had a few more blank lines IMHO :-)
> 
> Acked-by: Keith Wiles <keith.wiles@intel.com> for Series

Series applied to dpdk-next-net/master, thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-06-27 13:11 UTC | newest]

Thread overview: 31+ messages
-- links below jump to the message on this page --
2018-03-09 21:10 [dpdk-dev] [RFC 0/2] TAP TSO Implementation Ophir Munk
2018-03-09 21:10 ` [dpdk-dev] [RFC 1/2] net/tap: calculate checksum for multi segs packets Ophir Munk
2018-04-09 22:33   ` [dpdk-dev] [PATCH v1 0/2] TAP TSO Ophir Munk
2018-04-09 22:33     ` [dpdk-dev] [PATCH v1 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
2018-04-09 22:33     ` [dpdk-dev] [PATCH v1 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
2018-04-22 11:30       ` [dpdk-dev] [PATCH v2 0/2] TAP TSO Ophir Munk
2018-04-22 11:30         ` [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
2018-05-07 21:54           ` [dpdk-dev] [PATCH v3 0/2] TAP TSO Ophir Munk
2018-05-07 21:54             ` [dpdk-dev] [PATCH v3 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
2018-05-07 21:54             ` [dpdk-dev] [PATCH v3 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
2018-05-31 13:52           ` [dpdk-dev] [PATCH v2 1/2] net/tap: calculate checksums of multi segs packets Ferruh Yigit
2018-05-31 13:54             ` Ferruh Yigit
2018-04-22 11:30         ` [dpdk-dev] [PATCH v2 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
2018-06-12 16:31   ` [dpdk-dev] [PATCH v4 0/2] TAP TSO Ophir Munk
2018-06-12 16:31     ` [dpdk-dev] [PATCH v4 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
2018-06-12 17:17       ` Wiles, Keith
2018-06-12 16:31     ` [dpdk-dev] [PATCH v4 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
2018-06-12 17:22       ` Wiles, Keith
2018-06-13 16:04       ` Wiles, Keith
2018-06-14  7:59         ` Ophir Munk
2018-06-14 12:58           ` Wiles, Keith
2018-06-23 23:17       ` [dpdk-dev] [PATCH v5 0/2] TAP TSO Ophir Munk
2018-06-23 23:17         ` [dpdk-dev] [PATCH v5 1/2] net/tap: calculate checksums of multi segs packets Ophir Munk
2018-06-24 13:45           ` Wiles, Keith
2018-06-27 13:11             ` Ferruh Yigit
2018-06-23 23:17         ` [dpdk-dev] [PATCH v5 2/2] net/tap: support TSO (TCP Segment Offload) Ophir Munk
2018-03-09 21:10 ` [dpdk-dev] [RFC 2/2] net/tap: implement TAP TSO Ophir Munk
2018-04-09 16:38 ` [dpdk-dev] [RFC 0/2] TAP TSO Implementation Ferruh Yigit
2018-04-09 22:37   ` Ophir Munk
2018-04-10 14:30     ` Ferruh Yigit
2018-04-10 15:31       ` Ophir Munk
