DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH 0/5] virtio/mbuf: fix virtio tso with shared mbufs
@ 2016-11-24  8:56 Olivier Matz
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 1/5] mbuf: remove const attribute in mbuf read function Olivier Matz
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Olivier Matz @ 2016-11-24  8:56 UTC (permalink / raw)
  To: dev, yuanhan.liu; +Cc: maxime.coquelin, huawei.xie, stephen

This patchset fixes the transmission of cloned mbufs when using
virtio + TSO. The problem is we need to fix the L4 checksum in the
packet, but it should be considered as read-only, as pointed-out
by Stephen here:
http://dpdk.org/ml/archives/dev/2016-October/048873.html

Unfortunatly the patchset is quite big, but I did not manage to
find a shorter solution. The first patches add some mbuf helpers
that are used in virtio in the last patch.

This last patch adds a zone for each tx ring entry where headers
can be copied, patched, and referenced by virtio descriptors in
case the mbuf is read-only. If its not the case, the mbuf is
modified as before.

I tested with the same test plan than the one described in
http://dpdk.org/ml/archives/dev/2016-October/048092.html
(only the TSO test case).

I also replayed the test with the following patches to validate
the code path for:

- segmented packets (it forces a local copy in virtio_tso_fix_cksum)

--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -279,7 +279,7 @@ void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
                seg = seg->next;
        }
 
-       if (off + len <= rte_pktmbuf_data_len(seg))
+       if (0 && off + len <= rte_pktmbuf_data_len(seg))
                return rte_pktmbuf_mtod_offset(seg, char *, off);
 
        /* rare case: header is split among several segments */
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 9dc6f10..5a4312a 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1671,7 +1671,7 @@ void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
 static inline void *rte_pktmbuf_read(const struct rte_mbuf *m,
        uint32_t off, uint32_t len, void *buf)
 {
-       if (likely(off + len <= rte_pktmbuf_data_len(m)))
+       if (likely(0 && off + len <= rte_pktmbuf_data_len(m)))
                return rte_pktmbuf_mtod_offset(m, char *, off);
        else
                return __rte_pktmbuf_read(m, off, len, buf);

- and for shared mbuf (force to use the buffer in virtio tx ring)

--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -225,7 +225,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m, char *hdr, size_t hdr_sz)
        int shared = 0;
 
        /* mbuf is write-only, we need to copy the headers in a linear buffer */
-       if (unlikely(rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
+       if (unlikely(1 || rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
                shared = 1;
 
                /* network headers are too big, there's nothing we can do */


Olivier Matz (5):
  mbuf: remove const attribute in mbuf read function
  mbuf: new helper to check if a mbuf is shared
  mbuf: new helper to write data in a mbuf chain
  mbuf: new helper to copy data from a mbuf
  net/virtio: fix Tso when mbuf is shared

 app/test/test_mbuf.c                 |  62 +++++++++++++-
 drivers/net/virtio/virtio_rxtx.c     | 119 ++++++++++++++++++--------
 drivers/net/virtio/virtqueue.h       |   2 +
 lib/librte_mbuf/rte_mbuf.c           |  46 +++++++++-
 lib/librte_mbuf/rte_mbuf.h           | 157 ++++++++++++++++++++++++++++++++++-
 lib/librte_mbuf/rte_mbuf_version.map |   6 ++
 6 files changed, 347 insertions(+), 45 deletions(-)

-- 
2.8.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [dpdk-dev] [PATCH 1/5] mbuf: remove const attribute in mbuf read function
  2016-11-24  8:56 [dpdk-dev] [PATCH 0/5] virtio/mbuf: fix virtio tso with shared mbufs Olivier Matz
@ 2016-11-24  8:56 ` Olivier Matz
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 2/5] mbuf: new helper to check if a mbuf is shared Olivier Matz
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Olivier Matz @ 2016-11-24  8:56 UTC (permalink / raw)
  To: dev, yuanhan.liu; +Cc: maxime.coquelin, huawei.xie, stephen

There is no good reason to have this const attribute: rte_pktmbuf_read()
returns a pointer which is either in a private buffer, or in the mbuf.

In the first case, it is clearly not const. In the second case, it is up
to the user to check that the mbuf is not shared and that data can be
modified.

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 lib/librte_mbuf/rte_mbuf.c | 2 +-
 lib/librte_mbuf/rte_mbuf.h | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 63f43c8..b31958e 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -265,7 +265,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 }
 
 /* read len data bytes in a mbuf at specified offset (internal) */
-const void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
+void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
 	uint32_t len, void *buf)
 {
 	const struct rte_mbuf *seg = m;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index ead7c6e..14956f6 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1576,7 +1576,7 @@ static inline int rte_pktmbuf_is_contiguous(const struct rte_mbuf *m)
 /**
  * @internal used by rte_pktmbuf_read().
  */
-const void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
+void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
 	uint32_t len, void *buf);
 
 /**
@@ -1599,7 +1599,7 @@ const void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
  *   The pointer to the data, either in the mbuf if it is contiguous,
  *   or in the user buffer. If mbuf is too small, NULL is returned.
  */
-static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
+static inline void *rte_pktmbuf_read(const struct rte_mbuf *m,
 	uint32_t off, uint32_t len, void *buf)
 {
 	if (likely(off + len <= rte_pktmbuf_data_len(m)))
-- 
2.8.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [dpdk-dev] [PATCH 2/5] mbuf: new helper to check if a mbuf is shared
  2016-11-24  8:56 [dpdk-dev] [PATCH 0/5] virtio/mbuf: fix virtio tso with shared mbufs Olivier Matz
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 1/5] mbuf: remove const attribute in mbuf read function Olivier Matz
@ 2016-11-24  8:56 ` Olivier Matz
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 3/5] mbuf: new helper to write data in a mbuf chain Olivier Matz
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Olivier Matz @ 2016-11-24  8:56 UTC (permalink / raw)
  To: dev, yuanhan.liu; +Cc: maxime.coquelin, huawei.xie, stephen

Introduce 2 new helpers rte_pktmbuf_seg_is_shared() and
rte_pktmbuf_data_is_shared() to check if the packet data inside
a mbuf is shared (and shall not be modified).

To avoid a "discards const qualifier" error, add a const to the argument
of rte_mbuf_from_indirect().

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test/test_mbuf.c       | 34 +++++++++++++++++++---
 lib/librte_mbuf/rte_mbuf.h | 71 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index c0823ea..7656a4d 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -333,6 +333,7 @@ testclone_testupdate_testdetach(void)
 	struct rte_mbuf *m = NULL;
 	struct rte_mbuf *clone = NULL;
 	struct rte_mbuf *clone2 = NULL;
+	struct rte_mbuf *m2 = NULL;
 	unaligned_uint32_t *data;
 
 	/* alloc a mbuf */
@@ -384,6 +385,11 @@ testclone_testupdate_testdetach(void)
 	if (*data != MAGIC_DATA)
 		GOTO_FAIL("invalid data in clone->next\n");
 
+	if (rte_pktmbuf_seg_is_shared(m) == 0)
+		GOTO_FAIL("m should be marked as shared\n");
+	if (rte_pktmbuf_seg_is_shared(clone) == 0)
+		GOTO_FAIL("clone should be marked as shared\n");
+
 	if (rte_mbuf_refcnt_read(m) != 2)
 		GOTO_FAIL("invalid refcnt in m\n");
 
@@ -410,14 +416,32 @@ testclone_testupdate_testdetach(void)
 	if (rte_mbuf_refcnt_read(m->next) != 3)
 		GOTO_FAIL("invalid refcnt in m->next\n");
 
+	/* prepend data to one of the clone */
+	m2 = rte_pktmbuf_alloc(pktmbuf_pool);
+	if (m2 == NULL)
+		GOTO_FAIL("cannot allocate m2");
+	rte_pktmbuf_append(m2, sizeof(uint32_t));
+	rte_pktmbuf_chain(m2, clone);
+	clone = NULL;
+
+	if (rte_pktmbuf_data_is_shared(m2, 0, sizeof(uint32_t)))
+		GOTO_FAIL("m2 headers should not be marked as shared");
+	if (rte_pktmbuf_data_is_shared(m2, sizeof(uint32_t),
+			rte_pktmbuf_pkt_len(m2) - sizeof(uint32_t)) == 0)
+		GOTO_FAIL("m2 data should be marked as shared");
+
 	/* free mbuf */
 	rte_pktmbuf_free(m);
-	rte_pktmbuf_free(clone);
-	rte_pktmbuf_free(clone2);
-
 	m = NULL;
-	clone = NULL;
+	rte_pktmbuf_free(m2);
+	m2 = NULL;
+
+	if (rte_pktmbuf_seg_is_shared(clone2))
+		GOTO_FAIL("clone2 should not be marked as shared\n");
+
+	rte_pktmbuf_free(clone2);
 	clone2 = NULL;
+
 	printf("%s ok\n", __func__);
 	return 0;
 
@@ -428,6 +452,8 @@ testclone_testupdate_testdetach(void)
 		rte_pktmbuf_free(clone);
 	if (clone2)
 		rte_pktmbuf_free(clone2);
+	if (m2)
+		rte_pktmbuf_free(m2);
 	return -1;
 }
 
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 14956f6..cd77a56 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -576,7 +576,7 @@ rte_mbuf_data_dma_addr_default(const struct rte_mbuf *mb)
  *   The address of the direct mbuf corresponding to buffer_addr.
  */
 static inline struct rte_mbuf *
-rte_mbuf_from_indirect(struct rte_mbuf *mi)
+rte_mbuf_from_indirect(const struct rte_mbuf *mi)
 {
 	return (struct rte_mbuf *)RTE_PTR_SUB(mi->buf_addr, sizeof(*mi) + mi->priv_size);
 }
@@ -1574,6 +1574,75 @@ static inline int rte_pktmbuf_is_contiguous(const struct rte_mbuf *m)
 }
 
 /**
+ * Test if a mbuf segment is shared
+ *
+ * Return true if the data embedded in this segment is shared by several
+ * mbufs. In this case, the mbuf data should be considered as read-only.
+ *
+ * @param m
+ *   The packet mbuf.
+ * @return
+ *   - (1), the mbuf segment is shared (read-only)
+ *   - (0), the mbuf segment is not shared (writable)
+ */
+static inline int rte_pktmbuf_seg_is_shared(const struct rte_mbuf *m)
+{
+	if (rte_mbuf_refcnt_read(m) > 1)
+		return 1;
+
+	if (RTE_MBUF_INDIRECT(m) &&
+			rte_mbuf_refcnt_read(rte_mbuf_from_indirect(m)) > 1)
+		return 1;
+
+	return 0;
+}
+
+/**
+ * Test if some data in an mbuf chain is shared
+ *
+ * Return true if the specified data area in the mbuf chain is shared by
+ * several mbufs. In this case, this data should be considered as
+ * read-only.
+ *
+ * If the area described by off and len exceeds the bounds of the mbuf
+ * chain (off + len <= rte_pktmbuf_pkt_len()), the exceeding part of the
+ * area is ignored.
+ *
+ * @param m
+ *   The packet mbuf.
+ * @return
+ *   - (1), the mbuf data is shared (read-only)
+ *   - (0), the mbuf data is not shared (writable)
+ */
+static inline int rte_pktmbuf_data_is_shared(const struct rte_mbuf *m,
+	uint32_t off, uint32_t len)
+{
+	const struct rte_mbuf *seg = m;
+
+	if (likely(off + len <= rte_pktmbuf_data_len(seg)))
+		return rte_pktmbuf_seg_is_shared(seg);
+
+	while (seg->next && off >= rte_pktmbuf_data_len(seg)) {
+		off -= rte_pktmbuf_data_len(seg);
+		seg = seg->next;
+	}
+
+	if (off + len <= rte_pktmbuf_data_len(seg))
+		return rte_pktmbuf_seg_is_shared(seg);
+
+	while (seg->next && len > 0) {
+		if (rte_pktmbuf_seg_is_shared(seg))
+			return 1;
+
+		len -= (rte_pktmbuf_data_len(seg) - off);
+		off = 0;
+		seg = seg->next;
+	}
+
+	return 0;
+}
+
+/**
  * @internal used by rte_pktmbuf_read().
  */
 void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
-- 
2.8.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [dpdk-dev] [PATCH 3/5] mbuf: new helper to write data in a mbuf chain
  2016-11-24  8:56 [dpdk-dev] [PATCH 0/5] virtio/mbuf: fix virtio tso with shared mbufs Olivier Matz
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 1/5] mbuf: remove const attribute in mbuf read function Olivier Matz
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 2/5] mbuf: new helper to check if a mbuf is shared Olivier Matz
@ 2016-11-24  8:56 ` Olivier Matz
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 4/5] mbuf: new helper to copy data from a mbuf Olivier Matz
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared Olivier Matz
  4 siblings, 0 replies; 14+ messages in thread
From: Olivier Matz @ 2016-11-24  8:56 UTC (permalink / raw)
  To: dev, yuanhan.liu; +Cc: maxime.coquelin, huawei.xie, stephen

Introduce a new helper to write data in a chain of mbufs,
spreading it in the segments.

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test/test_mbuf.c                 | 21 +++++++++++++++
 lib/librte_mbuf/rte_mbuf.c           | 44 +++++++++++++++++++++++++++++++
 lib/librte_mbuf/rte_mbuf.h           | 50 ++++++++++++++++++++++++++++++++++++
 lib/librte_mbuf/rte_mbuf_version.map |  6 +++++
 4 files changed, 121 insertions(+)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 7656a4d..5f1bc5d 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -335,6 +335,10 @@ testclone_testupdate_testdetach(void)
 	struct rte_mbuf *clone2 = NULL;
 	struct rte_mbuf *m2 = NULL;
 	unaligned_uint32_t *data;
+	uint32_t magic = MAGIC_DATA;
+	uint32_t check_data[2];
+
+	memset(check_data, 0, sizeof(check_data));
 
 	/* alloc a mbuf */
 	m = rte_pktmbuf_alloc(pktmbuf_pool);
@@ -421,6 +425,8 @@ testclone_testupdate_testdetach(void)
 	if (m2 == NULL)
 		GOTO_FAIL("cannot allocate m2");
 	rte_pktmbuf_append(m2, sizeof(uint32_t));
+	if (rte_pktmbuf_write(m2, 0, sizeof(uint32_t), &magic) < 0)
+		GOTO_FAIL("cannot write data in m2");
 	rte_pktmbuf_chain(m2, clone);
 	clone = NULL;
 
@@ -430,6 +436,21 @@ testclone_testupdate_testdetach(void)
 			rte_pktmbuf_pkt_len(m2) - sizeof(uint32_t)) == 0)
 		GOTO_FAIL("m2 data should be marked as shared");
 
+	/* check data content */
+	data = rte_pktmbuf_read(m2, 0, sizeof(uint32_t), check_data);
+	if (data == NULL)
+		GOTO_FAIL("cannot read data");
+	if (*data != MAGIC_DATA)
+		GOTO_FAIL("invalid data");
+	if (data == check_data)
+		GOTO_FAIL("data should not have been copied");
+	data = rte_pktmbuf_read(m2, 0, sizeof(uint32_t) * 2, check_data);
+	if (data == NULL)
+		GOTO_FAIL("cannot read data");
+	if (data[0] != MAGIC_DATA || data[1] != MAGIC_DATA)
+		GOTO_FAIL("invalid data");
+	if (data != check_data)
+		GOTO_FAIL("data should have been copied");
 	/* free mbuf */
 	rte_pktmbuf_free(m);
 	m = NULL;
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index b31958e..ed56193 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -298,6 +298,50 @@ void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
 	return buf;
 }
 
+/* write len data bytes in a mbuf at specified offset (internal) */
+int
+__rte_pktmbuf_write(const struct rte_mbuf *m, uint32_t off,
+	uint32_t len, const void *buf)
+{
+	const struct rte_mbuf *seg = m;
+	uint32_t buf_off = 0, copy_len;
+	char *dst;
+
+	if (off + len > rte_pktmbuf_pkt_len(m))
+		return -1;
+
+	while (off >= rte_pktmbuf_data_len(seg)) {
+		off -= rte_pktmbuf_data_len(seg);
+		seg = seg->next;
+	}
+
+	dst = rte_pktmbuf_mtod_offset(seg, char *, off);
+	if (buf == dst)
+		return 0;
+
+	if (off + len <= rte_pktmbuf_data_len(seg)) {
+		RTE_ASSERT(!rte_pktmbuf_is_shared(seg));
+		rte_memcpy(dst, buf, len);
+		return 0;
+	}
+
+	/* copy data in several segments */
+	while (len > 0) {
+		RTE_ASSERT(!rte_pktmbuf_is_shared(seg));
+		copy_len = rte_pktmbuf_data_len(seg) - off;
+		if (copy_len > len)
+			copy_len = len;
+		dst = rte_pktmbuf_mtod_offset(seg, char *, off);
+		rte_memcpy(dst, (const char *)buf + buf_off, copy_len);
+		off = 0;
+		buf_off += copy_len;
+		len -= copy_len;
+		seg = seg->next;
+	}
+
+	return 0;
+}
+
 /*
  * Get the name of a RX offload flag. Must be kept synchronized with flag
  * definitions in rte_mbuf.h.
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index cd77a56..e898d25 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1678,6 +1678,56 @@ static inline void *rte_pktmbuf_read(const struct rte_mbuf *m,
 }
 
 /**
+ * @internal used by rte_pktmbuf_write().
+ */
+int __rte_pktmbuf_write(const struct rte_mbuf *m, uint32_t off,
+	uint32_t len, const void *buf);
+
+/**
+ * Write len data bytes in a mbuf at specified offset.
+ *
+ * If the mbuf is contiguous between off and off+len, rte_memcpy() is
+ * called. Else, it will split the data in the segments.
+ *
+ * The caller must ensure that all destination segments are writable
+ * (not shared).
+ *
+ * If the destination pointer in the mbuf is the same than the source
+ * buffer, the function do nothing and is successful.
+ *
+ * If the mbuf is too small, the function fails.
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param off
+ *   The offset of the data in the mbuf.
+ * @param len
+ *   The amount of bytes to read.
+ * @param buf
+ *   The buffer where data is copied if it is not contiguous in mbuf
+ *   data. Its length should be at least equal to the len parameter.
+ * @return
+ *   - (0) on success (data is updated in the mbuf)
+ *   - (-1) on error: mbuf is too small or not writable
+ */
+static inline int rte_pktmbuf_write(const struct rte_mbuf *m,
+	uint32_t off, uint32_t len, const void *buf)
+{
+	char *dst = rte_pktmbuf_mtod_offset(m, char *, off);
+
+	if (buf == dst)
+		return 0;
+
+	if (off + len <= rte_pktmbuf_data_len(m)) {
+		RTE_ASSERT(!rte_pktmbuf_is_shared(m));
+		rte_memcpy(dst, buf, len);
+		return 0;
+	}
+
+	return __rte_pktmbuf_write(m, off, len, buf);
+}
+
+/**
  * Chain an mbuf to another, thereby creating a segmented packet.
  *
  * Note: The implementation will do a linear walk over the segments to find
diff --git a/lib/librte_mbuf/rte_mbuf_version.map b/lib/librte_mbuf/rte_mbuf_version.map
index 6e2ea84..b2c057e 100644
--- a/lib/librte_mbuf/rte_mbuf_version.map
+++ b/lib/librte_mbuf/rte_mbuf_version.map
@@ -35,3 +35,9 @@ DPDK_16.11 {
 	rte_get_tx_ol_flag_list;
 
 } DPDK_2.1;
+
+DPDK_17.02 {
+	global:
+
+	__rte_pktmbuf_write;
+} DPDK_16.11;
-- 
2.8.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [dpdk-dev] [PATCH 4/5] mbuf: new helper to copy data from a mbuf
  2016-11-24  8:56 [dpdk-dev] [PATCH 0/5] virtio/mbuf: fix virtio tso with shared mbufs Olivier Matz
                   ` (2 preceding siblings ...)
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 3/5] mbuf: new helper to write data in a mbuf chain Olivier Matz
@ 2016-11-24  8:56 ` Olivier Matz
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared Olivier Matz
  4 siblings, 0 replies; 14+ messages in thread
From: Olivier Matz @ 2016-11-24  8:56 UTC (permalink / raw)
  To: dev, yuanhan.liu; +Cc: maxime.coquelin, huawei.xie, stephen

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test/test_mbuf.c       |  7 +++++++
 lib/librte_mbuf/rte_mbuf.h | 32 +++++++++++++++++++++++++++++++-
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 5f1bc5d..73fd7df 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -451,6 +451,13 @@ testclone_testupdate_testdetach(void)
 		GOTO_FAIL("invalid data");
 	if (data != check_data)
 		GOTO_FAIL("data should have been copied");
+	if (rte_pktmbuf_read_copy(m2, 0, sizeof(uint32_t), check_data) < 0)
+		GOTO_FAIL("cannot copy data");
+	if (check_data[0] != MAGIC_DATA)
+		GOTO_FAIL("invalid data");
+	if (data != check_data)
+		GOTO_FAIL("data should have been copied");
+
 	/* free mbuf */
 	rte_pktmbuf_free(m);
 	m = NULL;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index e898d25..edae89f 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1643,7 +1643,7 @@ static inline int rte_pktmbuf_data_is_shared(const struct rte_mbuf *m,
 }
 
 /**
- * @internal used by rte_pktmbuf_read().
+ * @internal used by rte_pktmbuf_read() and rte_pktmbuf_read_copy().
  */
 void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
 	uint32_t len, void *buf);
@@ -1728,6 +1728,36 @@ static inline int rte_pktmbuf_write(const struct rte_mbuf *m,
 }
 
 /**
+ * Copy data from a mbuf into a linear buffer
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param off
+ *   The offset of the data in the mbuf.
+ * @param len
+ *   The amount of bytes to copy.
+ * @param buf
+ *   The buffer where data is copied, it should be at least
+ *   as large as len.
+ * @return
+ *   - (0) on success
+ *   - (-1) on error: mbuf is too small
+ */
+static inline int rte_pktmbuf_read_copy(const struct rte_mbuf *m,
+	uint32_t off, uint32_t len, void *buf)
+{
+	if (likely(off + len <= rte_pktmbuf_data_len(m))) {
+		rte_memcpy(buf, rte_pktmbuf_mtod_offset(m, char *, off), len);
+		return 0;
+	}
+
+	if (__rte_pktmbuf_read(m, off, len, buf) == NULL)
+		return -1;
+
+	return 0;
+}
+
+/**
  * Chain an mbuf to another, thereby creating a segmented packet.
  *
  * Note: The implementation will do a linear walk over the segments to find
-- 
2.8.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared
  2016-11-24  8:56 [dpdk-dev] [PATCH 0/5] virtio/mbuf: fix virtio tso with shared mbufs Olivier Matz
                   ` (3 preceding siblings ...)
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 4/5] mbuf: new helper to copy data from a mbuf Olivier Matz
@ 2016-11-24  8:56 ` Olivier Matz
  2016-12-14  7:27   ` Yuanhan Liu
  2017-01-09 17:59   ` Stephen Hemminger
  4 siblings, 2 replies; 14+ messages in thread
From: Olivier Matz @ 2016-11-24  8:56 UTC (permalink / raw)
  To: dev, yuanhan.liu; +Cc: maxime.coquelin, huawei.xie, stephen

With virtio, doing tso requires to modify the network
packet data:
- the dpdk API requires to set the l4 checksum to an
  Intel-Nic-like pseudo header checksum that does
  not include the ip length
- the virtio peer expects that the l4 checksum is
  a standard pseudo header checksum.

This is a problem with shared packets, because they
should not be modified.

This patch fixes this issue by copying the headers into
a linear buffer in that case. This buffer is located in
the virtio_tx_region, at the same place where the
virtio header is stored.

The size of this buffer is set to 256, which should
be enough in all cases:
  sizeof(ethernet) + sizeof(vlan) * 2 + sizeof(ip6)
    sizeof(ip6-ext) + sizeof(tcp) + sizeof(tcp-opts)
  = 14 + 8 + 40 + sizeof(ip6-ext) + 40 + sizeof(tcp-opts)
  = 102 + sizeof(ip6-ext) + sizeof(tcp-opts)

Fixes: 696573046e9e ("net/virtio: support TSO")

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 drivers/net/virtio/virtio_rxtx.c | 119 +++++++++++++++++++++++++++------------
 drivers/net/virtio/virtqueue.h   |   2 +
 2 files changed, 85 insertions(+), 36 deletions(-)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 22d97a4..577c775 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -211,43 +211,73 @@ virtqueue_enqueue_recv_refill(struct virtqueue *vq, struct rte_mbuf *cookie)
 
 /* When doing TSO, the IP length is not included in the pseudo header
  * checksum of the packet given to the PMD, but for virtio it is
- * expected.
+ * expected. Fix the mbuf or a copy if the mbuf is shared.
  */
-static void
-virtio_tso_fix_cksum(struct rte_mbuf *m)
+static unsigned int
+virtio_tso_fix_cksum(struct rte_mbuf *m, char *hdr, size_t hdr_sz)
 {
-	/* common case: header is not fragmented */
-	if (likely(rte_pktmbuf_data_len(m) >= m->l2_len + m->l3_len +
-			m->l4_len)) {
-		struct ipv4_hdr *iph;
-		struct ipv6_hdr *ip6h;
-		struct tcp_hdr *th;
-		uint16_t prev_cksum, new_cksum, ip_len, ip_paylen;
-		uint32_t tmp;
-
-		iph = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
-		th = RTE_PTR_ADD(iph, m->l3_len);
-		if ((iph->version_ihl >> 4) == 4) {
-			iph->hdr_checksum = 0;
-			iph->hdr_checksum = rte_ipv4_cksum(iph);
-			ip_len = iph->total_length;
-			ip_paylen = rte_cpu_to_be_16(rte_be_to_cpu_16(ip_len) -
-				m->l3_len);
-		} else {
-			ip6h = (struct ipv6_hdr *)iph;
-			ip_paylen = ip6h->payload_len;
+	struct ipv4_hdr *iph, iph_copy;
+	struct ipv6_hdr *ip6h = NULL, ip6h_copy;
+	struct tcp_hdr *th, th_copy;
+	size_t hdrlen = m->l2_len + m->l3_len + m->l4_len;
+	uint16_t prev_cksum, new_cksum, ip_len, ip_paylen;
+	uint32_t tmp;
+	int shared = 0;
+
+	/* mbuf is write-only, we need to copy the headers in a linear buffer */
+	if (unlikely(rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
+		shared = 1;
+
+		/* network headers are too big, there's nothing we can do */
+		if (hdrlen > hdr_sz)
+			return 0;
+
+		rte_pktmbuf_read_copy(m, 0, hdrlen, hdr);
+		iph = (struct ipv4_hdr *)(hdr + m->l2_len);
+		ip6h = (struct ipv6_hdr *)(hdr + m->l2_len);
+		th = (struct tcp_hdr *)(hdr + m->l2_len + m->l3_len);
+	} else {
+		iph = rte_pktmbuf_read(m, m->l2_len, sizeof(*iph), &iph_copy);
+		th = rte_pktmbuf_read(m, m->l2_len + m->l3_len, sizeof(*th),
+			&th_copy);
+	}
+
+	if ((iph->version_ihl >> 4) == 4) {
+		iph->hdr_checksum = 0;
+		iph->hdr_checksum = rte_ipv4_cksum(iph);
+		ip_len = iph->total_length;
+		ip_paylen = rte_cpu_to_be_16(rte_be_to_cpu_16(ip_len) -
+			m->l3_len);
+	} else {
+		if (!shared) {
+			ip6h = rte_pktmbuf_read(m, m->l2_len, sizeof(*ip6h),
+				&ip6h_copy);
 		}
+		ip_paylen = ip6h->payload_len;
+	}
 
-		/* calculate the new phdr checksum not including ip_paylen */
-		prev_cksum = th->cksum;
-		tmp = prev_cksum;
-		tmp += ip_paylen;
-		tmp = (tmp & 0xffff) + (tmp >> 16);
-		new_cksum = tmp;
+	/* calculate the new phdr checksum not including ip_paylen */
+	prev_cksum = th->cksum;
+	tmp = prev_cksum;
+	tmp += ip_paylen;
+	tmp = (tmp & 0xffff) + (tmp >> 16);
+	new_cksum = tmp;
 
-		/* replace it in the packet */
-		th->cksum = new_cksum;
-	}
+	/* replace it in the header */
+	th->cksum = new_cksum;
+
+	/* the update was done in the linear buffer, return */
+	if (shared)
+		return hdrlen;
+
+	/* copy from local buffer into mbuf if required */
+	if ((iph->version_ihl >> 4) == 4)
+		rte_pktmbuf_write(m, m->l2_len, sizeof(*iph), iph);
+	else
+		rte_pktmbuf_write(m, m->l2_len, sizeof(*ip6h), ip6h);
+	rte_pktmbuf_write(m, m->l2_len + m->l3_len, sizeof(*th), th);
+
+	return 0;
 }
 
 static inline int
@@ -268,7 +298,9 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 	struct vring_desc *start_dp;
 	uint16_t seg_num = cookie->nb_segs;
 	uint16_t head_idx, idx;
+	uint16_t hdr_idx = 0;
 	uint16_t head_size = vq->hw->vtnet_hdr_size;
+	unsigned int offset = 0;
 	struct virtio_net_hdr *hdr;
 	int offload;
 
@@ -303,6 +335,8 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 
 		/* loop below will fill in rest of the indirect elements */
 		start_dp = txr[idx].tx_indir;
+		hdr_idx = 0;
+		start_dp[hdr_idx].len = vq->hw->vtnet_hdr_size;
 		idx = 1;
 	} else {
 		/* setup first tx ring slot to point to header
@@ -313,7 +347,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
 		start_dp[idx].flags = VRING_DESC_F_NEXT;
 		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
-
+		hdr_idx = idx;
 		idx = start_dp[idx].next;
 	}
 
@@ -345,7 +379,14 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 
 		/* TCP Segmentation Offload */
 		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
-			virtio_tso_fix_cksum(cookie);
+			offset = virtio_tso_fix_cksum(cookie,
+				RTE_PTR_ADD(hdr, start_dp[hdr_idx].len),
+				VIRTIO_MAX_HDR_SZ);
+			if (offset > 0) {
+				RTE_ASSERT(can_push != 0);
+				start_dp[hdr_idx].len += offset;
+			}
+
 			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
 				VIRTIO_NET_HDR_GSO_TCPV6 :
 				VIRTIO_NET_HDR_GSO_TCPV4;
@@ -362,10 +403,16 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 	}
 
 	do {
-		start_dp[idx].addr  = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
-		start_dp[idx].len   = cookie->data_len;
+		if (offset > cookie->data_len) {
+			offset -= cookie->data_len;
+			continue;
+		}
+		start_dp[idx].addr  = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq) +
+			offset;
+		start_dp[idx].len   = cookie->data_len - offset;
 		start_dp[idx].flags = cookie->next ? VRING_DESC_F_NEXT : 0;
 		idx = start_dp[idx].next;
+		offset = 0;
 	} while ((cookie = cookie->next) != NULL);
 
 	if (use_indirect)
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index f0bb089..edfe0dd 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -254,8 +254,10 @@ struct virtio_net_hdr_mrg_rxbuf {
 
 /* Region reserved to allow for transmit header and indirect ring */
 #define VIRTIO_MAX_TX_INDIRECT 8
+#define VIRTIO_MAX_HDR_SZ 256
 struct virtio_tx_region {
 	struct virtio_net_hdr_mrg_rxbuf tx_hdr;
+	char net_headers[VIRTIO_MAX_HDR_SZ]; /* for offload if mbuf is RO */
 	struct vring_desc tx_indir[VIRTIO_MAX_TX_INDIRECT]
 			   __attribute__((__aligned__(16)));
 };
-- 
2.8.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared Olivier Matz
@ 2016-12-14  7:27   ` Yuanhan Liu
  2017-01-09 17:46     ` Olivier Matz
  2017-01-09 17:59   ` Stephen Hemminger
  1 sibling, 1 reply; 14+ messages in thread
From: Yuanhan Liu @ 2016-12-14  7:27 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, maxime.coquelin, huawei.xie, stephen

Hi Olivier,

Firstly sorry for late response!

On Thu, Nov 24, 2016 at 09:56:38AM +0100, Olivier Matz wrote:
> With virtio, doing tso requires to modify the network
> packet data:

I thought more about it this time, and I'm wondering why it's needed.

> - the dpdk API requires to set the l4 checksum to an
>   Intel-Nic-like pseudo header checksum that does
>   not include the ip length

If the packet is for a NIC pmd driver in the end, then the NIC driver
(or application) would handle the checksum correctly.  You could check
the tx_prep patchset for example.

> - the virtio peer expects that the l4 checksum is
>   a standard pseudo header checksum.

For this case, the checksum is then not needed: we could assume the data
between virtio to virtio transmission on the same host is always valid,
that checksum validation is unnecessary.

So, in either case, it doesn't seem to me we have to generate the checksum
here. Or am I miss something?

OTOH, even if it does, I still see some issues (see below).

>  		/* TCP Segmentation Offload */
>  		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
> -			virtio_tso_fix_cksum(cookie);
> +			offset = virtio_tso_fix_cksum(cookie,
> +				RTE_PTR_ADD(hdr, start_dp[hdr_idx].len),
> +				VIRTIO_MAX_HDR_SZ);
> +			if (offset > 0) {
> +				RTE_ASSERT(can_push != 0);

I think it's (can_push == 0) ?

> +				start_dp[hdr_idx].len += offset;

Actually, there is an assumption if you do this, that the backend driver
must have to support ANY_LAYOUT. Otherwise, it won't work: the driver
would expect the header and packet data is totally separated into two
desc buffers.

Though the assumption is most likely true in nowadays, I don't think
it's a guarantee.

	--yliu

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared
  2016-12-14  7:27   ` Yuanhan Liu
@ 2017-01-09 17:46     ` Olivier Matz
  2017-01-16  6:48       ` Yuanhan Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Olivier Matz @ 2017-01-09 17:46 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, maxime.coquelin, huawei.xie, stephen, Tan, Jianfeng

Hi Yuanhan,

On Wed, 14 Dec 2016 15:27:50 +0800, Yuanhan Liu
<yuanhan.liu@linux.intel.com> wrote:
> Firstly sorry for late response!

No problem, I fully understand ;)

> On Thu, Nov 24, 2016 at 09:56:38AM +0100, Olivier Matz wrote:
> > With virtio, doing tso requires to modify the network
> > packet data:  
> 
> I thought more about it this time, and I'm wondering why it's needed.
> 
> > - the dpdk API requires to set the l4 checksum to an
> >   Intel-Nic-like pseudo header checksum that does
> >   not include the ip length  
> 
> If the packet is for a NIC pmd driver in the end, then the NIC driver
> (or application) would handle the checksum correctly.  You could check
> the tx_prep patchset for example.
> 
> > - the virtio peer expects that the l4 checksum is
> >   a standard pseudo header checksum.  
> 
> For this case, the checksum is then not needed: we could assume the
> data between virtio to virtio transmission on the same host is always
> valid, that checksum validation is unnecessary.
> 
> So, in either case, it doesn't seem to me we have to generate the
> checksum here. Or am I miss something?

The virtio specifications requires that the L4 checksum is set to the
pseudo header checksum. You can search for "pseudo header" in the
following doc:
http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1.0-cs04.pdf

Especially in 5.1.6.2.1, we can see that if we use the csum flag, we
must set the checksum to phdr, and if we do tso, we must set the csum
flag.

We can check that this is really needed with Linux vhost by replaying
the test plan described at [1].

[1] http://dpdk.org/ml/archives/dev/2016-October/048793.html

If we add the following patch to disable the checksum fix (on top of
this patchset), the test1 "large packets (lro/tso)" won't work.

--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -224,6 +224,9 @@
        uint32_t tmp;
        int shared = 0;
 
+        if (1)
+               return 0;
+
        /* mbuf is write-only, we need to copy the headers in a linear
buffer */ if (unlikely(rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
                shared = 1;


In one direction ("flow1" in the test desc), large packets are
transmitted from host on the ixgbe interface, and received by the
guest. Then, testpmd bridges the packet to the virtio interface. But
the packet is not received by the host.

> 
> OTOH, even if it does, I still see some issues (see below).
> 
> >  		/* TCP Segmentation Offload */
> >  		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
> > -			virtio_tso_fix_cksum(cookie);
> > +			offset = virtio_tso_fix_cksum(cookie,
> > +				RTE_PTR_ADD(hdr,
> > start_dp[hdr_idx].len),
> > +				VIRTIO_MAX_HDR_SZ);
> > +			if (offset > 0) {
> > +				RTE_ASSERT(can_push != 0);  
> 
> I think it's (can_push == 0) ?
> 

Yes, indeed. I'll fix that in next version.

> > +				start_dp[hdr_idx].len += offset;  
> 
> Actually, there is an assumption if you do this, that the backend
> driver must have to support ANY_LAYOUT. Otherwise, it won't work: the
> driver would expect the header and packet data is totally separated
> into two desc buffers.
> 
> Though the assumption is most likely true in nowadays, I don't think
> it's a guarantee.

Right.

There are at least 2 options for this one:

- try to use 2 different descriptors (the patch is probably harder,
  and it may slow-down the case where ANY_LAYOUT is supported)

- refuse to initialize with TSO enabled if ANY_LAYOUT is not supported.

If you think ANY_LAYOUT is most likely true today, we could choose
option 2. Let me know what's your preference here.

Thank you for the review.

Olivier

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared
  2016-11-24  8:56 ` [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared Olivier Matz
  2016-12-14  7:27   ` Yuanhan Liu
@ 2017-01-09 17:59   ` Stephen Hemminger
  1 sibling, 0 replies; 14+ messages in thread
From: Stephen Hemminger @ 2017-01-09 17:59 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, yuanhan.liu, maxime.coquelin, huawei.xie

On Thu, 24 Nov 2016 09:56:38 +0100
Olivier Matz <olivier.matz@6wind.com> wrote:

> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
> index 22d97a4..577c775 100644
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -211,43 +211,73 @@ virtqueue_enqueue_recv_refill(struct virtqueue *vq, struct rte_mbuf *cookie)
>  
>  /* When doing TSO, the IP length is not included in the pseudo header
>   * checksum of the packet given to the PMD, but for virtio it is
> - * expected.
> + * expected. Fix the mbuf or a copy if the mbuf is shared.
>   */
> -static void
> -virtio_tso_fix_cksum(struct rte_mbuf *m)
> +static unsigned int
> +virtio_tso_fix_cksum(struct rte_mbuf *m, char *hdr, size_t hdr_sz)
>  {
> -	/* common case: header is not fragmented */
> -	if (likely(rte_pktmbuf_data_len(m) >= m->l2_len + m->l3_len +
> -			m->l4_len)) {
> -		struct ipv4_hdr *iph;
> -		struct ipv6_hdr *ip6h;
> -		struct tcp_hdr *th;
> -		uint16_t prev_cksum, new_cksum, ip_len, ip_paylen;
> -		uint32_t tmp;
> -
> -		iph = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
> -		th = RTE_PTR_ADD(iph, m->l3_len);
> -		if ((iph->version_ihl >> 4) == 4) {
> -			iph->hdr_checksum = 0;
> -			iph->hdr_checksum = rte_ipv4_cksum(iph);
> -			ip_len = iph->total_length;
> -			ip_paylen = rte_cpu_to_be_16(rte_be_to_cpu_16(ip_len) -
> -				m->l3_len);
> -		} else {
> -			ip6h = (struct ipv6_hdr *)iph;
> -			ip_paylen = ip6h->payload_len;
> +	struct ipv4_hdr *iph, iph_copy;
> +	struct ipv6_hdr *ip6h = NULL, ip6h_copy;
> +	struct tcp_hdr *th, th_copy;
> +	size_t hdrlen = m->l2_len + m->l3_len + m->l4_len;
> +	uint16_t prev_cksum, new_cksum, ip_len, ip_paylen;
> +	uint32_t tmp;
> +	int shared = 0;

Minor nit, uses bool instead of int for shared?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared
  2017-01-09 17:46     ` Olivier Matz
@ 2017-01-16  6:48       ` Yuanhan Liu
  2017-01-17 11:18         ` Olivier Matz
  0 siblings, 1 reply; 14+ messages in thread
From: Yuanhan Liu @ 2017-01-16  6:48 UTC (permalink / raw)
  To: Olivier Matz
  Cc: dev, maxime.coquelin, huawei.xie, stephen, Tan, Jianfeng,
	Tomasz Kulasek, Thomas Monjalon, Konstantin Ananyev

On Mon, Jan 09, 2017 at 06:46:25PM +0100, Olivier Matz wrote:
> The virtio specifications requires that the L4 checksum is set to the
> pseudo header checksum. You can search for "pseudo header" in the
> following doc:
> http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1.0-cs04.pdf
> 
> Especially in 5.1.6.2.1, we can see that if we use the csum flag, we
> must set the checksum to phdr, and if we do tso, we must set the csum
> flag.
> 
> We can check that this is really needed with Linux vhost by replaying
> the test plan described at [1].
> 
> [1] http://dpdk.org/ml/archives/dev/2016-October/048793.html
> 
> If we add the following patch to disable the checksum fix (on top of
> this patchset), the test1 "large packets (lro/tso)" won't work.
> 
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -224,6 +224,9 @@
>         uint32_t tmp;
>         int shared = 0;
>  
> +        if (1)
> +               return 0;
> +
>         /* mbuf is write-only, we need to copy the headers in a linear
> buffer */ if (unlikely(rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
>                 shared = 1;
> 
> 
> In one direction ("flow1" in the test desc), large packets are
> transmitted from host on the ixgbe interface, and received by the
> guest. Then, testpmd bridges the packet to the virtio interface. But
> the packet is not received by the host.

I hope I could have time to dig this further, since, honestly, I don't
quite like this patch: it makes things un-maintainable.

Besides that, I think we have similar issue with nic drivers. See the
rte_net_intel_cksum_flags_prepare() function introduced at commit
4fb7e803eb1a ("ethdev: add Tx preparation").

Cc more people here. And here is a quick background for them: NIC drivers
doing TSO need change the mbuf (say, for cksum updating), however, as
Stephen pointed out, we could not do that if the mbuf is shared: I don't
see such checks in the driver code as well.

> There are at least 2 options for this one:
> 
> - try to use 2 different descriptors (the patch is probably harder,
>   and it may slow-down the case where ANY_LAYOUT is supported)
> 
> - refuse to initialize with TSO enabled if ANY_LAYOUT is not supported.
> 
> If you think ANY_LAYOUT is most likely true today, we could choose
> option 2. Let me know what's your preference here.

Maybe we could go with a simpler one: COW. Yeah, it costs more, but this
would be rare, that it should be OKay, right? Besides, we just need copy
the heading mbuf.

	--yliu

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared
  2017-01-16  6:48       ` Yuanhan Liu
@ 2017-01-17 11:18         ` Olivier Matz
  2017-01-18  5:03           ` Yuanhan Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Olivier Matz @ 2017-01-17 11:18 UTC (permalink / raw)
  To: Yuanhan Liu
  Cc: dev, maxime.coquelin, huawei.xie, stephen, Tan, Jianfeng,
	Tomasz Kulasek, Thomas Monjalon, Konstantin Ananyev

Hi Yuanhan,

On Mon, 16 Jan 2017 14:48:19 +0800, Yuanhan Liu
<yuanhan.liu@linux.intel.com> wrote:
> On Mon, Jan 09, 2017 at 06:46:25PM +0100, Olivier Matz wrote:
> > The virtio specifications requires that the L4 checksum is set to
> > the pseudo header checksum. You can search for "pseudo header" in
> > the following doc:
> > http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1.0-cs04.pdf
> > 
> > Especially in 5.1.6.2.1, we can see that if we use the csum flag, we
> > must set the checksum to phdr, and if we do tso, we must set the
> > csum flag.
> > 
> > We can check that this is really needed with Linux vhost by
> > replaying the test plan described at [1].
> > 
> > [1] http://dpdk.org/ml/archives/dev/2016-October/048793.html
> > 
> > If we add the following patch to disable the checksum fix (on top of
> > this patchset), the test1 "large packets (lro/tso)" won't work.
> > 
> > --- a/drivers/net/virtio/virtio_rxtx.c
> > +++ b/drivers/net/virtio/virtio_rxtx.c
> > @@ -224,6 +224,9 @@
> >         uint32_t tmp;
> >         int shared = 0;
> >  
> > +        if (1)
> > +               return 0;
> > +
> >         /* mbuf is write-only, we need to copy the headers in a
> > linear buffer */ if (unlikely(rte_pktmbuf_data_is_shared(m, 0,
> > hdrlen))) { shared = 1;
> > 
> > 
> > In one direction ("flow1" in the test desc), large packets are
> > transmitted from host on the ixgbe interface, and received by the
> > guest. Then, testpmd bridges the packet to the virtio interface. But
> > the packet is not received by the host.  
> 
> I hope I could have time to dig this further, since, honestly, I don't
> quite like this patch: it makes things un-maintainable.

Well, I'm not that proud of the patch, but that's the best solution
I've found. Nevertheless saying it makes things un-maintainable looks a
bit excessive to me :)

The option of reallocating a mbuf, copy and fix network headers in it
looks even more complex to me (that was my first approach).

> Besides that, I think we have similar issue with nic drivers. See the
> rte_net_intel_cksum_flags_prepare() function introduced at commit
> 4fb7e803eb1a ("ethdev: add Tx preparation").

Yes, that was discussed a bit. See [1] and the subsequent mails.
http://dpdk.org/ml/archives/dev/2016-December/051014.html

My opinion is that tx_burst() should not change the mbuf data, it's
always been like this. For Intel NICs, there is no issue since the DPDK
API is derived from Intel NICs API, so there is no fix to do in the
mbuf data.

For tx_prepare(), it's explicitly said that it can update the data.
If tx_prepare() becomes mandatory, it will naturally fix this issue
without modifying the driver, because the phdr csum calculation will be
done in tx_prepare().

An alternative is to mark this as a known issue for now, and wait until
tx_prepare() is mandatory.


> Cc more people here. And here is a quick background for them: NIC
> drivers doing TSO need change the mbuf (say, for cksum updating),
> however, as Stephen pointed out, we could not do that if the mbuf is
> shared: I don't see such checks in the driver code as well.
> 
> > There are at least 2 options for this one:
> > 
> > - try to use 2 different descriptors (the patch is probably harder,
> >   and it may slow-down the case where ANY_LAYOUT is supported)
> > 
> > - refuse to initialize with TSO enabled if ANY_LAYOUT is not
> > supported.
> > 
> > If you think ANY_LAYOUT is most likely true today, we could choose
> > option 2. Let me know what's your preference here.  
> 
> Maybe we could go with a simpler one: COW. Yeah, it costs more, but
> this would be rare, that it should be OKay, right? Besides, we just
> need copy the heading mbuf.

Could you detail what you mean by COW in this context? Do you mean
reallocating a new mbuf? If yes, it's not only a problem of cost:

- There is no mempool pointer associated to tx queues, so we cannot
  allocate a mbuf. Reusing a mempool pointer from the current mbuf looks
  risky, because it can be a special pool, like a pool dedicated for
  clones, without data.

- It makes allocation error quite hard to manage, it would require some
  rework in tx functions.


Thanks for your review and the discussion. Let me know what you think.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared
  2017-01-17 11:18         ` Olivier Matz
@ 2017-01-18  5:03           ` Yuanhan Liu
  2017-01-24 10:51             ` Olivier MATZ
  0 siblings, 1 reply; 14+ messages in thread
From: Yuanhan Liu @ 2017-01-18  5:03 UTC (permalink / raw)
  To: Olivier Matz
  Cc: dev, maxime.coquelin, huawei.xie, stephen, Tan, Jianfeng,
	Tomasz Kulasek, Thomas Monjalon, Konstantin Ananyev

On Tue, Jan 17, 2017 at 12:18:25PM +0100, Olivier Matz wrote:
> > I hope I could have time to dig this further, since, honestly, I don't
> > quite like this patch: it makes things un-maintainable.
> 
> Well, I'm not that proud of the patch, but that's the best solution
> I've found. Nevertheless saying it makes things un-maintainable looks a
> bit excessive to me :)

Aha... really sorry about that!

But honestly, I'd say again, it makes thing more complex, just for fixing
a corner and rare issue. I'd try to avoid that.

> 
> The option of reallocating a mbuf, copy and fix network headers in it
> looks even more complex to me (that was my first approach).
> 
> > Besides that, I think we have similar issue with nic drivers. See the
> > rte_net_intel_cksum_flags_prepare() function introduced at commit
> > 4fb7e803eb1a ("ethdev: add Tx preparation").
> 
> Yes, that was discussed a bit. See [1] and the subsequent mails.
> http://dpdk.org/ml/archives/dev/2016-December/051014.html

Thanks for the info, and I'm pretty Okay with that.

> My opinion is that tx_burst() should not change the mbuf data, it's
> always been like this. For Intel NICs, there is no issue since the DPDK
> API is derived from Intel NICs API, so there is no fix to do in the
> mbuf data.
> 
> For tx_prepare(), it's explicitly said that it can update the data.
> If tx_prepare() becomes mandatory, it will naturally fix this issue
> without modifying the driver, because the phdr csum calculation will be
> done in tx_prepare().
> 
> An alternative is to mark this as a known issue for now, and wait until
> tx_prepare() is mandatory.

I see no reason to wait. Though my understanding is it may not be a
mandatory so far, but user is supposed to calculate the pseudo-header
checksum by themself before. Now they have one more option: tx_prepare.

That means, in either way, user has to do some extra works to make
TSO work (either by themself or call tx_prepare). So I don't think
it'd be an issue?

	--yliu

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared
  2017-01-18  5:03           ` Yuanhan Liu
@ 2017-01-24 10:51             ` Olivier MATZ
  2017-01-28 12:32               ` Yuanhan Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Olivier MATZ @ 2017-01-24 10:51 UTC (permalink / raw)
  To: Yuanhan Liu
  Cc: Olivier Matz, dev, maxime.coquelin, huawei.xie, stephen, Tan,
	Jianfeng, Tomasz Kulasek, Thomas Monjalon, Konstantin Ananyev

On Wed, 18 Jan 2017 13:03:48 +0800, Yuanhan Liu
<yuanhan.liu@linux.intel.com> wrote:
> On Tue, Jan 17, 2017 at 12:18:25PM +0100, Olivier Matz wrote:
> > > I hope I could have time to dig this further, since, honestly, I
> > > don't quite like this patch: it makes things un-maintainable.  
> > 
> > Well, I'm not that proud of the patch, but that's the best solution
> > I've found. Nevertheless saying it makes things un-maintainable
> > looks a bit excessive to me :)  
> 
> Aha... really sorry about that!
> 
> But honestly, I'd say again, it makes thing more complex, just for
> fixing a corner and rare issue. I'd try to avoid that.
> 
> > 
> > The option of reallocating a mbuf, copy and fix network headers in
> > it looks even more complex to me (that was my first approach).
> >   
> > > Besides that, I think we have similar issue with nic drivers. See
> > > the rte_net_intel_cksum_flags_prepare() function introduced at
> > > commit 4fb7e803eb1a ("ethdev: add Tx preparation").  
> > 
> > Yes, that was discussed a bit. See [1] and the subsequent mails.
> > http://dpdk.org/ml/archives/dev/2016-December/051014.html  
> 
> Thanks for the info, and I'm pretty Okay with that.
> 
> > My opinion is that tx_burst() should not change the mbuf data, it's
> > always been like this. For Intel NICs, there is no issue since the
> > DPDK API is derived from Intel NICs API, so there is no fix to do
> > in the mbuf data.
> > 
> > For tx_prepare(), it's explicitly said that it can update the data.
> > If tx_prepare() becomes mandatory, it will naturally fix this issue
> > without modifying the driver, because the phdr csum calculation
> > will be done in tx_prepare().
> > 
> > An alternative is to mark this as a known issue for now, and wait
> > until tx_prepare() is mandatory.  
> 
> I see no reason to wait. Though my understanding is it may not be a
> mandatory so far, but user is supposed to calculate the pseudo-header
> checksum by themself before. Now they have one more option:
> tx_prepare.
>
> That means, in either way, user has to do some extra works to make
> TSO work (either by themself or call tx_prepare). So I don't think
> it'd be an issue?

Yes sounds good. I'll check how a tx_prepare() could be implemented for
virtio, in order to fix this issue.

In the meanwhile, for the 17.02, I think it could be good to highlight
the problem in the known issues, what do you think?


Thanks
Olivier

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared
  2017-01-24 10:51             ` Olivier MATZ
@ 2017-01-28 12:32               ` Yuanhan Liu
  0 siblings, 0 replies; 14+ messages in thread
From: Yuanhan Liu @ 2017-01-28 12:32 UTC (permalink / raw)
  To: Olivier MATZ
  Cc: dev, maxime.coquelin, huawei.xie, stephen, Tan, Jianfeng,
	Tomasz Kulasek, Thomas Monjalon, Konstantin Ananyev

On Tue, Jan 24, 2017 at 11:51:20AM +0100, Olivier MATZ wrote:
> On Wed, 18 Jan 2017 13:03:48 +0800, Yuanhan Liu
> <yuanhan.liu@linux.intel.com> wrote:
> > On Tue, Jan 17, 2017 at 12:18:25PM +0100, Olivier Matz wrote:
> > > > I hope I could have time to dig this further, since, honestly, I
> > > > don't quite like this patch: it makes things un-maintainable.  
> > > 
> > > Well, I'm not that proud of the patch, but that's the best solution
> > > I've found. Nevertheless saying it makes things un-maintainable
> > > looks a bit excessive to me :)  
> > 
> > Aha... really sorry about that!
> > 
> > But honestly, I'd say again, it makes thing more complex, just for
> > fixing a corner and rare issue. I'd try to avoid that.
> > 
> > > 
> > > The option of reallocating a mbuf, copy and fix network headers in
> > > it looks even more complex to me (that was my first approach).
> > >   
> > > > Besides that, I think we have similar issue with nic drivers. See
> > > > the rte_net_intel_cksum_flags_prepare() function introduced at
> > > > commit 4fb7e803eb1a ("ethdev: add Tx preparation").  
> > > 
> > > Yes, that was discussed a bit. See [1] and the subsequent mails.
> > > http://dpdk.org/ml/archives/dev/2016-December/051014.html  
> > 
> > Thanks for the info, and I'm pretty Okay with that.
> > 
> > > My opinion is that tx_burst() should not change the mbuf data, it's
> > > always been like this. For Intel NICs, there is no issue since the
> > > DPDK API is derived from Intel NICs API, so there is no fix to do
> > > in the mbuf data.
> > > 
> > > For tx_prepare(), it's explicitly said that it can update the data.
> > > If tx_prepare() becomes mandatory, it will naturally fix this issue
> > > without modifying the driver, because the phdr csum calculation
> > > will be done in tx_prepare().
> > > 
> > > An alternative is to mark this as a known issue for now, and wait
> > > until tx_prepare() is mandatory.  
> > 
> > I see no reason to wait. Though my understanding is it may not be a
> > mandatory so far, but user is supposed to calculate the pseudo-header
> > checksum by themself before. Now they have one more option:
> > tx_prepare.
> >
> > That means, in either way, user has to do some extra works to make
> > TSO work (either by themself or call tx_prepare). So I don't think
> > it'd be an issue?
> 
> Yes sounds good. I'll check how a tx_prepare() could be implemented for
> virtio, in order to fix this issue.

Thanks.

> In the meanwhile, for the 17.02, I think it could be good to highlight
> the problem in the known issues, what do you think?

Yes, I think it's a good idea.

	--yliu

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-01-28 12:29 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-24  8:56 [dpdk-dev] [PATCH 0/5] virtio/mbuf: fix virtio tso with shared mbufs Olivier Matz
2016-11-24  8:56 ` [dpdk-dev] [PATCH 1/5] mbuf: remove const attribute in mbuf read function Olivier Matz
2016-11-24  8:56 ` [dpdk-dev] [PATCH 2/5] mbuf: new helper to check if a mbuf is shared Olivier Matz
2016-11-24  8:56 ` [dpdk-dev] [PATCH 3/5] mbuf: new helper to write data in a mbuf chain Olivier Matz
2016-11-24  8:56 ` [dpdk-dev] [PATCH 4/5] mbuf: new helper to copy data from a mbuf Olivier Matz
2016-11-24  8:56 ` [dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared Olivier Matz
2016-12-14  7:27   ` Yuanhan Liu
2017-01-09 17:46     ` Olivier Matz
2017-01-16  6:48       ` Yuanhan Liu
2017-01-17 11:18         ` Olivier Matz
2017-01-18  5:03           ` Yuanhan Liu
2017-01-24 10:51             ` Olivier MATZ
2017-01-28 12:32               ` Yuanhan Liu
2017-01-09 17:59   ` Stephen Hemminger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).