* [dpdk-dev] [PATCH V1 0/3] Fix external mbuf free issue in GSO case
@ 2020-07-30 12:08 yang_y_yi
  2020-07-30 12:08 ` [dpdk-dev] [PATCH V1 1/3] gso: fix refcnt update issue for external mbuf yang_y_yi
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: yang_y_yi @ 2020-07-30 12:08 UTC (permalink / raw)
  To: dev; +Cc: jiayu.hu, thomas, yangyi01, yang_y_yi
From: Yi Yang <yangyi01@inspur.com>
It is impossible to fix external mbuf free issue in GSO case,
the issue is GSO case only can free buffer in external mbuf
but can't free this external mbuf because this is done by free_cb,
in order to fix it, free_cb interface has to been changed to
adapt to GSO case, this patch series are just to fix it
completely, OVS DPDK is typical consumer application.
Yi Yang (3):
  gso: fix refcnt update issue for external mbuf
  mbuf: change free_cb interface to adapt to GSO case
  vhost: use new free_cb interface to fix mbuf free issue
 app/test-compress-perf/comp_perf_test_common.c |  2 +-
 app/test/test_compressdev.c                    |  2 +-
 app/test/test_mbuf.c                           |  2 +-
 drivers/net/mlx5/mlx5_rxtx.c                   |  2 +-
 drivers/net/mlx5/mlx5_rxtx.h                   |  2 +-
 drivers/net/netvsc/hn_rxtx.c                   |  3 ++-
 lib/librte_gso/rte_gso.c                       |  5 ++++-
 lib/librte_mbuf/rte_mbuf.c                     |  4 ++--
 lib/librte_mbuf/rte_mbuf.h                     |  2 +-
 lib/librte_mbuf/rte_mbuf_core.h                |  2 +-
 lib/librte_vhost/virtio_net.c                  | 22 +++++++++++++++++++---
 11 files changed, 34 insertions(+), 14 deletions(-)
-- 
1.8.3.1
^ permalink raw reply	[flat|nested] 21+ messages in thread
* [dpdk-dev] [PATCH V1 1/3] gso: fix refcnt update issue for external mbuf
  2020-07-30 12:08 [dpdk-dev] [PATCH V1 0/3] Fix external mbuf free issue in GSO case yang_y_yi
@ 2020-07-30 12:08 ` yang_y_yi
  2020-07-30 12:08 ` [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case yang_y_yi
  2020-07-30 12:09 ` [dpdk-dev] [PATCH V1 3/3] vhost: use new free_cb interface to fix mbuf free issue yang_y_yi
  2 siblings, 0 replies; 21+ messages in thread
From: yang_y_yi @ 2020-07-30 12:08 UTC (permalink / raw)
  To: dev; +Cc: jiayu.hu, thomas, yangyi01, yang_y_yi
From: Yi Yang <yangyi01@inspur.com>
rte_gso_segment will segment original mbuf to multiple
small size mbufs, every mbuf of them has two segments,
the second segment will attach to original mbuf, if
original mbuf is external, rte_gso_segment should update
refcnt of mbuf->shinfo but not refcnt of mbuf. Otherwise,
original mbuf will be invalid on doing sanity check
because refcnt of mbuf is 0.
Fixes: 119583797b6a ("gso: support TCP/IPv4 GSO")
Signed-off-by: Yi Yang <yangyi01@inspur.com>
---
 lib/librte_gso/rte_gso.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/lib/librte_gso/rte_gso.c b/lib/librte_gso/rte_gso.c
index 751b5b6..b5e8b2a 100644
--- a/lib/librte_gso/rte_gso.c
+++ b/lib/librte_gso/rte_gso.c
@@ -83,7 +83,10 @@
 	if (ret > 1) {
 		pkt_seg = pkt;
 		while (pkt_seg) {
-			rte_mbuf_refcnt_update(pkt_seg, -1);
+			if (RTE_MBUF_HAS_EXTBUF(pkt_seg))
+				rte_mbuf_ext_refcnt_update(pkt_seg->shinfo, -1);
+			else
+				rte_mbuf_refcnt_update(pkt_seg, -1);
 			pkt_seg = pkt_seg->next;
 		}
 	} else if (ret < 0) {
-- 
1.8.3.1
^ permalink raw reply	[flat|nested] 21+ messages in thread
* [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-07-30 12:08 [dpdk-dev] [PATCH V1 0/3] Fix external mbuf free issue in GSO case yang_y_yi
  2020-07-30 12:08 ` [dpdk-dev] [PATCH V1 1/3] gso: fix refcnt update issue for external mbuf yang_y_yi
@ 2020-07-30 12:08 ` yang_y_yi
  2020-07-31 15:15   ` Olivier Matz
  2020-07-30 12:09 ` [dpdk-dev] [PATCH V1 3/3] vhost: use new free_cb interface to fix mbuf free issue yang_y_yi
  2 siblings, 1 reply; 21+ messages in thread
From: yang_y_yi @ 2020-07-30 12:08 UTC (permalink / raw)
  To: dev; +Cc: jiayu.hu, thomas, yangyi01, yang_y_yi
From: Yi Yang <yangyi01@inspur.com>
In GSO case, segmented mbufs are attached to original
mbuf which can't be freed when it is external. The issue
is free_cb doesn't know original mbuf and doesn't free
it when refcnt of shinfo is 0.
Original mbuf can be freed by rte_pktmbuf_free segmented
mbufs or by rte_pktmbuf_free original mbuf. Two kind of
cases should have different behaviors. free_cb won't
explicitly call rte_pktmbuf_free to free original mbuf
if it is freed by rte_pktmbuf_free original mbuf, but it
has to call rte_pktmbuf_free to free original mbuf if it
is freed by rte_pktmbuf_free segmented mbufs.
In order to fix this issue, free_cb interface has to been
changed, __rte_pktmbuf_free_extbuf must deliver called
mbuf pointer to free_cb, argument opaque can be defined
as a custom struct by user, it can includes original mbuf
pointer, user-defined free_cb can compare caller mbuf with
mbuf in opaque struct, free_cb should free original mbuf
if they are not same, this corresponds to rte_pktmbuf_free
segmented mbufs case, otherwise, free_cb won't free original
mbuf because the caller explicitly called rte_pktmbuf_free
to free it.
Here is pseduo code to show two kind of cases.
case 1. rte_pktmbuf_free segmented mbufs
nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
                      &gso_ctx,
                      /* segmented mbuf */
                      (struct rte_mbuf **)&gso_mbufs,
                      MAX_GSO_MBUFS);
rte_eth_tx_burst(dev->port_id, qid, gso_mbufs, nb_tx);
case 2. rte_pktmbuf_free original mbuf
rte_eth_tx_burst(dev->port_id, qid, original_mbuf, 1);
Here are user defined free_cb and opaque.
struct shinfo_arg {
	void *buf;
	struct rte_mbuf *mbuf;
};
custom_free_cb(struct rte_mbuf *caller_m, void *opaque)
{
	struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
	rte_free(arg->buf);
	if (caller_m != arg->mbuf)
		rte_pktmbuf_free(arg->mbuf);
	rte_free(arg);
}
struct shinfo_arg *arg = (struct shinfo_arg *)
rte_malloc(NULL, sizeof(struct shinfo_arg),
	   RTE_CACHE_LINE_SIZE);
arg->buf = buf;
arg->mbuf = pkt;
shinfo->free_cb = custom_free_cb;
shinfo->fcb_opaque = arg;
Signed-off-by: Yi Yang <yangyi01@inspur.com>
---
 app/test-compress-perf/comp_perf_test_common.c | 2 +-
 app/test/test_compressdev.c                    | 2 +-
 app/test/test_mbuf.c                           | 2 +-
 drivers/net/mlx5/mlx5_rxtx.c                   | 2 +-
 drivers/net/mlx5/mlx5_rxtx.h                   | 2 +-
 drivers/net/netvsc/hn_rxtx.c                   | 3 ++-
 lib/librte_mbuf/rte_mbuf.c                     | 4 ++--
 lib/librte_mbuf/rte_mbuf.h                     | 2 +-
 lib/librte_mbuf/rte_mbuf_core.h                | 2 +-
 lib/librte_vhost/virtio_net.c                  | 2 +-
 10 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/app/test-compress-perf/comp_perf_test_common.c b/app/test-compress-perf/comp_perf_test_common.c
index b402a0d..b1eb733 100644
--- a/app/test-compress-perf/comp_perf_test_common.c
+++ b/app/test-compress-perf/comp_perf_test_common.c
@@ -115,7 +115,7 @@ struct cperf_buffer_info {
 }
 
 static void
-comp_perf_extbuf_free_cb(void *addr __rte_unused, void *opaque __rte_unused)
+comp_perf_extbuf_free_cb(struct rte_mbuf *m __rte_unused, void *opaque __rte_unused)
 {
 }
 
diff --git a/app/test/test_compressdev.c b/app/test/test_compressdev.c
index 0571c17..a4a5d7c 100644
--- a/app/test/test_compressdev.c
+++ b/app/test/test_compressdev.c
@@ -778,7 +778,7 @@ struct test_private_arrays {
 }
 
 static void
-extbuf_free_callback(void *addr __rte_unused, void *opaque __rte_unused)
+extbuf_free_callback(struct rte_mbuf *m __rte_unused, void *opaque __rte_unused)
 {
 }
 
diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 06e44f0..f9e5ca5 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -2300,7 +2300,7 @@ struct test_case {
 
 /* Define a free call back function to be used for external buffer */
 static void
-ext_buf_free_callback_fn(void *addr __rte_unused, void *opaque)
+ext_buf_free_callback_fn(struct rte_mbuf *caller_m __rte_unused, void *opaque)
 {
 	void *ext_buf_addr = opaque;
 
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 3eb0243..f084488 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1622,7 +1622,7 @@ enum mlx5_txcmp_code {
 }
 
 void
-mlx5_mprq_buf_free_cb(void *addr __rte_unused, void *opaque)
+mlx5_mprq_buf_free_cb(struct rte_mbuf *caller_m __rte_unused, void *opaque)
 {
 	struct mlx5_mprq_buf *buf = opaque;
 
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index c02a007..8c230c6 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -499,7 +499,7 @@ struct mlx5_txq_ctrl *mlx5_txq_hairpin_new
 uint16_t mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n);
 void mlx5_rxq_initialize(struct mlx5_rxq_data *rxq);
 __rte_noinline int mlx5_rx_err_handle(struct mlx5_rxq_data *rxq, uint8_t vec);
-void mlx5_mprq_buf_free_cb(void *addr, void *opaque);
+void mlx5_mprq_buf_free_cb(struct rte_mbuf *caller_m, void *opaque);
 void mlx5_mprq_buf_free(struct mlx5_mprq_buf *buf);
 uint16_t mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts,
 			    uint16_t pkts_n);
diff --git a/drivers/net/netvsc/hn_rxtx.c b/drivers/net/netvsc/hn_rxtx.c
index 86a4c0d..30af404 100644
--- a/drivers/net/netvsc/hn_rxtx.c
+++ b/drivers/net/netvsc/hn_rxtx.c
@@ -519,7 +519,8 @@ int hn_dev_tx_descriptor_status(void *arg, uint16_t offset)
 	return 0;
 }
 
-static void hn_rx_buf_free_cb(void *buf __rte_unused, void *opaque)
+static void hn_rx_buf_free_cb(struct rte_mbuf *caller_m __rte_unused,
+			      void *opaque)
 {
 	struct hn_rx_bufinfo *rxb = opaque;
 	struct hn_data *hv = rxb->hv;
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 8a456e5..52445b3 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -118,11 +118,11 @@
  * buffer.
  */
 static void
-rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
+rte_pktmbuf_free_pinned_extmem(struct rte_mbuf *caller_m, void *opaque)
 {
 	struct rte_mbuf *m = opaque;
 
-	RTE_SET_USED(addr);
+	RTE_SET_USED(caller_m);
 	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
 	RTE_ASSERT(RTE_MBUF_HAS_PINNED_EXTBUF(m));
 	RTE_ASSERT(m->shinfo->fcb_opaque == m);
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 7259575..5f8626f 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1193,7 +1193,7 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 	RTE_ASSERT(m->shinfo != NULL);
 
 	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
-		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
+		m->shinfo->free_cb(m, m->shinfo->fcb_opaque);
 }
 
 /**
diff --git a/lib/librte_mbuf/rte_mbuf_core.h b/lib/librte_mbuf/rte_mbuf_core.h
index 8cd7137..d194429 100644
--- a/lib/librte_mbuf/rte_mbuf_core.h
+++ b/lib/librte_mbuf/rte_mbuf_core.h
@@ -671,7 +671,7 @@ struct rte_mbuf {
 /**
  * Function typedef of callback to free externally attached buffer.
  */
-typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
+typedef void (*rte_mbuf_extbuf_free_callback_t)(struct rte_mbuf *, void *);
 
 /**
  * Shared data at the end of an external buffer.
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 95a0bc1..e663fd4 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2137,7 +2137,7 @@ uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
 }
 
 static void
-virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
+virtio_dev_extbuf_free(struct rte_mbuf * caller_m __rte_unused, void *opaque)
 {
 	rte_free(opaque);
 }
-- 
1.8.3.1
^ permalink raw reply	[flat|nested] 21+ messages in thread
* [dpdk-dev] [PATCH V1 3/3] vhost: use new free_cb interface to fix mbuf free issue
  2020-07-30 12:08 [dpdk-dev] [PATCH V1 0/3] Fix external mbuf free issue in GSO case yang_y_yi
  2020-07-30 12:08 ` [dpdk-dev] [PATCH V1 1/3] gso: fix refcnt update issue for external mbuf yang_y_yi
  2020-07-30 12:08 ` [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case yang_y_yi
@ 2020-07-30 12:09 ` yang_y_yi
  2 siblings, 0 replies; 21+ messages in thread
From: yang_y_yi @ 2020-07-30 12:09 UTC (permalink / raw)
  To: dev; +Cc: jiayu.hu, thomas, yangyi01, yang_y_yi
From: Yi Yang <yangyi01@inspur.com>
New free_cb interface can help fix original external
mbuf free issue in GSO case, it still can be compatible
with normal non-GSO case. dpdkvhostuser port can work
both in GSO case and in non-GSO case by this fix.
Signed-off-by: Yi Yang <yangyi01@inspur.com>
---
 lib/librte_vhost/virtio_net.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index e663fd4..3b69cbb 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2136,10 +2136,20 @@ uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
 	return NULL;
 }
 
+struct shinfo_arg {
+	void *buf;
+	struct rte_mbuf *mbuf;
+};
+
 static void
-virtio_dev_extbuf_free(struct rte_mbuf * caller_m __rte_unused, void *opaque)
+virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
 {
-	rte_free(opaque);
+	struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
+
+	rte_free(arg->buf);
+	if (caller_m != arg->mbuf)
+		rte_pktmbuf_free(arg->mbuf);
+	rte_free(arg);
 }
 
 static int
@@ -2172,8 +2182,14 @@ uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
 
 	/* Initialize shinfo */
 	if (shinfo) {
+		struct shinfo_arg *arg = (struct shinfo_arg *)
+			rte_malloc(NULL, sizeof(struct shinfo_arg),
+				   RTE_CACHE_LINE_SIZE);
+
+		arg->buf = buf;
+		arg->mbuf = pkt;
 		shinfo->free_cb = virtio_dev_extbuf_free;
-		shinfo->fcb_opaque = buf;
+		shinfo->fcb_opaque = arg;
 		rte_mbuf_ext_refcnt_set(shinfo, 1);
 	} else {
 		shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
-- 
1.8.3.1
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-07-30 12:08 ` [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case yang_y_yi
@ 2020-07-31 15:15   ` Olivier Matz
  2020-08-01 23:12     ` yang_y_yi
  0 siblings, 1 reply; 21+ messages in thread
From: Olivier Matz @ 2020-07-31 15:15 UTC (permalink / raw)
  To: yang_y_yi; +Cc: dev, jiayu.hu, thomas, yangyi01
Hi,
On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
> From: Yi Yang <yangyi01@inspur.com>
> 
> In GSO case, segmented mbufs are attached to original
> mbuf which can't be freed when it is external. The issue
> is free_cb doesn't know original mbuf and doesn't free
> it when refcnt of shinfo is 0.
> 
> Original mbuf can be freed by rte_pktmbuf_free segmented
> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
> cases should have different behaviors. free_cb won't
> explicitly call rte_pktmbuf_free to free original mbuf
> if it is freed by rte_pktmbuf_free original mbuf, but it
> has to call rte_pktmbuf_free to free original mbuf if it
> is freed by rte_pktmbuf_free segmented mbufs.
> 
> In order to fix this issue, free_cb interface has to been
> changed, __rte_pktmbuf_free_extbuf must deliver called
> mbuf pointer to free_cb, argument opaque can be defined
> as a custom struct by user, it can includes original mbuf
> pointer, user-defined free_cb can compare caller mbuf with
> mbuf in opaque struct, free_cb should free original mbuf
> if they are not same, this corresponds to rte_pktmbuf_free
> segmented mbufs case, otherwise, free_cb won't free original
> mbuf because the caller explicitly called rte_pktmbuf_free
> to free it.
> 
> Here is pseduo code to show two kind of cases.
> 
> case 1. rte_pktmbuf_free segmented mbufs
> 
> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
>                       &gso_ctx,
>                       /* segmented mbuf */
>                       (struct rte_mbuf **)&gso_mbufs,
>                       MAX_GSO_MBUFS);
I'm sorry but it is not very clear to me what operations are done by
rte_gso_segment().
In the current code, I only see calls to rte_pktmbuf_attach(),
which do not deal with external buffers. Am I missing something?
Are you able to show the issue only with mbuf functions? It would
be helpful to understand what does not work.
Thanks,
Olivier
> rte_eth_tx_burst(dev->port_id, qid, gso_mbufs, nb_tx);
> 
> case 2. rte_pktmbuf_free original mbuf
> 
> rte_eth_tx_burst(dev->port_id, qid, original_mbuf, 1);
> 
> Here are user defined free_cb and opaque.
> 
> struct shinfo_arg {
> 	void *buf;
> 	struct rte_mbuf *mbuf;
> };
> 
> custom_free_cb(struct rte_mbuf *caller_m, void *opaque)
> {
> 	struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
> 
> 	rte_free(arg->buf);
> 	if (caller_m != arg->mbuf)
> 		rte_pktmbuf_free(arg->mbuf);
> 	rte_free(arg);
> }
> 
> struct shinfo_arg *arg = (struct shinfo_arg *)
> rte_malloc(NULL, sizeof(struct shinfo_arg),
> 	   RTE_CACHE_LINE_SIZE);
> 
> arg->buf = buf;
> arg->mbuf = pkt;
> shinfo->free_cb = custom_free_cb;
> shinfo->fcb_opaque = arg;
> 
> Signed-off-by: Yi Yang <yangyi01@inspur.com>
> ---
>  app/test-compress-perf/comp_perf_test_common.c | 2 +-
>  app/test/test_compressdev.c                    | 2 +-
>  app/test/test_mbuf.c                           | 2 +-
>  drivers/net/mlx5/mlx5_rxtx.c                   | 2 +-
>  drivers/net/mlx5/mlx5_rxtx.h                   | 2 +-
>  drivers/net/netvsc/hn_rxtx.c                   | 3 ++-
>  lib/librte_mbuf/rte_mbuf.c                     | 4 ++--
>  lib/librte_mbuf/rte_mbuf.h                     | 2 +-
>  lib/librte_mbuf/rte_mbuf_core.h                | 2 +-
>  lib/librte_vhost/virtio_net.c                  | 2 +-
>  10 files changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/app/test-compress-perf/comp_perf_test_common.c b/app/test-compress-perf/comp_perf_test_common.c
> index b402a0d..b1eb733 100644
> --- a/app/test-compress-perf/comp_perf_test_common.c
> +++ b/app/test-compress-perf/comp_perf_test_common.c
> @@ -115,7 +115,7 @@ struct cperf_buffer_info {
>  }
>  
>  static void
> -comp_perf_extbuf_free_cb(void *addr __rte_unused, void *opaque __rte_unused)
> +comp_perf_extbuf_free_cb(struct rte_mbuf *m __rte_unused, void *opaque __rte_unused)
>  {
>  }
>  
> diff --git a/app/test/test_compressdev.c b/app/test/test_compressdev.c
> index 0571c17..a4a5d7c 100644
> --- a/app/test/test_compressdev.c
> +++ b/app/test/test_compressdev.c
> @@ -778,7 +778,7 @@ struct test_private_arrays {
>  }
>  
>  static void
> -extbuf_free_callback(void *addr __rte_unused, void *opaque __rte_unused)
> +extbuf_free_callback(struct rte_mbuf *m __rte_unused, void *opaque __rte_unused)
>  {
>  }
>  
> diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
> index 06e44f0..f9e5ca5 100644
> --- a/app/test/test_mbuf.c
> +++ b/app/test/test_mbuf.c
> @@ -2300,7 +2300,7 @@ struct test_case {
>  
>  /* Define a free call back function to be used for external buffer */
>  static void
> -ext_buf_free_callback_fn(void *addr __rte_unused, void *opaque)
> +ext_buf_free_callback_fn(struct rte_mbuf *caller_m __rte_unused, void *opaque)
>  {
>  	void *ext_buf_addr = opaque;
>  
> diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> index 3eb0243..f084488 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.c
> +++ b/drivers/net/mlx5/mlx5_rxtx.c
> @@ -1622,7 +1622,7 @@ enum mlx5_txcmp_code {
>  }
>  
>  void
> -mlx5_mprq_buf_free_cb(void *addr __rte_unused, void *opaque)
> +mlx5_mprq_buf_free_cb(struct rte_mbuf *caller_m __rte_unused, void *opaque)
>  {
>  	struct mlx5_mprq_buf *buf = opaque;
>  
> diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> index c02a007..8c230c6 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.h
> +++ b/drivers/net/mlx5/mlx5_rxtx.h
> @@ -499,7 +499,7 @@ struct mlx5_txq_ctrl *mlx5_txq_hairpin_new
>  uint16_t mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n);
>  void mlx5_rxq_initialize(struct mlx5_rxq_data *rxq);
>  __rte_noinline int mlx5_rx_err_handle(struct mlx5_rxq_data *rxq, uint8_t vec);
> -void mlx5_mprq_buf_free_cb(void *addr, void *opaque);
> +void mlx5_mprq_buf_free_cb(struct rte_mbuf *caller_m, void *opaque);
>  void mlx5_mprq_buf_free(struct mlx5_mprq_buf *buf);
>  uint16_t mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts,
>  			    uint16_t pkts_n);
> diff --git a/drivers/net/netvsc/hn_rxtx.c b/drivers/net/netvsc/hn_rxtx.c
> index 86a4c0d..30af404 100644
> --- a/drivers/net/netvsc/hn_rxtx.c
> +++ b/drivers/net/netvsc/hn_rxtx.c
> @@ -519,7 +519,8 @@ int hn_dev_tx_descriptor_status(void *arg, uint16_t offset)
>  	return 0;
>  }
>  
> -static void hn_rx_buf_free_cb(void *buf __rte_unused, void *opaque)
> +static void hn_rx_buf_free_cb(struct rte_mbuf *caller_m __rte_unused,
> +			      void *opaque)
>  {
>  	struct hn_rx_bufinfo *rxb = opaque;
>  	struct hn_data *hv = rxb->hv;
> diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
> index 8a456e5..52445b3 100644
> --- a/lib/librte_mbuf/rte_mbuf.c
> +++ b/lib/librte_mbuf/rte_mbuf.c
> @@ -118,11 +118,11 @@
>   * buffer.
>   */
>  static void
> -rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
> +rte_pktmbuf_free_pinned_extmem(struct rte_mbuf *caller_m, void *opaque)
>  {
>  	struct rte_mbuf *m = opaque;
>  
> -	RTE_SET_USED(addr);
> +	RTE_SET_USED(caller_m);
>  	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
>  	RTE_ASSERT(RTE_MBUF_HAS_PINNED_EXTBUF(m));
>  	RTE_ASSERT(m->shinfo->fcb_opaque == m);
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 7259575..5f8626f 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -1193,7 +1193,7 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>  	RTE_ASSERT(m->shinfo != NULL);
>  
>  	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
> -		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
> +		m->shinfo->free_cb(m, m->shinfo->fcb_opaque);
>  }
>  
>  /**
> diff --git a/lib/librte_mbuf/rte_mbuf_core.h b/lib/librte_mbuf/rte_mbuf_core.h
> index 8cd7137..d194429 100644
> --- a/lib/librte_mbuf/rte_mbuf_core.h
> +++ b/lib/librte_mbuf/rte_mbuf_core.h
> @@ -671,7 +671,7 @@ struct rte_mbuf {
>  /**
>   * Function typedef of callback to free externally attached buffer.
>   */
> -typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
> +typedef void (*rte_mbuf_extbuf_free_callback_t)(struct rte_mbuf *, void *);
>  
>  /**
>   * Shared data at the end of an external buffer.
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 95a0bc1..e663fd4 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -2137,7 +2137,7 @@ uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
>  }
>  
>  static void
> -virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
> +virtio_dev_extbuf_free(struct rte_mbuf * caller_m __rte_unused, void *opaque)
>  {
>  	rte_free(opaque);
>  }
> -- 
> 1.8.3.1
> 
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-07-31 15:15   ` Olivier Matz
@ 2020-08-01 23:12     ` yang_y_yi
  2020-08-02 20:29       ` Olivier Matz
  0 siblings, 1 reply; 21+ messages in thread
From: yang_y_yi @ 2020-08-01 23:12 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, jiayu.hu, thomas, yangyi01
At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>Hi,
>
>On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
>> From: Yi Yang <yangyi01@inspur.com>
>> 
>> In GSO case, segmented mbufs are attached to original
>> mbuf which can't be freed when it is external. The issue
>> is free_cb doesn't know original mbuf and doesn't free
>> it when refcnt of shinfo is 0.
>> 
>> Original mbuf can be freed by rte_pktmbuf_free segmented
>> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
>> cases should have different behaviors. free_cb won't
>> explicitly call rte_pktmbuf_free to free original mbuf
>> if it is freed by rte_pktmbuf_free original mbuf, but it
>> has to call rte_pktmbuf_free to free original mbuf if it
>> is freed by rte_pktmbuf_free segmented mbufs.
>> 
>> In order to fix this issue, free_cb interface has to been
>> changed, __rte_pktmbuf_free_extbuf must deliver called
>> mbuf pointer to free_cb, argument opaque can be defined
>> as a custom struct by user, it can includes original mbuf
>> pointer, user-defined free_cb can compare caller mbuf with
>> mbuf in opaque struct, free_cb should free original mbuf
>> if they are not same, this corresponds to rte_pktmbuf_free
>> segmented mbufs case, otherwise, free_cb won't free original
>> mbuf because the caller explicitly called rte_pktmbuf_free
>> to free it.
>> 
>> Here is pseduo code to show two kind of cases.
>> 
>> case 1. rte_pktmbuf_free segmented mbufs
>> 
>> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
>>                       &gso_ctx,
>>                       /* segmented mbuf */
>>                       (struct rte_mbuf **)&gso_mbufs,
>>                       MAX_GSO_MBUFS);
>
>I'm sorry but it is not very clear to me what operations are done by
>rte_gso_segment().
>
>In the current code, I only see calls to rte_pktmbuf_attach(),
>which do not deal with external buffers. Am I missing something?
>
>Are you able to show the issue only with mbuf functions? It would
>be helpful to understand what does not work.
>
>
>Thanks,
>Olivier
>
Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c index e663fd4..3b69cbb 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2136,10 +2136,20 @@ uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
                return NULL;
 }
 
+struct shinfo_arg {
+             void *buf;
+             struct rte_mbuf *mbuf;
+};
+
 static void
-virtio_dev_extbuf_free(struct rte_mbuf * caller_m __rte_unused, void *opaque)
+virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
 {
-              rte_free(opaque);
+             struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
+
+             rte_free(arg->buf);
+             if (caller_m != arg->mbuf)
+                             rte_pktmbuf_free(arg->mbuf);
+             rte_free(arg);
 }
 
 static int
@@ -2172,8 +2182,14 @@ uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
 
                /* Initialize shinfo */
                if (shinfo) {
+                             struct shinfo_arg *arg = (struct shinfo_arg *)
+                                             rte_malloc(NULL, sizeof(struct shinfo_arg),
+                                                                RTE_CACHE_LINE_SIZE);
+
+                             arg->buf = buf;
+                             arg->mbuf = pkt;
                                shinfo->free_cb = virtio_dev_extbuf_free;
-                              shinfo->fcb_opaque = buf;
+                             shinfo->fcb_opaque = arg;
                                rte_mbuf_ext_refcnt_set(shinfo, 1);
                } else {
                                shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
--
1.8.3.1
/*
 * Allocate a host supported pktmbuf.
 */
static __rte_always_inline struct rte_mbuf *
virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
                         uint32_t data_len)
{
        struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
        if (unlikely(pkt == NULL)) {
                VHOST_LOG_DATA(ERR,
                        "Failed to allocate memory for mbuf.\n");
                return NULL;
        }
        if (rte_pktmbuf_tailroom(pkt) >= data_len)
                return pkt;
        /* attach an external buffer if supported */
        if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
                return pkt;
        /* check if chained buffers are allowed */
        if (!dev->linearbuf)
                return pkt;
        /* Data doesn't fit into the buffer and the host supports
         * only linear buffers
         */
        rte_pktmbuf_free(pkt);
        return NULL;
}
 
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-08-01 23:12     ` yang_y_yi
@ 2020-08-02 20:29       ` Olivier Matz
  2020-08-02 20:45         ` Olivier Matz
  2020-08-03  1:26         ` yang_y_yi
  0 siblings, 2 replies; 21+ messages in thread
From: Olivier Matz @ 2020-08-02 20:29 UTC (permalink / raw)
  To: yang_y_yi; +Cc: dev, jiayu.hu, thomas, yangyi01
Hi,
On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
> 
> 
> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >Hi,
> >
> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
> >> From: Yi Yang <yangyi01@inspur.com>
> >> 
> >> In GSO case, segmented mbufs are attached to original
> >> mbuf which can't be freed when it is external. The issue
> >> is free_cb doesn't know original mbuf and doesn't free
> >> it when refcnt of shinfo is 0.
> >> 
> >> Original mbuf can be freed by rte_pktmbuf_free segmented
> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
> >> cases should have different behaviors. free_cb won't
> >> explicitly call rte_pktmbuf_free to free original mbuf
> >> if it is freed by rte_pktmbuf_free original mbuf, but it
> >> has to call rte_pktmbuf_free to free original mbuf if it
> >> is freed by rte_pktmbuf_free segmented mbufs.
> >> 
> >> In order to fix this issue, free_cb interface has to been
> >> changed, __rte_pktmbuf_free_extbuf must deliver called
> >> mbuf pointer to free_cb, argument opaque can be defined
> >> as a custom struct by user, it can includes original mbuf
> >> pointer, user-defined free_cb can compare caller mbuf with
> >> mbuf in opaque struct, free_cb should free original mbuf
> >> if they are not same, this corresponds to rte_pktmbuf_free
> >> segmented mbufs case, otherwise, free_cb won't free original
> >> mbuf because the caller explicitly called rte_pktmbuf_free
> >> to free it.
> >> 
> >> Here is pseduo code to show two kind of cases.
> >> 
> >> case 1. rte_pktmbuf_free segmented mbufs
> >> 
> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
> >>                       &gso_ctx,
> >>                       /* segmented mbuf */
> >>                       (struct rte_mbuf **)&gso_mbufs,
> >>                       MAX_GSO_MBUFS);
> >
> >I'm sorry but it is not very clear to me what operations are done by
> >rte_gso_segment().
> >
> >In the current code, I only see calls to rte_pktmbuf_attach(),
> >which do not deal with external buffers. Am I missing something?
> >
> >Are you able to show the issue only with mbuf functions? It would
> >be helpful to understand what does not work.
> >
> >
> >Thanks,
> >Olivier
> >
> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
I'm sill not sure to get your issue. Please, if you have a simple test
case using only mbufs functions (without virtio, gso, ...), it would be
very helpful because we will be sure that we are talking about the same
thing. In case there is an issue, it can easily become a unit test.
That said, I looked at vhost mbuf allocation and gso segmentation, and
I found some strange things:
1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
   ext mbuf.
   a/ The first one stores the shinfo struct in the mbuf, basically
      like this:
	pkt = rte_pktmbuf_alloc(mp);
	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
	shinfo->free_cb = virtio_dev_extbuf_free;
	shinfo->fcb_opaque = buf;
	rte_mbuf_ext_refcnt_set(shinfo, 1);
      I don't think it is a good idea, because there is no guarantee that
      the mbuf won't be freed before the buffer. For instance, doing
      this will probably fail:
	pkt2 = rte_pktmbuf_alloc(mp);
	rte_pktmbuf_attach(pkt2, pkt);
	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
      To do this properly, the mbuf refcnt should be increased, and
      the mbuf should be freed in the callback. But I don't think it's
      worth doing it, given the second path (b/) looks good to me.
   b/ The second path stores the shinfo struct at the end of the
      allocated buffer, like this:
	pkt = rte_pktmbuf_alloc(mp);
	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
					      virtio_dev_extbuf_free, buf);
      I think this is correct, because we have the guarantee that shinfo
      exists as long as the buffer exists.
2/ in rte_gso_segment(), there is a loop like this:
	while (pkt_seg) {
		rte_mbuf_refcnt_update(pkt_seg, -1);
		pkt_seg = pkt_seg->next;
	}
   You change it to take in account the refcnt for ext mbufs.
   I may have missed something but I wonder why it is not simply:
	rte_pktmbuf_free(pkt_seg);
   It will decrease the proper refcnt, and free the mbufs if they
   are not used.
Again, sorry if this is not the issue your are referring to, but
in this case I really think that having a simple example code that
shows the issue would help.
Regards,
Olivier
> 
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c index e663fd4..3b69cbb 100644
> 
> --- a/lib/librte_vhost/virtio_net.c
> 
> +++ b/lib/librte_vhost/virtio_net.c
> 
> @@ -2136,10 +2136,20 @@ uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> 
>                 return NULL;
> 
>  }
> 
>  
> 
> +struct shinfo_arg {
> 
> +             void *buf;
> 
> +             struct rte_mbuf *mbuf;
> 
> +};
> 
> +
> 
>  static void
> 
> -virtio_dev_extbuf_free(struct rte_mbuf * caller_m __rte_unused, void *opaque)
> 
> +virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
> 
>  {
> 
> -              rte_free(opaque);
> 
> +             struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
> 
> +
> 
> +             rte_free(arg->buf);
> 
> +             if (caller_m != arg->mbuf)
> 
> +                             rte_pktmbuf_free(arg->mbuf);
> 
> +             rte_free(arg);
> 
>  }
> 
>  
> 
>  static int
> 
> @@ -2172,8 +2182,14 @@ uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> 
>  
> 
>                 /* Initialize shinfo */
> 
>                 if (shinfo) {
> 
> +                             struct shinfo_arg *arg = (struct shinfo_arg *)
> 
> +                                             rte_malloc(NULL, sizeof(struct shinfo_arg),
> 
> +                                                                RTE_CACHE_LINE_SIZE);
> 
> +
> 
> +                             arg->buf = buf;
> 
> +                             arg->mbuf = pkt;
> 
>                                 shinfo->free_cb = virtio_dev_extbuf_free;
> 
> -                              shinfo->fcb_opaque = buf;
> 
> +                             shinfo->fcb_opaque = arg;
> 
>                                 rte_mbuf_ext_refcnt_set(shinfo, 1);
> 
>                 } else {
> 
>                                 shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> 
> --
> 
> 1.8.3.1
> 
> 
> 
> /*
>  * Allocate a host supported pktmbuf.
>  */
> static __rte_always_inline struct rte_mbuf *
> virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
>                          uint32_t data_len)
> {
>         struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> 
>         if (unlikely(pkt == NULL)) {
>                 VHOST_LOG_DATA(ERR,
>                         "Failed to allocate memory for mbuf.\n");
>                 return NULL;
>         }
> 
>         if (rte_pktmbuf_tailroom(pkt) >= data_len)
>                 return pkt;
> 
>         /* attach an external buffer if supported */
>         if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
>                 return pkt;
> 
>         /* check if chained buffers are allowed */
>         if (!dev->linearbuf)
>                 return pkt;
> 
>         /* Data doesn't fit into the buffer and the host supports
>          * only linear buffers
>          */
>         rte_pktmbuf_free(pkt);
> 
>         return NULL;
> }
>  
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-08-02 20:29       ` Olivier Matz
@ 2020-08-02 20:45         ` Olivier Matz
  2020-08-03  1:32           ` yang_y_yi
  2020-08-03  1:26         ` yang_y_yi
  1 sibling, 1 reply; 21+ messages in thread
From: Olivier Matz @ 2020-08-02 20:45 UTC (permalink / raw)
  To: yang_y_yi; +Cc: dev, jiayu.hu, thomas, yangyi01
On Sun, Aug 02, 2020 at 10:29:07PM +0200, Olivier Matz wrote:
> Hi,
> 
> On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
> > 
> > 
> > At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> > >Hi,
> > >
> > >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
> > >> From: Yi Yang <yangyi01@inspur.com>
> > >> 
> > >> In GSO case, segmented mbufs are attached to original
> > >> mbuf which can't be freed when it is external. The issue
> > >> is free_cb doesn't know original mbuf and doesn't free
> > >> it when refcnt of shinfo is 0.
> > >> 
> > >> Original mbuf can be freed by rte_pktmbuf_free segmented
> > >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
> > >> cases should have different behaviors. free_cb won't
> > >> explicitly call rte_pktmbuf_free to free original mbuf
> > >> if it is freed by rte_pktmbuf_free original mbuf, but it
> > >> has to call rte_pktmbuf_free to free original mbuf if it
> > >> is freed by rte_pktmbuf_free segmented mbufs.
> > >> 
> > >> In order to fix this issue, free_cb interface has to been
> > >> changed, __rte_pktmbuf_free_extbuf must deliver called
> > >> mbuf pointer to free_cb, argument opaque can be defined
> > >> as a custom struct by user, it can includes original mbuf
> > >> pointer, user-defined free_cb can compare caller mbuf with
> > >> mbuf in opaque struct, free_cb should free original mbuf
> > >> if they are not same, this corresponds to rte_pktmbuf_free
> > >> segmented mbufs case, otherwise, free_cb won't free original
> > >> mbuf because the caller explicitly called rte_pktmbuf_free
> > >> to free it.
> > >> 
> > >> Here is pseduo code to show two kind of cases.
> > >> 
> > >> case 1. rte_pktmbuf_free segmented mbufs
> > >> 
> > >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
> > >>                       &gso_ctx,
> > >>                       /* segmented mbuf */
> > >>                       (struct rte_mbuf **)&gso_mbufs,
> > >>                       MAX_GSO_MBUFS);
> > >
> > >I'm sorry but it is not very clear to me what operations are done by
> > >rte_gso_segment().
> > >
> > >In the current code, I only see calls to rte_pktmbuf_attach(),
> > >which do not deal with external buffers. Am I missing something?
> > >
> > >Are you able to show the issue only with mbuf functions? It would
> > >be helpful to understand what does not work.
> > >
> > >
> > >Thanks,
> > >Olivier
> > >
> > Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
> > virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
> 
> I'm sill not sure to get your issue. Please, if you have a simple test
> case using only mbufs functions (without virtio, gso, ...), it would be
> very helpful because we will be sure that we are talking about the same
> thing. In case there is an issue, it can easily become a unit test.
> 
> That said, I looked at vhost mbuf allocation and gso segmentation, and
> I found some strange things:
> 
> 1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
>    ext mbuf.
> 
>    a/ The first one stores the shinfo struct in the mbuf, basically
>       like this:
> 
> 	pkt = rte_pktmbuf_alloc(mp);
> 	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> 	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> 	shinfo->free_cb = virtio_dev_extbuf_free;
> 	shinfo->fcb_opaque = buf;
> 	rte_mbuf_ext_refcnt_set(shinfo, 1);
> 
>       I don't think it is a good idea, because there is no guarantee that
>       the mbuf won't be freed before the buffer. For instance, doing
>       this will probably fail:
> 
> 	pkt2 = rte_pktmbuf_alloc(mp);
> 	rte_pktmbuf_attach(pkt2, pkt);
> 	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
> 
>       To do this properly, the mbuf refcnt should be increased, and
>       the mbuf should be freed in the callback. But I don't think it's
>       worth doing it, given the second path (b/) looks good to me.
> 
>    b/ The second path stores the shinfo struct at the end of the
>       allocated buffer, like this:
> 
> 	pkt = rte_pktmbuf_alloc(mp);
> 	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
> 	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> 	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> 	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> 					      virtio_dev_extbuf_free, buf);
> 
>       I think this is correct, because we have the guarantee that shinfo
>       exists as long as the buffer exists.
> 
> 2/ in rte_gso_segment(), there is a loop like this:
> 
> 	while (pkt_seg) {
> 		rte_mbuf_refcnt_update(pkt_seg, -1);
> 		pkt_seg = pkt_seg->next;
> 	}
> 
>    You change it to take in account the refcnt for ext mbufs.
> 
>    I may have missed something but I wonder why it is not simply:
> 
> 	rte_pktmbuf_free(pkt_seg);
Here, I mean rte_pktmbuf_free(pkt)
(one free() call for the mbuf chain, not one per segment)
> 
>    It will decrease the proper refcnt, and free the mbufs if they
>    are not used.
> 
> Again, sorry if this is not the issue your are referring to, but
> in this case I really think that having a simple example code that
> shows the issue would help.
> 
> Regards,
> Olivier
> 
> 
> 
> > 
> > diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c index e663fd4..3b69cbb 100644
> > 
> > --- a/lib/librte_vhost/virtio_net.c
> > 
> > +++ b/lib/librte_vhost/virtio_net.c
> > 
> > @@ -2136,10 +2136,20 @@ uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> > 
> >                 return NULL;
> > 
> >  }
> > 
> >  
> > 
> > +struct shinfo_arg {
> > 
> > +             void *buf;
> > 
> > +             struct rte_mbuf *mbuf;
> > 
> > +};
> > 
> > +
> > 
> >  static void
> > 
> > -virtio_dev_extbuf_free(struct rte_mbuf * caller_m __rte_unused, void *opaque)
> > 
> > +virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
> > 
> >  {
> > 
> > -              rte_free(opaque);
> > 
> > +             struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
> > 
> > +
> > 
> > +             rte_free(arg->buf);
> > 
> > +             if (caller_m != arg->mbuf)
> > 
> > +                             rte_pktmbuf_free(arg->mbuf);
> > 
> > +             rte_free(arg);
> > 
> >  }
> > 
> >  
> > 
> >  static int
> > 
> > @@ -2172,8 +2182,14 @@ uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> > 
> >  
> > 
> >                 /* Initialize shinfo */
> > 
> >                 if (shinfo) {
> > 
> > +                             struct shinfo_arg *arg = (struct shinfo_arg *)
> > 
> > +                                             rte_malloc(NULL, sizeof(struct shinfo_arg),
> > 
> > +                                                                RTE_CACHE_LINE_SIZE);
> > 
> > +
> > 
> > +                             arg->buf = buf;
> > 
> > +                             arg->mbuf = pkt;
> > 
> >                                 shinfo->free_cb = virtio_dev_extbuf_free;
> > 
> > -                              shinfo->fcb_opaque = buf;
> > 
> > +                             shinfo->fcb_opaque = arg;
> > 
> >                                 rte_mbuf_ext_refcnt_set(shinfo, 1);
> > 
> >                 } else {
> > 
> >                                 shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> > 
> > --
> > 
> > 1.8.3.1
> > 
> > 
> > 
> > /*
> >  * Allocate a host supported pktmbuf.
> >  */
> > static __rte_always_inline struct rte_mbuf *
> > virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
> >                          uint32_t data_len)
> > {
> >         struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> > 
> >         if (unlikely(pkt == NULL)) {
> >                 VHOST_LOG_DATA(ERR,
> >                         "Failed to allocate memory for mbuf.\n");
> >                 return NULL;
> >         }
> > 
> >         if (rte_pktmbuf_tailroom(pkt) >= data_len)
> >                 return pkt;
> > 
> >         /* attach an external buffer if supported */
> >         if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
> >                 return pkt;
> > 
> >         /* check if chained buffers are allowed */
> >         if (!dev->linearbuf)
> >                 return pkt;
> > 
> >         /* Data doesn't fit into the buffer and the host supports
> >          * only linear buffers
> >          */
> >         rte_pktmbuf_free(pkt);
> > 
> >         return NULL;
> > }
> >  
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-08-02 20:29       ` Olivier Matz
  2020-08-02 20:45         ` Olivier Matz
@ 2020-08-03  1:26         ` yang_y_yi
  2020-08-03  8:11           ` Olivier Matz
  1 sibling, 1 reply; 21+ messages in thread
From: yang_y_yi @ 2020-08-03  1:26 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, jiayu.hu, thomas, yangyi01
At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>Hi,
>
>On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
>> 
>> 
>> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >Hi,
>> >
>> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
>> >> From: Yi Yang <yangyi01@inspur.com>
>> >> 
>> >> In GSO case, segmented mbufs are attached to original
>> >> mbuf which can't be freed when it is external. The issue
>> >> is free_cb doesn't know original mbuf and doesn't free
>> >> it when refcnt of shinfo is 0.
>> >> 
>> >> Original mbuf can be freed by rte_pktmbuf_free segmented
>> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
>> >> cases should have different behaviors. free_cb won't
>> >> explicitly call rte_pktmbuf_free to free original mbuf
>> >> if it is freed by rte_pktmbuf_free original mbuf, but it
>> >> has to call rte_pktmbuf_free to free original mbuf if it
>> >> is freed by rte_pktmbuf_free segmented mbufs.
>> >> 
>> >> In order to fix this issue, free_cb interface has to been
>> >> changed, __rte_pktmbuf_free_extbuf must deliver called
>> >> mbuf pointer to free_cb, argument opaque can be defined
>> >> as a custom struct by user, it can includes original mbuf
>> >> pointer, user-defined free_cb can compare caller mbuf with
>> >> mbuf in opaque struct, free_cb should free original mbuf
>> >> if they are not same, this corresponds to rte_pktmbuf_free
>> >> segmented mbufs case, otherwise, free_cb won't free original
>> >> mbuf because the caller explicitly called rte_pktmbuf_free
>> >> to free it.
>> >> 
>> >> Here is pseduo code to show two kind of cases.
>> >> 
>> >> case 1. rte_pktmbuf_free segmented mbufs
>> >> 
>> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
>> >>                       &gso_ctx,
>> >>                       /* segmented mbuf */
>> >>                       (struct rte_mbuf **)&gso_mbufs,
>> >>                       MAX_GSO_MBUFS);
>> >
>> >I'm sorry but it is not very clear to me what operations are done by
>> >rte_gso_segment().
>> >
>> >In the current code, I only see calls to rte_pktmbuf_attach(),
>> >which do not deal with external buffers. Am I missing something?
>> >
>> >Are you able to show the issue only with mbuf functions? It would
>> >be helpful to understand what does not work.
>> >
>> >
>> >Thanks,
>> >Olivier
>> >
>> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
>> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
>
>I'm sill not sure to get your issue. Please, if you have a simple test
>case using only mbufs functions (without virtio, gso, ...), it would be
>very helpful because we will be sure that we are talking about the same
>thing. In case there is an issue, it can easily become a unit test.
Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
will call rte_eth_tx_burst to send them finally.
>
>That said, I looked at vhost mbuf allocation and gso segmentation, and
>I found some strange things:
>
>1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
>   ext mbuf.
>
>   a/ The first one stores the shinfo struct in the mbuf, basically
>      like this:
>
>	pkt = rte_pktmbuf_alloc(mp);
>	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
>	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>	shinfo->free_cb = virtio_dev_extbuf_free;
>	shinfo->fcb_opaque = buf;
>	rte_mbuf_ext_refcnt_set(shinfo, 1);
>
>      I don't think it is a good idea, because there is no guarantee that
>      the mbuf won't be freed before the buffer. For instance, doing
>      this will probably fail:
>
>	pkt2 = rte_pktmbuf_alloc(mp);
>	rte_pktmbuf_attach(pkt2, pkt);
>	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
pkt is created by the application I can control, so I can control it where it will be freed, right?
>
>      To do this properly, the mbuf refcnt should be increased, and
>      the mbuf should be freed in the callback. But I don't think it's
>      worth doing it, given the second path (b/) looks good to me.
>
>   b/ The second path stores the shinfo struct at the end of the
>      allocated buffer, like this:
>
>	pkt = rte_pktmbuf_alloc(mp);
>	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
>	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
>	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
>					      virtio_dev_extbuf_free, buf);
>
>      I think this is correct, because we have the guarantee that shinfo
>      exists as long as the buffer exists.
What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
free original mbuf which owns shinfo buffer.
>
>2/ in rte_gso_segment(), there is a loop like this:
>
>	while (pkt_seg) {
>		rte_mbuf_refcnt_update(pkt_seg, -1);
>		pkt_seg = pkt_seg->next;
>	}
>
>   You change it to take in account the refcnt for ext mbufs.
>
>   I may have missed something but I wonder why it is not simply:
>
>	rte_pktmbuf_free(pkt_seg);
>
>   It will decrease the proper refcnt, and free the mbufs if they
>   are not used.
Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
mbuf free will trigger freeing original mbuf (only free_cb can do this).
>
>Again, sorry if this is not the issue your are referring to, but
>in this case I really think that having a simple example code that
>shows the issue would help.
Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
unit test can show it. Let me summarize it here again. For original mbuf, there are two cases freeing
it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
free_cb call which will free original large & extended mbuf. Case two is PMD driver will call free against
original mbuf, that also will call free_cb to free attached extended buffer. In case one free_cb must call
rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
>
>Regards,
>Olivier
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-08-02 20:45         ` Olivier Matz
@ 2020-08-03  1:32           ` yang_y_yi
  0 siblings, 0 replies; 21+ messages in thread
From: yang_y_yi @ 2020-08-03  1:32 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, jiayu.hu, thomas, yangyi01
At 2020-08-03 04:45:12, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>On Sun, Aug 02, 2020 at 10:29:07PM +0200, Olivier Matz wrote:
>> Hi,
>> 
>> On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
>> > 
>> > 
>> > At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> > >Hi,
>> > >
>> > >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
>> > >> From: Yi Yang <yangyi01@inspur.com>
>> > >> 
>> > >> In GSO case, segmented mbufs are attached to original
>> > >> mbuf which can't be freed when it is external. The issue
>> > >> is free_cb doesn't know original mbuf and doesn't free
>> > >> it when refcnt of shinfo is 0.
>> > >> 
>> > >> Original mbuf can be freed by rte_pktmbuf_free segmented
>> > >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
>> > >> cases should have different behaviors. free_cb won't
>> > >> explicitly call rte_pktmbuf_free to free original mbuf
>> > >> if it is freed by rte_pktmbuf_free original mbuf, but it
>> > >> has to call rte_pktmbuf_free to free original mbuf if it
>> > >> is freed by rte_pktmbuf_free segmented mbufs.
>> > >> 
>> > >> In order to fix this issue, free_cb interface has to been
>> > >> changed, __rte_pktmbuf_free_extbuf must deliver called
>> > >> mbuf pointer to free_cb, argument opaque can be defined
>> > >> as a custom struct by user, it can includes original mbuf
>> > >> pointer, user-defined free_cb can compare caller mbuf with
>> > >> mbuf in opaque struct, free_cb should free original mbuf
>> > >> if they are not same, this corresponds to rte_pktmbuf_free
>> > >> segmented mbufs case, otherwise, free_cb won't free original
>> > >> mbuf because the caller explicitly called rte_pktmbuf_free
>> > >> to free it.
>> > >> 
>> > >> Here is pseduo code to show two kind of cases.
>> > >> 
>> > >> case 1. rte_pktmbuf_free segmented mbufs
>> > >> 
>> > >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
>> > >>                       &gso_ctx,
>> > >>                       /* segmented mbuf */
>> > >>                       (struct rte_mbuf **)&gso_mbufs,
>> > >>                       MAX_GSO_MBUFS);
>> > >
>> > >I'm sorry but it is not very clear to me what operations are done by
>> > >rte_gso_segment().
>> > >
>> > >In the current code, I only see calls to rte_pktmbuf_attach(),
>> > >which do not deal with external buffers. Am I missing something?
>> > >
>> > >Are you able to show the issue only with mbuf functions? It would
>> > >be helpful to understand what does not work.
>> > >
>> > >
>> > >Thanks,
>> > >Olivier
>> > >
>> > Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
>> > virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
>> 
>> I'm sill not sure to get your issue. Please, if you have a simple test
>> case using only mbufs functions (without virtio, gso, ...), it would be
>> very helpful because we will be sure that we are talking about the same
>> thing. In case there is an issue, it can easily become a unit test.
>> 
>> That said, I looked at vhost mbuf allocation and gso segmentation, and
>> I found some strange things:
>> 
>> 1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
>>    ext mbuf.
>> 
>>    a/ The first one stores the shinfo struct in the mbuf, basically
>>       like this:
>> 
>> 	pkt = rte_pktmbuf_alloc(mp);
>> 	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
>> 	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> 	shinfo->free_cb = virtio_dev_extbuf_free;
>> 	shinfo->fcb_opaque = buf;
>> 	rte_mbuf_ext_refcnt_set(shinfo, 1);
>> 
>>       I don't think it is a good idea, because there is no guarantee that
>>       the mbuf won't be freed before the buffer. For instance, doing
>>       this will probably fail:
>> 
>> 	pkt2 = rte_pktmbuf_alloc(mp);
>> 	rte_pktmbuf_attach(pkt2, pkt);
>> 	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
>> 
>>       To do this properly, the mbuf refcnt should be increased, and
>>       the mbuf should be freed in the callback. But I don't think it's
>>       worth doing it, given the second path (b/) looks good to me.
>> 
>>    b/ The second path stores the shinfo struct at the end of the
>>       allocated buffer, like this:
>> 
>> 	pkt = rte_pktmbuf_alloc(mp);
>> 	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
>> 	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
>> 	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> 	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
>> 					      virtio_dev_extbuf_free, buf);
>> 
>>       I think this is correct, because we have the guarantee that shinfo
>>       exists as long as the buffer exists.
>> 
>> 2/ in rte_gso_segment(), there is a loop like this:
>> 
>> 	while (pkt_seg) {
>> 		rte_mbuf_refcnt_update(pkt_seg, -1);
>> 		pkt_seg = pkt_seg->next;
>> 	}
>> 
>>    You change it to take in account the refcnt for ext mbufs.
>> 
>>    I may have missed something but I wonder why it is not simply:
>> 
>> 	rte_pktmbuf_free(pkt_seg);
>
>Here, I mean rte_pktmbuf_free(pkt)
>(one free() call for the mbuf chain, not one per segment)
Oliver, what rte_gso_segment does do here is just to decrease refcnt by one in order that
last segmented mbuf free can trigger free_cb call, it is only one goal of the code here,
rte_gso_segment just segments a large mbuf to multiple small mbufs, it won't handle anything
else.
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-08-03  1:26         ` yang_y_yi
@ 2020-08-03  8:11           ` Olivier Matz
  2020-08-03  9:42             ` yang_y_yi
  0 siblings, 1 reply; 21+ messages in thread
From: Olivier Matz @ 2020-08-03  8:11 UTC (permalink / raw)
  To: yang_y_yi; +Cc: dev, jiayu.hu, thomas, yangyi01
On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >Hi,
> >
> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
> >> 
> >> 
> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >Hi,
> >> >
> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
> >> >> From: Yi Yang <yangyi01@inspur.com>
> >> >> 
> >> >> In GSO case, segmented mbufs are attached to original
> >> >> mbuf which can't be freed when it is external. The issue
> >> >> is free_cb doesn't know original mbuf and doesn't free
> >> >> it when refcnt of shinfo is 0.
> >> >> 
> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
> >> >> cases should have different behaviors. free_cb won't
> >> >> explicitly call rte_pktmbuf_free to free original mbuf
> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
> >> >> has to call rte_pktmbuf_free to free original mbuf if it
> >> >> is freed by rte_pktmbuf_free segmented mbufs.
> >> >> 
> >> >> In order to fix this issue, free_cb interface has to been
> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
> >> >> mbuf pointer to free_cb, argument opaque can be defined
> >> >> as a custom struct by user, it can includes original mbuf
> >> >> pointer, user-defined free_cb can compare caller mbuf with
> >> >> mbuf in opaque struct, free_cb should free original mbuf
> >> >> if they are not same, this corresponds to rte_pktmbuf_free
> >> >> segmented mbufs case, otherwise, free_cb won't free original
> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
> >> >> to free it.
> >> >> 
> >> >> Here is pseduo code to show two kind of cases.
> >> >> 
> >> >> case 1. rte_pktmbuf_free segmented mbufs
> >> >> 
> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
> >> >>                       &gso_ctx,
> >> >>                       /* segmented mbuf */
> >> >>                       (struct rte_mbuf **)&gso_mbufs,
> >> >>                       MAX_GSO_MBUFS);
> >> >
> >> >I'm sorry but it is not very clear to me what operations are done by
> >> >rte_gso_segment().
> >> >
> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
> >> >which do not deal with external buffers. Am I missing something?
> >> >
> >> >Are you able to show the issue only with mbuf functions? It would
> >> >be helpful to understand what does not work.
> >> >
> >> >
> >> >Thanks,
> >> >Olivier
> >> >
> >> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
> >
> >I'm sill not sure to get your issue. Please, if you have a simple test
> >case using only mbufs functions (without virtio, gso, ...), it would be
> >very helpful because we will be sure that we are talking about the same
> >thing. In case there is an issue, it can easily become a unit test.
> 
> Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
> will call rte_eth_tx_burst to send them finally.
>
> >
> >That said, I looked at vhost mbuf allocation and gso segmentation, and
> >I found some strange things:
> >
> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
> >   ext mbuf.
> >
> >   a/ The first one stores the shinfo struct in the mbuf, basically
> >      like this:
> >
> >	pkt = rte_pktmbuf_alloc(mp);
> >	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >	shinfo->free_cb = virtio_dev_extbuf_free;
> >	shinfo->fcb_opaque = buf;
> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
> >
> >      I don't think it is a good idea, because there is no guarantee that
> >      the mbuf won't be freed before the buffer. For instance, doing
> >      this will probably fail:
> >
> >	pkt2 = rte_pktmbuf_alloc(mp);
> >	rte_pktmbuf_attach(pkt2, pkt);
> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
> 
> pkt is created by the application I can control, so I can control it where it will be freed, right?
This example shows that mbufs allocated like this by the vhost
driver are not constructed correctly. If an application attach a new
packet (pkt2) to it and frees the original one (pkt), it may result in a
memory corruption.
Of course, to be tested and confirmed.
> 
> >
> >      To do this properly, the mbuf refcnt should be increased, and
> >      the mbuf should be freed in the callback. But I don't think it's
> >      worth doing it, given the second path (b/) looks good to me.
> >
> >   b/ The second path stores the shinfo struct at the end of the
> >      allocated buffer, like this:
> >
> >	pkt = rte_pktmbuf_alloc(mp);
> >	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
> >	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> >					      virtio_dev_extbuf_free, buf);
> >
> >      I think this is correct, because we have the guarantee that shinfo
> >      exists as long as the buffer exists.
> 
> What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
> free original mbuf which owns shinfo buffer.
I don't get your question.
I'm just saying that this code path looks correct, compared to
the previous one.
> 
> >
> >2/ in rte_gso_segment(), there is a loop like this:
> >
> >	while (pkt_seg) {
> >		rte_mbuf_refcnt_update(pkt_seg, -1);
> >		pkt_seg = pkt_seg->next;
> >	}
> >
> >   You change it to take in account the refcnt for ext mbufs.
> >
> >   I may have missed something but I wonder why it is not simply:
> >
> >	rte_pktmbuf_free(pkt_seg);
> >
> >   It will decrease the proper refcnt, and free the mbufs if they
> >   are not used.
> 
> Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
> mbuf free will trigger freeing original mbuf (only free_cb can do this).
rte_pktmbuf_free() will also decerase the refcnt, and free the resources
when the refcnt reaches 0.
It has some advantages compared to decrease the reference counter of all
segments:
- no need to iterate the segments, there is only one function call
- no need to have a special case for ext mbufs like you did in your patch
- it may be safer, in case some segments have a refcnt == 1, because
  resources will be freed.
> >Again, sorry if this is not the issue your are referring to, but
> >in this case I really think that having a simple example code that
> >shows the issue would help.
> 
> Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
> unit test can show it.
I don't see why. The PMDs and the libraries use the mbuf functions, why
a unit test couldn't call the same functions?
> Let me summarize it here again. For original mbuf, there are two cases freeing
> it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
> free_cb call which will free original large & extended mbuf.
OK
> Case two is PMD driver will call free against
> original mbuf, that also will call free_cb to free attached extended buffer.
OK
And what makes that case 1 or case 2 is executed?
> In case one free_cb must call
> rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
> must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
I think there is no need to change the free_cb API. It should work like
this:
- virtio creates the original external mbuf (orig_m)
- gso will create a new mbuf referencing the external buffer (new_m)
At this point, the shinfo has a refcnt of 2. The large buffer will be
freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
whatever the order.
Regards,
Olivier
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-08-03  8:11           ` Olivier Matz
@ 2020-08-03  9:42             ` yang_y_yi
  2020-08-03 12:34               ` Olivier Matz
  0 siblings, 1 reply; 21+ messages in thread
From: yang_y_yi @ 2020-08-03  9:42 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, jiayu.hu, thomas, yangyi01
At 2020-08-03 16:11:39, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
>> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >Hi,
>> >
>> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
>> >> 
>> >> 
>> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >Hi,
>> >> >
>> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
>> >> >> From: Yi Yang <yangyi01@inspur.com>
>> >> >> 
>> >> >> In GSO case, segmented mbufs are attached to original
>> >> >> mbuf which can't be freed when it is external. The issue
>> >> >> is free_cb doesn't know original mbuf and doesn't free
>> >> >> it when refcnt of shinfo is 0.
>> >> >> 
>> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
>> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
>> >> >> cases should have different behaviors. free_cb won't
>> >> >> explicitly call rte_pktmbuf_free to free original mbuf
>> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
>> >> >> has to call rte_pktmbuf_free to free original mbuf if it
>> >> >> is freed by rte_pktmbuf_free segmented mbufs.
>> >> >> 
>> >> >> In order to fix this issue, free_cb interface has to been
>> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
>> >> >> mbuf pointer to free_cb, argument opaque can be defined
>> >> >> as a custom struct by user, it can includes original mbuf
>> >> >> pointer, user-defined free_cb can compare caller mbuf with
>> >> >> mbuf in opaque struct, free_cb should free original mbuf
>> >> >> if they are not same, this corresponds to rte_pktmbuf_free
>> >> >> segmented mbufs case, otherwise, free_cb won't free original
>> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
>> >> >> to free it.
>> >> >> 
>> >> >> Here is pseduo code to show two kind of cases.
>> >> >> 
>> >> >> case 1. rte_pktmbuf_free segmented mbufs
>> >> >> 
>> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
>> >> >>                       &gso_ctx,
>> >> >>                       /* segmented mbuf */
>> >> >>                       (struct rte_mbuf **)&gso_mbufs,
>> >> >>                       MAX_GSO_MBUFS);
>> >> >
>> >> >I'm sorry but it is not very clear to me what operations are done by
>> >> >rte_gso_segment().
>> >> >
>> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
>> >> >which do not deal with external buffers. Am I missing something?
>> >> >
>> >> >Are you able to show the issue only with mbuf functions? It would
>> >> >be helpful to understand what does not work.
>> >> >
>> >> >
>> >> >Thanks,
>> >> >Olivier
>> >> >
>> >> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
>> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
>> >
>> >I'm sill not sure to get your issue. Please, if you have a simple test
>> >case using only mbufs functions (without virtio, gso, ...), it would be
>> >very helpful because we will be sure that we are talking about the same
>> >thing. In case there is an issue, it can easily become a unit test.
>> 
>> Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
>> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
>> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
>> will call rte_eth_tx_burst to send them finally.
>>
>> >
>> >That said, I looked at vhost mbuf allocation and gso segmentation, and
>> >I found some strange things:
>> >
>> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
>> >   ext mbuf.
>> >
>> >   a/ The first one stores the shinfo struct in the mbuf, basically
>> >      like this:
>> >
>> >	pkt = rte_pktmbuf_alloc(mp);
>> >	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
>> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >	shinfo->free_cb = virtio_dev_extbuf_free;
>> >	shinfo->fcb_opaque = buf;
>> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
>> >
>> >      I don't think it is a good idea, because there is no guarantee that
>> >      the mbuf won't be freed before the buffer. For instance, doing
>> >      this will probably fail:
>> >
>> >	pkt2 = rte_pktmbuf_alloc(mp);
>> >	rte_pktmbuf_attach(pkt2, pkt);
>> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
>> 
>> pkt is created by the application I can control, so I can control it where it will be freed, right?
>
>This example shows that mbufs allocated like this by the vhost
>driver are not constructed correctly. If an application attach a new
>packet (pkt2) to it and frees the original one (pkt), it may result in a
>memory corruption.
>
>Of course, to be tested and confirmed.
No, attach will increase refcnt of shinfo, free_cb only is called when  refcnt of shinfo is decreased to
0, isn't it?
>
>> 
>> >
>> >      To do this properly, the mbuf refcnt should be increased, and
>> >      the mbuf should be freed in the callback. But I don't think it's
>> >      worth doing it, given the second path (b/) looks good to me.
>> >
>> >   b/ The second path stores the shinfo struct at the end of the
>> >      allocated buffer, like this:
>> >
>> >	pkt = rte_pktmbuf_alloc(mp);
>> >	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
>> >	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
>> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
>> >					      virtio_dev_extbuf_free, buf);
>> >
>> >      I think this is correct, because we have the guarantee that shinfo
>> >      exists as long as the buffer exists.
>> 
>> What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
>> free original mbuf which owns shinfo buffer.
>
>I don't get your question.
>
>I'm just saying that this code path looks correct, compared to
>the previous one.
I think you're challenging principle of external mbuf, that isn't the thing I address.
>
>> 
>> >
>> >2/ in rte_gso_segment(), there is a loop like this:
>> >
>> >	while (pkt_seg) {
>> >		rte_mbuf_refcnt_update(pkt_seg, -1);
>> >		pkt_seg = pkt_seg->next;
>> >	}
>> >
>> >   You change it to take in account the refcnt for ext mbufs.
>> >
>> >   I may have missed something but I wonder why it is not simply:
>> >
>> >	rte_pktmbuf_free(pkt_seg);
>> >
>> >   It will decrease the proper refcnt, and free the mbufs if they
>> >   are not used.
>> 
>> Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
>> mbuf free will trigger freeing original mbuf (only free_cb can do this).
>
>rte_pktmbuf_free() will also decerase the refcnt, and free the resources
>when the refcnt reaches 0.
>
>It has some advantages compared to decrease the reference counter of all
>segments:
>
>- no need to iterate the segments, there is only one function call
>- no need to have a special case for ext mbufs like you did in your patch
>- it may be safer, in case some segments have a refcnt == 1, because
>  resources will be freed.
For external mbuf, attach only increases refcnt of shinfo, refcnt of mbuf won't be touched. For normal
mbuf, attach only increase refcnt of mbuf, no shinfo there, no refcnt of shinfo increased.
>
>> >Again, sorry if this is not the issue your are referring to, but
>> >in this case I really think that having a simple example code that
>> >shows the issue would help.
>> 
>> Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
>> unit test can show it.
>
>I don't see why. The PMDs and the libraries use the mbuf functions, why
>a unit test couldn't call the same functions?
>
>> Let me summarize it here again. For original mbuf, there are two cases freeing
>> it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
>> free_cb call which will free original large & extended mbuf.
>
>OK
>
>> Case two is PMD driver will call free against
>> original mbuf, that also will call free_cb to free attached extended buffer.
>
>OK
>
>And what makes that case 1 or case 2 is executed?
>
>> In case one free_cb must call
>> rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
>> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
>> must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
>
>I think there is no need to change the free_cb API. It should work like
>this:
>
>- virtio creates the original external mbuf (orig_m)
>- gso will create a new mbuf referencing the external buffer (new_m)
>
>At this point, the shinfo has a refcnt of 2. The large buffer will be
>freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
>whatever the order.
>
>Regards,
>Olivier
Oliver, the reason it works is I changed free_cb API, case 1 doesn't know orig_m, how you make it free orig_m in free_cb.
The intention I change free_cb is to let it know orig_m, I saw OVS DPDK ran out out buffers and orig_m isn't freed, that is why
I want to bring in this to fix the issue. Again, in case 1, nobody explicitly calls ret_pktmbuf_free(orig_m) except free_cb I changed.
free_cb must handle case 1 and case 2 in the same code, for case 1, caller_m is segmented new_m, for case 2, caller_m is orig_m.
loop in rte_gso_segement is handling original mbuf (this mbuf is multi-mbuf and includes multiple mbufs which are linked by next
pointer), it isn't a problem at all.
Please show me code how you can fix my issue if you don't change free_cb, thank you.
struct shinfo_arg {
       void *buf;
       struct rte_mbuf *mbuf;
};
virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
{
       struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
       rte_free(arg->buf);
       if (caller_m != arg->mbuf)
               rte_pktmbuf_free(arg->mbuf);
       rte_free(arg);
}
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-08-03  9:42             ` yang_y_yi
@ 2020-08-03 12:34               ` Olivier Matz
  2020-08-04  1:31                 ` yang_y_yi
  0 siblings, 1 reply; 21+ messages in thread
From: Olivier Matz @ 2020-08-03 12:34 UTC (permalink / raw)
  To: yang_y_yi; +Cc: dev, jiayu.hu, thomas, yangyi01
On Mon, Aug 03, 2020 at 05:42:13PM +0800, yang_y_yi wrote:
> At 2020-08-03 16:11:39, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
> >> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >Hi,
> >> >
> >> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
> >> >> 
> >> >> 
> >> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >> >Hi,
> >> >> >
> >> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
> >> >> >> From: Yi Yang <yangyi01@inspur.com>
> >> >> >> 
> >> >> >> In GSO case, segmented mbufs are attached to original
> >> >> >> mbuf which can't be freed when it is external. The issue
> >> >> >> is free_cb doesn't know original mbuf and doesn't free
> >> >> >> it when refcnt of shinfo is 0.
> >> >> >> 
> >> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
> >> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
> >> >> >> cases should have different behaviors. free_cb won't
> >> >> >> explicitly call rte_pktmbuf_free to free original mbuf
> >> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
> >> >> >> has to call rte_pktmbuf_free to free original mbuf if it
> >> >> >> is freed by rte_pktmbuf_free segmented mbufs.
> >> >> >> 
> >> >> >> In order to fix this issue, free_cb interface has to been
> >> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
> >> >> >> mbuf pointer to free_cb, argument opaque can be defined
> >> >> >> as a custom struct by user, it can includes original mbuf
> >> >> >> pointer, user-defined free_cb can compare caller mbuf with
> >> >> >> mbuf in opaque struct, free_cb should free original mbuf
> >> >> >> if they are not same, this corresponds to rte_pktmbuf_free
> >> >> >> segmented mbufs case, otherwise, free_cb won't free original
> >> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
> >> >> >> to free it.
> >> >> >> 
> >> >> >> Here is pseduo code to show two kind of cases.
> >> >> >> 
> >> >> >> case 1. rte_pktmbuf_free segmented mbufs
> >> >> >> 
> >> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
> >> >> >>                       &gso_ctx,
> >> >> >>                       /* segmented mbuf */
> >> >> >>                       (struct rte_mbuf **)&gso_mbufs,
> >> >> >>                       MAX_GSO_MBUFS);
> >> >> >
> >> >> >I'm sorry but it is not very clear to me what operations are done by
> >> >> >rte_gso_segment().
> >> >> >
> >> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
> >> >> >which do not deal with external buffers. Am I missing something?
> >> >> >
> >> >> >Are you able to show the issue only with mbuf functions? It would
> >> >> >be helpful to understand what does not work.
> >> >> >
> >> >> >
> >> >> >Thanks,
> >> >> >Olivier
> >> >> >
> >> >> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
> >> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
> >> >
> >> >I'm sill not sure to get your issue. Please, if you have a simple test
> >> >case using only mbufs functions (without virtio, gso, ...), it would be
> >> >very helpful because we will be sure that we are talking about the same
> >> >thing. In case there is an issue, it can easily become a unit test.
> >> 
> >> Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
> >> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
> >> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
> >> will call rte_eth_tx_burst to send them finally.
> >>
> >> >
> >> >That said, I looked at vhost mbuf allocation and gso segmentation, and
> >> >I found some strange things:
> >> >
> >> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
> >> >   ext mbuf.
> >> >
> >> >   a/ The first one stores the shinfo struct in the mbuf, basically
> >> >      like this:
> >> >
> >> >	pkt = rte_pktmbuf_alloc(mp);
> >> >	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >> >	shinfo->free_cb = virtio_dev_extbuf_free;
> >> >	shinfo->fcb_opaque = buf;
> >> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
> >> >
> >> >      I don't think it is a good idea, because there is no guarantee that
> >> >      the mbuf won't be freed before the buffer. For instance, doing
> >> >      this will probably fail:
> >> >
> >> >	pkt2 = rte_pktmbuf_alloc(mp);
> >> >	rte_pktmbuf_attach(pkt2, pkt);
> >> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
> >> 
> >> pkt is created by the application I can control, so I can control it where it will be freed, right?
> >
> >This example shows that mbufs allocated like this by the vhost
> >driver are not constructed correctly. If an application attach a new
> >packet (pkt2) to it and frees the original one (pkt), it may result in a
> >memory corruption.
> >
> >Of course, to be tested and confirmed.
> 
> No, attach will increase refcnt of shinfo, free_cb only is called when  refcnt of shinfo is decreased to
> 0, isn't it?
When pkt will be freed, it will decrement the shinfo refcnt, and
after it will be 1. So the buffer won't be freed. After that, the
mbuf pkt will be detached, and will return to the mbuf pool. It means
it can be reallocated, and the next user can overwrite shinfo which
is still stored in the mbuf data.
I did a test to show it, see:
http://git.droids-corp.org/?p=dpdk.git;a=commitdiff;h=a617494eeb01ff
If you run the mbuf autotest, it segfaults.
> 
> >
> >> 
> >> >
> >> >      To do this properly, the mbuf refcnt should be increased, and
> >> >      the mbuf should be freed in the callback. But I don't think it's
> >> >      worth doing it, given the second path (b/) looks good to me.
> >> >
> >> >   b/ The second path stores the shinfo struct at the end of the
> >> >      allocated buffer, like this:
> >> >
> >> >	pkt = rte_pktmbuf_alloc(mp);
> >> >	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
> >> >	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >> >	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> >> >					      virtio_dev_extbuf_free, buf);
> >> >
> >> >      I think this is correct, because we have the guarantee that shinfo
> >> >      exists as long as the buffer exists.
> >> 
> >> What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
> >> free original mbuf which owns shinfo buffer.
> >
> >I don't get your question.
> >
> >I'm just saying that this code path looks correct, compared to
> >the previous one.
> 
> I think you're challenging principle of external mbuf, that isn't the thing I address.
I'm not challenging anything, I'm saying there is a bug in this code,
and the unit test above tends to confirm it.
> 
> >
> >> 
> >> >
> >> >2/ in rte_gso_segment(), there is a loop like this:
> >> >
> >> >	while (pkt_seg) {
> >> >		rte_mbuf_refcnt_update(pkt_seg, -1);
> >> >		pkt_seg = pkt_seg->next;
> >> >	}
> >> >
> >> >   You change it to take in account the refcnt for ext mbufs.
> >> >
> >> >   I may have missed something but I wonder why it is not simply:
> >> >
> >> >	rte_pktmbuf_free(pkt_seg);
> >> >
> >> >   It will decrease the proper refcnt, and free the mbufs if they
> >> >   are not used.
> >> 
> >> Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
> >> mbuf free will trigger freeing original mbuf (only free_cb can do this).
> >
> >rte_pktmbuf_free() will also decerase the refcnt, and free the resources
> >when the refcnt reaches 0.
> >
> >It has some advantages compared to decrease the reference counter of all
> >segments:
> >
> >- no need to iterate the segments, there is only one function call
> >- no need to have a special case for ext mbufs like you did in your patch
> >- it may be safer, in case some segments have a refcnt == 1, because
> >  resources will be freed.
> 
> For external mbuf, attach only increases refcnt of shinfo, refcnt of mbuf won't be touched. For normal
> mbuf, attach only increase refcnt of mbuf, no shinfo there, no refcnt of shinfo increased.
I suppose rte_gso_segment() can take any mbuf type as input: standard
mbuf, indirect mbuf, ext mbuf, or even a mbuf chaing containing segments of
different types.
For instance, if you pass a chain of 2 mbufs:
- the first one is a direct mbuf containing the IP/TCP headers (orig_hdr)
- the second on is a mbuf pointing to an ext buffer (orig_payload)
I expect that the resulting mbuf after calling gso contains a list of mbufs
like this:
- a first segment containing the IP/TCP headers (new_hdr)
- a payload segment pointing on the same ext buffer
In theory, there is no reason that orig_hdr should be referenced by
another new mbuf, because it only contains headers (no data). If that's
the case, its refcnt is 1, and decreasing it to 0 without freeing it
is a bug.
Anyway, there is maybe no issue in that case, but I was just suggesting
that using rte_pktmbuf_free() is easier to read, and safer than manually
decreasing the refcnt of each segment.
> >> >Again, sorry if this is not the issue your are referring to, but
> >> >in this case I really think that having a simple example code that
> >> >shows the issue would help.
> >> 
> >> Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
> >> unit test can show it.
> >
> >I don't see why. The PMDs and the libraries use the mbuf functions, why
> >a unit test couldn't call the same functions?
> >
> >> Let me summarize it here again. For original mbuf, there are two cases freeing
> >> it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
> >> free_cb call which will free original large & extended mbuf.
> >
> >OK
> >
> >> Case two is PMD driver will call free against
> >> original mbuf, that also will call free_cb to free attached extended buffer.
> >
> >OK
> >
> >And what makes that case 1 or case 2 is executed?
> >
> >> In case one free_cb must call
> >> rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
> >> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
> >> must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
> >
> >I think there is no need to change the free_cb API. It should work like
> >this:
> >
> >- virtio creates the original external mbuf (orig_m)
> >- gso will create a new mbuf referencing the external buffer (new_m)
> >
> >At this point, the shinfo has a refcnt of 2. The large buffer will be
> >freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
> >whatever the order.
> >
> >Regards,
> >Olivier
> 
> Oliver, the reason it works is I changed free_cb API, case 1 doesn't know orig_m, how you make it free orig_m in free_cb.
> The intention I change free_cb is to let it know orig_m, I saw OVS DPDK ran out out buffers and orig_m isn't freed, that is why
> I want to bring in this to fix the issue. Again, in case 1, nobody explicitly calls ret_pktmbuf_free(orig_m) except free_cb I changed.
If nobody calls ret_pktmbuf_free(orig_m), it is a problem.
The free_cb is to free the buffer, not the mbuf.
To me, it should work like this:
1- virtio creates a mbuf attached to the ext buffer (the shinfo placement
   bug should be fixed)
2- gso create mbufs that reference the the same ext buf (by attaching the
   new mbuf)
3- gso must free the original mbuf
4- the PMD transmits the new mbufs, and frees them
Whatever 3- or 4- is executed first, at the end we are sure that:
- all mbufs will be returned to the pool
- the linear buffer will be freed when the refcnt reaches 0.
If this is still unclear, please, write a unit test like I did
above to show your issue.
Regards,
Olivier
> free_cb must handle case 1 and case 2 in the same code, for case 1, caller_m is segmented new_m, for case 2, caller_m is orig_m.
> 
> loop in rte_gso_segement is handling original mbuf (this mbuf is multi-mbuf and includes multiple mbufs which are linked by next
> pointer), it isn't a problem at all.
> 
> Please show me code how you can fix my issue if you don't change free_cb, thank you.
> 
> struct shinfo_arg {
>        void *buf;
>        struct rte_mbuf *mbuf;
> };
> 
> virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
> {
>        struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
> 
>        rte_free(arg->buf);
>        if (caller_m != arg->mbuf)
>                rte_pktmbuf_free(arg->mbuf);
>        rte_free(arg);
> }
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-08-03 12:34               ` Olivier Matz
@ 2020-08-04  1:31                 ` yang_y_yi
  2020-09-27  5:55                   ` yang_y_yi
  0 siblings, 1 reply; 21+ messages in thread
From: yang_y_yi @ 2020-08-04  1:31 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, jiayu.hu, thomas, yangyi01
At 2020-08-03 20:34:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>On Mon, Aug 03, 2020 at 05:42:13PM +0800, yang_y_yi wrote:
>> At 2020-08-03 16:11:39, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
>> >> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >Hi,
>> >> >
>> >> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
>> >> >> 
>> >> >> 
>> >> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >> >Hi,
>> >> >> >
>> >> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
>> >> >> >> From: Yi Yang <yangyi01@inspur.com>
>> >> >> >> 
>> >> >> >> In GSO case, segmented mbufs are attached to original
>> >> >> >> mbuf which can't be freed when it is external. The issue
>> >> >> >> is free_cb doesn't know original mbuf and doesn't free
>> >> >> >> it when refcnt of shinfo is 0.
>> >> >> >> 
>> >> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
>> >> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
>> >> >> >> cases should have different behaviors. free_cb won't
>> >> >> >> explicitly call rte_pktmbuf_free to free original mbuf
>> >> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
>> >> >> >> has to call rte_pktmbuf_free to free original mbuf if it
>> >> >> >> is freed by rte_pktmbuf_free segmented mbufs.
>> >> >> >> 
>> >> >> >> In order to fix this issue, free_cb interface has to been
>> >> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
>> >> >> >> mbuf pointer to free_cb, argument opaque can be defined
>> >> >> >> as a custom struct by user, it can includes original mbuf
>> >> >> >> pointer, user-defined free_cb can compare caller mbuf with
>> >> >> >> mbuf in opaque struct, free_cb should free original mbuf
>> >> >> >> if they are not same, this corresponds to rte_pktmbuf_free
>> >> >> >> segmented mbufs case, otherwise, free_cb won't free original
>> >> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
>> >> >> >> to free it.
>> >> >> >> 
>> >> >> >> Here is pseduo code to show two kind of cases.
>> >> >> >> 
>> >> >> >> case 1. rte_pktmbuf_free segmented mbufs
>> >> >> >> 
>> >> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
>> >> >> >>                       &gso_ctx,
>> >> >> >>                       /* segmented mbuf */
>> >> >> >>                       (struct rte_mbuf **)&gso_mbufs,
>> >> >> >>                       MAX_GSO_MBUFS);
>> >> >> >
>> >> >> >I'm sorry but it is not very clear to me what operations are done by
>> >> >> >rte_gso_segment().
>> >> >> >
>> >> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
>> >> >> >which do not deal with external buffers. Am I missing something?
>> >> >> >
>> >> >> >Are you able to show the issue only with mbuf functions? It would
>> >> >> >be helpful to understand what does not work.
>> >> >> >
>> >> >> >
>> >> >> >Thanks,
>> >> >> >Olivier
>> >> >> >
>> >> >> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
>> >> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
>> >> >
>> >> >I'm sill not sure to get your issue. Please, if you have a simple test
>> >> >case using only mbufs functions (without virtio, gso, ...), it would be
>> >> >very helpful because we will be sure that we are talking about the same
>> >> >thing. In case there is an issue, it can easily become a unit test.
>> >> 
>> >> Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
>> >> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
>> >> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
>> >> will call rte_eth_tx_burst to send them finally.
>> >>
>> >> >
>> >> >That said, I looked at vhost mbuf allocation and gso segmentation, and
>> >> >I found some strange things:
>> >> >
>> >> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
>> >> >   ext mbuf.
>> >> >
>> >> >   a/ The first one stores the shinfo struct in the mbuf, basically
>> >> >      like this:
>> >> >
>> >> >	pkt = rte_pktmbuf_alloc(mp);
>> >> >	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
>> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> >	shinfo->free_cb = virtio_dev_extbuf_free;
>> >> >	shinfo->fcb_opaque = buf;
>> >> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
>> >> >
>> >> >      I don't think it is a good idea, because there is no guarantee that
>> >> >      the mbuf won't be freed before the buffer. For instance, doing
>> >> >      this will probably fail:
>> >> >
>> >> >	pkt2 = rte_pktmbuf_alloc(mp);
>> >> >	rte_pktmbuf_attach(pkt2, pkt);
>> >> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
>> >> 
>> >> pkt is created by the application I can control, so I can control it where it will be freed, right?
>> >
>> >This example shows that mbufs allocated like this by the vhost
>> >driver are not constructed correctly. If an application attach a new
>> >packet (pkt2) to it and frees the original one (pkt), it may result in a
>> >memory corruption.
>> >
>> >Of course, to be tested and confirmed.
>> 
>> No, attach will increase refcnt of shinfo, free_cb only is called when  refcnt of shinfo is decreased to
>> 0, isn't it?
>
>When pkt will be freed, it will decrement the shinfo refcnt, and
>after it will be 1. So the buffer won't be freed. After that, the
>mbuf pkt will be detached, and will return to the mbuf pool. It means
>it can be reallocated, and the next user can overwrite shinfo which
>is still stored in the mbuf data.
I think this is an issue of DPDK itself, if external buffer in shinfo is freed, shinfo should be set to NULL, if user will
overwrite it, he/she should use the same way as a new external buffer is attached. 
>
>I did a test to show it, see:
>http://git.droids-corp.org/?p=dpdk.git;a=commitdiff;h=a617494eeb01ff
>
>If you run the mbuf autotest, it segfaults.
I think your test is wrong, you're changing shinfo (which is being used) in wrong way, if free_bc is NULL, it will be invalid.
static inline void
rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
        rte_iova_t buf_iova, uint16_t buf_len,
        struct rte_mbuf_ext_shared_info *shinfo)
{
        /* mbuf should not be read-only */
        RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
        RTE_ASSERT(shinfo->free_cb != NULL);
For any shinfo operation, you should do it by rte_pktmbuf_attach_extbuf, you can't change it at will after that.
>
>
>> 
>> >
>> >> 
>> >> >
>> >> >      To do this properly, the mbuf refcnt should be increased, and
>> >> >      the mbuf should be freed in the callback. But I don't think it's
>> >> >      worth doing it, given the second path (b/) looks good to me.
>> >> >
>> >> >   b/ The second path stores the shinfo struct at the end of the
>> >> >      allocated buffer, like this:
>> >> >
>> >> >	pkt = rte_pktmbuf_alloc(mp);
>> >> >	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
>> >> >	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
>> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> >	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
>> >> >					      virtio_dev_extbuf_free, buf);
>> >> >
>> >> >      I think this is correct, because we have the guarantee that shinfo
>> >> >      exists as long as the buffer exists.
>> >> 
>> >> What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
>> >> free original mbuf which owns shinfo buffer.
>> >
>> >I don't get your question.
>> >
>> >I'm just saying that this code path looks correct, compared to
>> >the previous one.
>> 
>> I think you're challenging principle of external mbuf, that isn't the thing I address.
>
>I'm not challenging anything, I'm saying there is a bug in this code,
>and the unit test above tends to confirm it.
If it is bug, you can post a patch to fix it,  it isn't related with my patches. But in my opinion, you don't
use it in correct way, I don't think it is a bug.
>
>
>> 
>> >
>> >> 
>> >> >
>> >> >2/ in rte_gso_segment(), there is a loop like this:
>> >> >
>> >> >	while (pkt_seg) {
>> >> >		rte_mbuf_refcnt_update(pkt_seg, -1);
>> >> >		pkt_seg = pkt_seg->next;
>> >> >	}
>> >> >
>> >> >   You change it to take in account the refcnt for ext mbufs.
>> >> >
>> >> >   I may have missed something but I wonder why it is not simply:
>> >> >
>> >> >	rte_pktmbuf_free(pkt_seg);
>> >> >
>> >> >   It will decrease the proper refcnt, and free the mbufs if they
>> >> >   are not used.
>> >> 
>> >> Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
>> >> mbuf free will trigger freeing original mbuf (only free_cb can do this).
>> >
>> >rte_pktmbuf_free() will also decerase the refcnt, and free the resources
>> >when the refcnt reaches 0.
>> >
>> >It has some advantages compared to decrease the reference counter of all
>> >segments:
>> >
>> >- no need to iterate the segments, there is only one function call
>> >- no need to have a special case for ext mbufs like you did in your patch
>> >- it may be safer, in case some segments have a refcnt == 1, because
>> >  resources will be freed.
>> 
>> For external mbuf, attach only increases refcnt of shinfo, refcnt of mbuf won't be touched. For normal
>> mbuf, attach only increase refcnt of mbuf, no shinfo there, no refcnt of shinfo increased.
>
>I suppose rte_gso_segment() can take any mbuf type as input: standard
>mbuf, indirect mbuf, ext mbuf, or even a mbuf chaing containing segments of
>different types.
>
>For instance, if you pass a chain of 2 mbufs:
>- the first one is a direct mbuf containing the IP/TCP headers (orig_hdr)
>- the second on is a mbuf pointing to an ext buffer (orig_payload)
>
>I expect that the resulting mbuf after calling gso contains a list of mbufs
>like this:
>- a first segment containing the IP/TCP headers (new_hdr)
>- a payload segment pointing on the same ext buffer
>
>In theory, there is no reason that orig_hdr should be referenced by
>another new mbuf, because it only contains headers (no data). If that's
>the case, its refcnt is 1, and decreasing it to 0 without freeing it
>is a bug.
For this user scenario, orig_m is owner of external buffer, small segmented mbufs reference
it, you shouldn't free orig_m before all attached segmented mbufs are freed, isn't it?
>
>Anyway, there is maybe no issue in that case, but I was just suggesting
>that using rte_pktmbuf_free() is easier to read, and safer than manually
>decreasing the refcnt of each segment.
>
>
>> >> >Again, sorry if this is not the issue your are referring to, but
>> >> >in this case I really think that having a simple example code that
>> >> >shows the issue would help.
>> >> 
>> >> Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
>> >> unit test can show it.
>> >
>> >I don't see why. The PMDs and the libraries use the mbuf functions, why
>> >a unit test couldn't call the same functions?
>> >
>> >> Let me summarize it here again. For original mbuf, there are two cases freeing
>> >> it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
>> >> free_cb call which will free original large & extended mbuf.
>> >
>> >OK
>> >
>> >> Case two is PMD driver will call free against
>> >> original mbuf, that also will call free_cb to free attached extended buffer.
>> >
>> >OK
>> >
>> >And what makes that case 1 or case 2 is executed?
>> >
>> >> In case one free_cb must call
>> >> rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
>> >> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
>> >> must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
>> >
>> >I think there is no need to change the free_cb API. It should work like
>> >this:
>> >
>> >- virtio creates the original external mbuf (orig_m)
>> >- gso will create a new mbuf referencing the external buffer (new_m)
>> >
>> >At this point, the shinfo has a refcnt of 2. The large buffer will be
>> >freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
>> >whatever the order.
>> >
>> >Regards,
>> >Olivier
>> 
>> Oliver, the reason it works is I changed free_cb API, case 1 doesn't know orig_m, how you make it free orig_m in free_cb.
>> The intention I change free_cb is to let it know orig_m, I saw OVS DPDK ran out out buffers and orig_m isn't freed, that is why
>> I want to bring in this to fix the issue. Again, in case 1, nobody explicitly calls ret_pktmbuf_free(orig_m) except free_cb I changed.
>
>If nobody calls ret_pktmbuf_free(orig_m), it is a problem.
>The free_cb is to free the buffer, not the mbuf.
>
>To me, it should work like this:
>
>1- virtio creates a mbuf attached to the ext buffer (the shinfo placement
>   bug should be fixed)
>2- gso create mbufs that reference the the same ext buf (by attaching the
>   new mbuf)
>3- gso must free the original mbuf
This is impossible, segmented mbufs are referencing external buffer in original mbuf,
how do you free it? As I said rte_gso_segment has no way to free it, please tell me a way if
you know how to do this.
>4- the PMD transmits the new mbufs, and frees them
>
>Whatever 3- or 4- is executed first, at the end we are sure that:
>- all mbufs will be returned to the pool
>- the linear buffer will be freed when the refcnt reaches 0.
>
>If this is still unclear, please, write a unit test like I did
>above to show your issue.
>
>Regards,
>Olivier
>
The issue is in "3- gso must free the original mbuf", rte_pktmbuf_free(segmented_mbus) can't do it,
rte_gso_segment is impossible to do it, only feasible point is free_cb, please let me know if you have
a better way to free original mbuf and don't impact on segmented mbufs in PMD. My point is you must
have a place to call rte_pktmbuf_free(rogin_m) explicitly, otherwise it is impossible  to return it to memory
pool, please point out  where it can be called in my user scenario. I don't care how it is done, I just care it can
fix my issue, please don't hesitate and send me a patch if you can, thanks a lot.
>
>
>> free_cb must handle case 1 and case 2 in the same code, for case 1, caller_m is segmented new_m, for case 2, caller_m is orig_m.
>> 
>> loop in rte_gso_segement is handling original mbuf (this mbuf is multi-mbuf and includes multiple mbufs which are linked by next
>> pointer), it isn't a problem at all.
>> 
>> Please show me code how you can fix my issue if you don't change free_cb, thank you.
>> 
>> struct shinfo_arg {
>>        void *buf;
>>        struct rte_mbuf *mbuf;
>> };
>> 
>> virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
>> {
>>        struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
>> 
>>        rte_free(arg->buf);
>>        if (caller_m != arg->mbuf)
>>                rte_pktmbuf_free(arg->mbuf);
>>        rte_free(arg);
>> }
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-08-04  1:31                 ` yang_y_yi
@ 2020-09-27  5:55                   ` yang_y_yi
  2020-10-07  9:48                     ` Olivier Matz
  0 siblings, 1 reply; 21+ messages in thread
From: yang_y_yi @ 2020-09-27  5:55 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, jiayu.hu, thomas, yangyi01
Hi, folks
Per GSO requirement, this is a must-have change, Jiayu, can you help review this series?
Olivier, if you used the old interface, maybe you need to change your code to adapt this, I don't think we can have a better way to handle GSO case.
At 2020-08-04 09:31:19, "yang_y_yi" <yang_y_yi@163.com> wrote:
At 2020-08-03 20:34:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>On Mon, Aug 03, 2020 at 05:42:13PM +0800, yang_y_yi wrote:
>> At 2020-08-03 16:11:39, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
>> >> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >Hi,
>> >> >
>> >> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
>> >> >> 
>> >> >> 
>> >> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >> >Hi,
>> >> >> >
>> >> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
>> >> >> >> From: Yi Yang <yangyi01@inspur.com>
>> >> >> >> 
>> >> >> >> In GSO case, segmented mbufs are attached to original
>> >> >> >> mbuf which can't be freed when it is external. The issue
>> >> >> >> is free_cb doesn't know original mbuf and doesn't free
>> >> >> >> it when refcnt of shinfo is 0.
>> >> >> >> 
>> >> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
>> >> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
>> >> >> >> cases should have different behaviors. free_cb won't
>> >> >> >> explicitly call rte_pktmbuf_free to free original mbuf
>> >> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
>> >> >> >> has to call rte_pktmbuf_free to free original mbuf if it
>> >> >> >> is freed by rte_pktmbuf_free segmented mbufs.
>> >> >> >> 
>> >> >> >> In order to fix this issue, free_cb interface has to been
>> >> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
>> >> >> >> mbuf pointer to free_cb, argument opaque can be defined
>> >> >> >> as a custom struct by user, it can includes original mbuf
>> >> >> >> pointer, user-defined free_cb can compare caller mbuf with
>> >> >> >> mbuf in opaque struct, free_cb should free original mbuf
>> >> >> >> if they are not same, this corresponds to rte_pktmbuf_free
>> >> >> >> segmented mbufs case, otherwise, free_cb won't free original
>> >> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
>> >> >> >> to free it.
>> >> >> >> 
>> >> >> >> Here is pseduo code to show two kind of cases.
>> >> >> >> 
>> >> >> >> case 1. rte_pktmbuf_free segmented mbufs
>> >> >> >> 
>> >> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
>> >> >> >>                       &gso_ctx,
>> >> >> >>                       /* segmented mbuf */
>> >> >> >>                       (struct rte_mbuf **)&gso_mbufs,
>> >> >> >>                       MAX_GSO_MBUFS);
>> >> >> >
>> >> >> >I'm sorry but it is not very clear to me what operations are done by
>> >> >> >rte_gso_segment().
>> >> >> >
>> >> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
>> >> >> >which do not deal with external buffers. Am I missing something?
>> >> >> >
>> >> >> >Are you able to show the issue only with mbuf functions? It would
>> >> >> >be helpful to understand what does not work.
>> >> >> >
>> >> >> >
>> >> >> >Thanks,
>> >> >> >Olivier
>> >> >> >
>> >> >> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
>> >> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
>> >> >
>> >> >I'm sill not sure to get your issue. Please, if you have a simple test
>> >> >case using only mbufs functions (without virtio, gso, ...), it would be
>> >> >very helpful because we will be sure that we are talking about the same
>> >> >thing. In case there is an issue, it can easily become a unit test.
>> >> 
>> >> Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
>> >> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
>> >> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
>> >> will call rte_eth_tx_burst to send them finally.
>> >>
>> >> >
>> >> >That said, I looked at vhost mbuf allocation and gso segmentation, and
>> >> >I found some strange things:
>> >> >
>> >> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
>> >> >   ext mbuf.
>> >> >
>> >> >   a/ The first one stores the shinfo struct in the mbuf, basically
>> >> >      like this:
>> >> >
>> >> >	pkt = rte_pktmbuf_alloc(mp);
>> >> >	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
>> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> >	shinfo->free_cb = virtio_dev_extbuf_free;
>> >> >	shinfo->fcb_opaque = buf;
>> >> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
>> >> >
>> >> >      I don't think it is a good idea, because there is no guarantee that
>> >> >      the mbuf won't be freed before the buffer. For instance, doing
>> >> >      this will probably fail:
>> >> >
>> >> >	pkt2 = rte_pktmbuf_alloc(mp);
>> >> >	rte_pktmbuf_attach(pkt2, pkt);
>> >> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
>> >> 
>> >> pkt is created by the application I can control, so I can control it where it will be freed, right?
>> >
>> >This example shows that mbufs allocated like this by the vhost
>> >driver are not constructed correctly. If an application attach a new
>> >packet (pkt2) to it and frees the original one (pkt), it may result in a
>> >memory corruption.
>> >
>> >Of course, to be tested and confirmed.
>> 
>> No, attach will increase refcnt of shinfo, free_cb only is called when  refcnt of shinfo is decreased to
>> 0, isn't it?
>
>When pkt will be freed, it will decrement the shinfo refcnt, and
>after it will be 1. So the buffer won't be freed. After that, the
>mbuf pkt will be detached, and will return to the mbuf pool. It means
>it can be reallocated, and the next user can overwrite shinfo which
>is still stored in the mbuf data.
I think this is an issue of DPDK itself, if external buffer in shinfo is freed, shinfo should be set to NULL, if user will
overwrite it, he/she should use the same way as a new external buffer is attached. 
>
>I did a test to show it, see:
>http://git.droids-corp.org/?p=dpdk.git;a=commitdiff;h=a617494eeb01ff
>
>If you run the mbuf autotest, it segfaults.
I think your test is wrong, you're changing shinfo (which is being used) in wrong way, if free_bc is NULL, it will be invalid.
static inline void
rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
        rte_iova_t buf_iova, uint16_t buf_len,
        struct rte_mbuf_ext_shared_info *shinfo)
{
        /* mbuf should not be read-only */
        RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
        RTE_ASSERT(shinfo->free_cb != NULL);
For any shinfo operation, you should do it by rte_pktmbuf_attach_extbuf, you can't change it at will after that.
>
>
>> 
>> >
>> >> 
>> >> >
>> >> >      To do this properly, the mbuf refcnt should be increased, and
>> >> >      the mbuf should be freed in the callback. But I don't think it's
>> >> >      worth doing it, given the second path (b/) looks good to me.
>> >> >
>> >> >   b/ The second path stores the shinfo struct at the end of the
>> >> >      allocated buffer, like this:
>> >> >
>> >> >	pkt = rte_pktmbuf_alloc(mp);
>> >> >	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
>> >> >	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
>> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> >	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
>> >> >					      virtio_dev_extbuf_free, buf);
>> >> >
>> >> >      I think this is correct, because we have the guarantee that shinfo
>> >> >      exists as long as the buffer exists.
>> >> 
>> >> What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
>> >> free original mbuf which owns shinfo buffer.
>> >
>> >I don't get your question.
>> >
>> >I'm just saying that this code path looks correct, compared to
>> >the previous one.
>> 
>> I think you're challenging principle of external mbuf, that isn't the thing I address.
>
>I'm not challenging anything, I'm saying there is a bug in this code,
>and the unit test above tends to confirm it.
If it is bug, you can post a patch to fix it,  it isn't related with my patches. But in my opinion, you don't
use it in correct way, I don't think it is a bug.
>
>
>> 
>> >
>> >> 
>> >> >
>> >> >2/ in rte_gso_segment(), there is a loop like this:
>> >> >
>> >> >	while (pkt_seg) {
>> >> >		rte_mbuf_refcnt_update(pkt_seg, -1);
>> >> >		pkt_seg = pkt_seg->next;
>> >> >	}
>> >> >
>> >> >   You change it to take in account the refcnt for ext mbufs.
>> >> >
>> >> >   I may have missed something but I wonder why it is not simply:
>> >> >
>> >> >	rte_pktmbuf_free(pkt_seg);
>> >> >
>> >> >   It will decrease the proper refcnt, and free the mbufs if they
>> >> >   are not used.
>> >> 
>> >> Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
>> >> mbuf free will trigger freeing original mbuf (only free_cb can do this).
>> >
>> >rte_pktmbuf_free() will also decerase the refcnt, and free the resources
>> >when the refcnt reaches 0.
>> >
>> >It has some advantages compared to decrease the reference counter of all
>> >segments:
>> >
>> >- no need to iterate the segments, there is only one function call
>> >- no need to have a special case for ext mbufs like you did in your patch
>> >- it may be safer, in case some segments have a refcnt == 1, because
>> >  resources will be freed.
>> 
>> For external mbuf, attach only increases refcnt of shinfo, refcnt of mbuf won't be touched. For normal
>> mbuf, attach only increase refcnt of mbuf, no shinfo there, no refcnt of shinfo increased.
>
>I suppose rte_gso_segment() can take any mbuf type as input: standard
>mbuf, indirect mbuf, ext mbuf, or even a mbuf chaing containing segments of
>different types.
>
>For instance, if you pass a chain of 2 mbufs:
>- the first one is a direct mbuf containing the IP/TCP headers (orig_hdr)
>- the second on is a mbuf pointing to an ext buffer (orig_payload)
>
>I expect that the resulting mbuf after calling gso contains a list of mbufs
>like this:
>- a first segment containing the IP/TCP headers (new_hdr)
>- a payload segment pointing on the same ext buffer
>
>In theory, there is no reason that orig_hdr should be referenced by
>another new mbuf, because it only contains headers (no data). If that's
>the case, its refcnt is 1, and decreasing it to 0 without freeing it
>is a bug.
For this user scenario, orig_m is owner of external buffer, small segmented mbufs reference
it, you shouldn't free orig_m before all attached segmented mbufs are freed, isn't it?
>
>Anyway, there is maybe no issue in that case, but I was just suggesting
>that using rte_pktmbuf_free() is easier to read, and safer than manually
>decreasing the refcnt of each segment.
>
>
>> >> >Again, sorry if this is not the issue your are referring to, but
>> >> >in this case I really think that having a simple example code that
>> >> >shows the issue would help.
>> >> 
>> >> Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
>> >> unit test can show it.
>> >
>> >I don't see why. The PMDs and the libraries use the mbuf functions, why
>> >a unit test couldn't call the same functions?
>> >
>> >> Let me summarize it here again. For original mbuf, there are two cases freeing
>> >> it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
>> >> free_cb call which will free original large & extended mbuf.
>> >
>> >OK
>> >
>> >> Case two is PMD driver will call free against
>> >> original mbuf, that also will call free_cb to free attached extended buffer.
>> >
>> >OK
>> >
>> >And what makes that case 1 or case 2 is executed?
>> >
>> >> In case one free_cb must call
>> >> rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
>> >> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
>> >> must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
>> >
>> >I think there is no need to change the free_cb API. It should work like
>> >this:
>> >
>> >- virtio creates the original external mbuf (orig_m)
>> >- gso will create a new mbuf referencing the external buffer (new_m)
>> >
>> >At this point, the shinfo has a refcnt of 2. The large buffer will be
>> >freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
>> >whatever the order.
>> >
>> >Regards,
>> >Olivier
>> 
>> Oliver, the reason it works is I changed free_cb API, case 1 doesn't know orig_m, how you make it free orig_m in free_cb.
>> The intention I change free_cb is to let it know orig_m, I saw OVS DPDK ran out out buffers and orig_m isn't freed, that is why
>> I want to bring in this to fix the issue. Again, in case 1, nobody explicitly calls ret_pktmbuf_free(orig_m) except free_cb I changed.
>
>If nobody calls ret_pktmbuf_free(orig_m), it is a problem.
>The free_cb is to free the buffer, not the mbuf.
>
>To me, it should work like this:
>
>1- virtio creates a mbuf attached to the ext buffer (the shinfo placement
>   bug should be fixed)
>2- gso create mbufs that reference the the same ext buf (by attaching the
>   new mbuf)
>3- gso must free the original mbuf
This is impossible, segmented mbufs are referencing external buffer in original mbuf,
how do you free it? As I said rte_gso_segment has no way to free it, please tell me a way if
you know how to do this.
>4- the PMD transmits the new mbufs, and frees them
>
>Whatever 3- or 4- is executed first, at the end we are sure that:
>- all mbufs will be returned to the pool
>- the linear buffer will be freed when the refcnt reaches 0.
>
>If this is still unclear, please, write a unit test like I did
>above to show your issue.
>
>Regards,
>Olivier
>
The issue is in "3- gso must free the original mbuf", rte_pktmbuf_free(segmented_mbus) can't do it,
rte_gso_segment is impossible to do it, only feasible point is free_cb, please let me know if you have
a better way to free original mbuf and don't impact on segmented mbufs in PMD. My point is you must
have a place to call rte_pktmbuf_free(rogin_m) explicitly, otherwise it is impossible  to return it to memory
pool, please point out  where it can be called in my user scenario. I don't care how it is done, I just care it can
fix my issue, please don't hesitate and send me a patch if you can, thanks a lot.
>
>
>> free_cb must handle case 1 and case 2 in the same code, for case 1, caller_m is segmented new_m, for case 2, caller_m is orig_m.
>> 
>> loop in rte_gso_segement is handling original mbuf (this mbuf is multi-mbuf and includes multiple mbufs which are linked by next
>> pointer), it isn't a problem at all.
>> 
>> Please show me code how you can fix my issue if you don't change free_cb, thank you.
>> 
>> struct shinfo_arg {
>>        void *buf;
>>        struct rte_mbuf *mbuf;
>> };
>> 
>> virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
>> {
>>        struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
>> 
>>        rte_free(arg->buf);
>>        if (caller_m != arg->mbuf)
>>                rte_pktmbuf_free(arg->mbuf);
>>        rte_free(arg);
>> }
 
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-09-27  5:55                   ` yang_y_yi
@ 2020-10-07  9:48                     ` Olivier Matz
  2020-10-09  9:51                       ` yang_y_yi
  0 siblings, 1 reply; 21+ messages in thread
From: Olivier Matz @ 2020-10-07  9:48 UTC (permalink / raw)
  To: yang_y_yi; +Cc: dev, jiayu.hu, thomas, yangyi01
Hi,
On Sun, Sep 27, 2020 at 01:55:21PM +0800, yang_y_yi wrote:
> Per GSO requirement, this is a must-have change, Jiayu, can you help review
> this series?
I can't ack this patch until I have a simple and clear test case (only
with mbuf functions, without GSO or vhost) showing the issue we have
today with current.
> Olivier, if you used the old interface, maybe you need to change your code to
> adapt this, I don't think we can have a better way to handle GSO case.
Sorry, I don't get your point. What do I need to change in which code?
(some more comments below)
> At 2020-08-04 09:31:19, "yang_y_yi" <yang_y_yi@163.com> wrote:
> 
> At 2020-08-03 20:34:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >On Mon, Aug 03, 2020 at 05:42:13PM +0800, yang_y_yi wrote:
> >> At 2020-08-03 16:11:39, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
> >> >> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >> >Hi,
> >> >> >
> >> >> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
> >> >> >> 
> >> >> >> 
> >> >> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >> >> >Hi,
> >> >> >> >
> >> >> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
> >> >> >> >> From: Yi Yang <yangyi01@inspur.com>
> >> >> >> >> 
> >> >> >> >> In GSO case, segmented mbufs are attached to original
> >> >> >> >> mbuf which can't be freed when it is external. The issue
> >> >> >> >> is free_cb doesn't know original mbuf and doesn't free
> >> >> >> >> it when refcnt of shinfo is 0.
> >> >> >> >> 
> >> >> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
> >> >> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
> >> >> >> >> cases should have different behaviors. free_cb won't
> >> >> >> >> explicitly call rte_pktmbuf_free to free original mbuf
> >> >> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
> >> >> >> >> has to call rte_pktmbuf_free to free original mbuf if it
> >> >> >> >> is freed by rte_pktmbuf_free segmented mbufs.
> >> >> >> >> 
> >> >> >> >> In order to fix this issue, free_cb interface has to been
> >> >> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
> >> >> >> >> mbuf pointer to free_cb, argument opaque can be defined
> >> >> >> >> as a custom struct by user, it can includes original mbuf
> >> >> >> >> pointer, user-defined free_cb can compare caller mbuf with
> >> >> >> >> mbuf in opaque struct, free_cb should free original mbuf
> >> >> >> >> if they are not same, this corresponds to rte_pktmbuf_free
> >> >> >> >> segmented mbufs case, otherwise, free_cb won't free original
> >> >> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
> >> >> >> >> to free it.
> >> >> >> >> 
> >> >> >> >> Here is pseduo code to show two kind of cases.
> >> >> >> >> 
> >> >> >> >> case 1. rte_pktmbuf_free segmented mbufs
> >> >> >> >> 
> >> >> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
> >> >> >> >>                       &gso_ctx,
> >> >> >> >>                       /* segmented mbuf */
> >> >> >> >>                       (struct rte_mbuf **)&gso_mbufs,
> >> >> >> >>                       MAX_GSO_MBUFS);
> >> >> >> >
> >> >> >> >I'm sorry but it is not very clear to me what operations are done by
> >> >> >> >rte_gso_segment().
> >> >> >> >
> >> >> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
> >> >> >> >which do not deal with external buffers. Am I missing something?
> >> >> >> >
> >> >> >> >Are you able to show the issue only with mbuf functions? It would
> >> >> >> >be helpful to understand what does not work.
> >> >> >> >
> >> >> >> >
> >> >> >> >Thanks,
> >> >> >> >Olivier
> >> >> >> >
> >> >> >> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
> >> >> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
> >> >> >
> >> >> >I'm sill not sure to get your issue. Please, if you have a simple test
> >> >> >case using only mbufs functions (without virtio, gso, ...), it would be
> >> >> >very helpful because we will be sure that we are talking about the same
> >> >> >thing. In case there is an issue, it can easily become a unit test.
> >> >> 
> >> >> Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
> >> >> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
> >> >> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
> >> >> will call rte_eth_tx_burst to send them finally.
> >> >>
> >> >> >
> >> >> >That said, I looked at vhost mbuf allocation and gso segmentation, and
> >> >> >I found some strange things:
> >> >> >
> >> >> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
> >> >> >   ext mbuf.
> >> >> >
> >> >> >   a/ The first one stores the shinfo struct in the mbuf, basically
> >> >> >      like this:
> >> >> >
> >> >> >	pkt = rte_pktmbuf_alloc(mp);
> >> >> >	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >> >> >	shinfo->free_cb = virtio_dev_extbuf_free;
> >> >> >	shinfo->fcb_opaque = buf;
> >> >> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
> >> >> >
> >> >> >      I don't think it is a good idea, because there is no guarantee that
> >> >> >      the mbuf won't be freed before the buffer. For instance, doing
> >> >> >      this will probably fail:
> >> >> >
> >> >> >	pkt2 = rte_pktmbuf_alloc(mp);
> >> >> >	rte_pktmbuf_attach(pkt2, pkt);
> >> >> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
> >> >> 
> >> >> pkt is created by the application I can control, so I can control it where it will be freed, right?
> >> >
> >> >This example shows that mbufs allocated like this by the vhost
> >> >driver are not constructed correctly. If an application attach a new
> >> >packet (pkt2) to it and frees the original one (pkt), it may result in a
> >> >memory corruption.
> >> >
> >> >Of course, to be tested and confirmed.
> >> 
> >> No, attach will increase refcnt of shinfo, free_cb only is called when  refcnt of shinfo is decreased to
> >> 0, isn't it?
> >
> >When pkt will be freed, it will decrement the shinfo refcnt, and
> >after it will be 1. So the buffer won't be freed. After that, the
> >mbuf pkt will be detached, and will return to the mbuf pool. It means
> >it can be reallocated, and the next user can overwrite shinfo which
> >is still stored in the mbuf data.
> 
> I think this is an issue of DPDK itself, if external buffer in shinfo is freed, shinfo should be set to NULL, if user will
> overwrite it, he/she should use the same way as a new external buffer is attached. 
No, there is no issue in DPDK. The lifetime of shinfo should be at least
the same the lifetime of the buffer. If shinfo is stored in initial mbuf
(called "pkt" in the example above), the mbuf and shinfo can be freed while
the buffer is still in use by another packet.
> >I did a test to show it, see:
> >http://git.droids-corp.org/?p=dpdk.git;a=commitdiff;h=a617494eeb01ff
> >
> >If you run the mbuf autotest, it segfaults.
> 
> I think your test is wrong, you're changing shinfo (which is being used) in wrong way, if free_bc is NULL, it will be invalid.
I'm changing the data of a newly allocated mbuf, it is perfectly legal.
I happens that it points the the shinfo that is still in use.
> 
> static inline void
> rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
>         rte_iova_t buf_iova, uint16_t buf_len,
>         struct rte_mbuf_ext_shared_info *shinfo)
> {
>         /* mbuf should not be read-only */
>         RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
>         RTE_ASSERT(shinfo->free_cb != NULL);
> 
> For any shinfo operation, you should do it by rte_pktmbuf_attach_extbuf, you can't change it at will after that.
> 
> >
> >
> >> 
> >> >
> >> >> 
> >> >> >
> >> >> >      To do this properly, the mbuf refcnt should be increased, and
> >> >> >      the mbuf should be freed in the callback. But I don't think it's
> >> >> >      worth doing it, given the second path (b/) looks good to me.
> >> >> >
> >> >> >   b/ The second path stores the shinfo struct at the end of the
> >> >> >      allocated buffer, like this:
> >> >> >
> >> >> >	pkt = rte_pktmbuf_alloc(mp);
> >> >> >	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
> >> >> >	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >> >> >	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> >> >> >					      virtio_dev_extbuf_free, buf);
> >> >> >
> >> >> >      I think this is correct, because we have the guarantee that shinfo
> >> >> >      exists as long as the buffer exists.
> >> >> 
> >> >> What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
> >> >> free original mbuf which owns shinfo buffer.
> >> >
> >> >I don't get your question.
> >> >
> >> >I'm just saying that this code path looks correct, compared to
> >> >the previous one.
> >> 
> >> I think you're challenging principle of external mbuf, that isn't the thing I address.
> >
> >I'm not challenging anything, I'm saying there is a bug in this code,
> >and the unit test above tends to confirm it.
> 
> If it is bug, you can post a patch to fix it,  it isn't related with my patches. But in my opinion, you don't
> use it in correct way, I don't think it is a bug.
I'll submit a patch for this.
The point is you are testing GSO on top of wrongly-constructed mbufs, so
it would be safer for you to fix this before doing more tests.
> >> >> >2/ in rte_gso_segment(), there is a loop like this:
> >> >> >
> >> >> >	while (pkt_seg) {
> >> >> >		rte_mbuf_refcnt_update(pkt_seg, -1);
> >> >> >		pkt_seg = pkt_seg->next;
> >> >> >	}
> >> >> >
> >> >> >   You change it to take in account the refcnt for ext mbufs.
> >> >> >
> >> >> >   I may have missed something but I wonder why it is not simply:
> >> >> >
> >> >> >	rte_pktmbuf_free(pkt_seg);
> >> >> >
> >> >> >   It will decrease the proper refcnt, and free the mbufs if they
> >> >> >   are not used.
> >> >> 
> >> >> Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
> >> >> mbuf free will trigger freeing original mbuf (only free_cb can do this).
> >> >
> >> >rte_pktmbuf_free() will also decerase the refcnt, and free the resources
> >> >when the refcnt reaches 0.
> >> >
> >> >It has some advantages compared to decrease the reference counter of all
> >> >segments:
> >> >
> >> >- no need to iterate the segments, there is only one function call
> >> >- no need to have a special case for ext mbufs like you did in your patch
> >> >- it may be safer, in case some segments have a refcnt == 1, because
> >> >  resources will be freed.
> >> 
> >> For external mbuf, attach only increases refcnt of shinfo, refcnt of mbuf won't be touched. For normal
> >> mbuf, attach only increase refcnt of mbuf, no shinfo there, no refcnt of shinfo increased.
> >
> >I suppose rte_gso_segment() can take any mbuf type as input: standard
> >mbuf, indirect mbuf, ext mbuf, or even a mbuf chaing containing segments of
> >different types.
> >
> >For instance, if you pass a chain of 2 mbufs:
> >- the first one is a direct mbuf containing the IP/TCP headers (orig_hdr)
> >- the second on is a mbuf pointing to an ext buffer (orig_payload)
> >
> >I expect that the resulting mbuf after calling gso contains a list of mbufs
> >like this:
> >- a first segment containing the IP/TCP headers (new_hdr)
> >- a payload segment pointing on the same ext buffer
> >
> >In theory, there is no reason that orig_hdr should be referenced by
> >another new mbuf, because it only contains headers (no data). If that's
> >the case, its refcnt is 1, and decreasing it to 0 without freeing it
> >is a bug.
> 
> For this user scenario, orig_m is owner of external buffer, small segmented mbufs reference
> it, you shouldn't free orig_m before all attached segmented mbufs are freed, isn't it?
In this case, orig_hdr has to be freed because it is a direct mbuf (not
shared).
The buffer pointed by orig_payload will be freed when all newly created
segments are freed.
> >
> >Anyway, there is maybe no issue in that case, but I was just suggesting
> >that using rte_pktmbuf_free() is easier to read, and safer than manually
> >decreasing the refcnt of each segment.
> >
> >
> >> >> >Again, sorry if this is not the issue your are referring to, but
> >> >> >in this case I really think that having a simple example code that
> >> >> >shows the issue would help.
> >> >> 
> >> >> Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
> >> >> unit test can show it.
> >> >
> >> >I don't see why. The PMDs and the libraries use the mbuf functions, why
> >> >a unit test couldn't call the same functions?
> >> >
> >> >> Let me summarize it here again. For original mbuf, there are two cases freeing
> >> >> it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
> >> >> free_cb call which will free original large & extended mbuf.
> >> >
> >> >OK
> >> >
> >> >> Case two is PMD driver will call free against
> >> >> original mbuf, that also will call free_cb to free attached extended buffer.
> >> >
> >> >OK
> >> >
> >> >And what makes that case 1 or case 2 is executed?
> >> >
> >> >> In case one free_cb must call
> >> >> rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
> >> >> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
> >> >> must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
> >> >
> >> >I think there is no need to change the free_cb API. It should work like
> >> >this:
> >> >
> >> >- virtio creates the original external mbuf (orig_m)
> >> >- gso will create a new mbuf referencing the external buffer (new_m)
> >> >
> >> >At this point, the shinfo has a refcnt of 2. The large buffer will be
> >> >freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
> >> >whatever the order.
> >> >
> >> >Regards,
> >> >Olivier
> >> 
> >> Oliver, the reason it works is I changed free_cb API, case 1 doesn't know orig_m, how you make it free orig_m in free_cb.
> >> The intention I change free_cb is to let it know orig_m, I saw OVS DPDK ran out out buffers and orig_m isn't freed, that is why
> >> I want to bring in this to fix the issue. Again, in case 1, nobody explicitly calls ret_pktmbuf_free(orig_m) except free_cb I changed.
> >
> >If nobody calls ret_pktmbuf_free(orig_m), it is a problem.
> >The free_cb is to free the buffer, not the mbuf.
> >
> >To me, it should work like this:
> >
> >1- virtio creates a mbuf attached to the ext buffer (the shinfo placement
> >   bug should be fixed)
> >2- gso create mbufs that reference the the same ext buf (by attaching the
> >   new mbuf)
> >3- gso must free the original mbuf
> 
> This is impossible, segmented mbufs are referencing external buffer in original mbuf,
> how do you free it? As I said rte_gso_segment has no way to free it, please tell me a way if
> you know how to do this.
As I said above, calling rte_mbuf_free(orig_m) will decrement the reference
counters on all segments. The segments will be returned to the pool if the
refcnt reaches 0.
> 
> >4- the PMD transmits the new mbufs, and frees them
> >
> >Whatever 3- or 4- is executed first, at the end we are sure that:
> >- all mbufs will be returned to the pool
> >- the linear buffer will be freed when the refcnt reaches 0.
> >
> >If this is still unclear, please, write a unit test like I did
> >above to show your issue.
> >
> >Regards,
> >Olivier
> >
> 
> The issue is in "3- gso must free the original mbuf", rte_pktmbuf_free(segmented_mbus) can't do it,
> rte_gso_segment is impossible to do it, only feasible point is free_cb, please let me know if you have
> a better way to free original mbuf and don't impact on segmented mbufs in PMD. My point is you must
> have a place to call rte_pktmbuf_free(rogin_m) explicitly, otherwise it is impossible  to return it to memory
> pool, please point out  where it can be called in my user scenario. I don't care how it is done, I just care it can
> fix my issue, please don't hesitate and send me a patch if you can, thanks a lot.
Sorry, but I don't see how I can be clearer to what I explained
in my previous answer.
Please, *provide a simple test example* without gso/vhost, and I can help
to make it work.
Regards,
Olivier
> >
> >
> >> free_cb must handle case 1 and case 2 in the same code, for case 1, caller_m is segmented new_m, for case 2, caller_m is orig_m.
> >> 
> >> loop in rte_gso_segement is handling original mbuf (this mbuf is multi-mbuf and includes multiple mbufs which are linked by next
> >> pointer), it isn't a problem at all.
> >> 
> >> Please show me code how you can fix my issue if you don't change free_cb, thank you.
> >> 
> >> struct shinfo_arg {
> >>        void *buf;
> >>        struct rte_mbuf *mbuf;
> >> };
> >> 
> >> virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
> >> {
> >>        struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
> >> 
> >>        rte_free(arg->buf);
> >>        if (caller_m != arg->mbuf)
> >>                rte_pktmbuf_free(arg->mbuf);
> >>        rte_free(arg);
> >> }
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-10-07  9:48                     ` Olivier Matz
@ 2020-10-09  9:51                       ` yang_y_yi
  2020-10-09 11:55                         ` Olivier Matz
  0 siblings, 1 reply; 21+ messages in thread
From: yang_y_yi @ 2020-10-09  9:51 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, jiayu.hu, thomas, yangyi01
Olivier, thank you so much for your reply, your patch post for vhost help me understand your concern better, I totally agree. For GSO case, let me show you a simple code to explain my issue.
struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
virtio_dev_extbuf_alloc(pkt, data_len)
struct rte_mbuf * pkt_seg1 = rte_pktmbuf_alloc(indirect_pool);
rte_pktmbuf_attach(pkt_seg1, pkt);
rte_mbuf_ext_refcnt_update(pkt, -1);
struct rte_mbuf * pkt_seg2 = rte_pktmbuf_alloc(indirect_pool);
rte_pktmbuf_attach(pkt_seg2, pkt);
rte_mbuf_ext_refcnt_update(pkt, -1);
struct rte_mbuf *pkt_segs[2] = {pkt_seg1, pkt_seg2};
rte_eth_tx_burst(dev->port_id, qid, pkt_segs, 2);
Is it a simple test you expect? The issue here is nobody explicitly calls rte_pktmbuf_free(pkt), rte_pktmbuf_free(pkt_segX) in PMD driver won't free "pkt", Is it clear to you now?
At 2020-10-07 17:48:21, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>Hi,
>
>On Sun, Sep 27, 2020 at 01:55:21PM +0800, yang_y_yi wrote:
>> Per GSO requirement, this is a must-have change, Jiayu, can you help review
>> this series?
>
>I can't ack this patch until I have a simple and clear test case (only
>with mbuf functions, without GSO or vhost) showing the issue we have
>today with current.
>
>> Olivier, if you used the old interface, maybe you need to change your code to
>> adapt this, I don't think we can have a better way to handle GSO case.
>
>Sorry, I don't get your point. What do I need to change in which code?
>
>(some more comments below)
>
>> At 2020-08-04 09:31:19, "yang_y_yi" <yang_y_yi@163.com> wrote:
>> 
>> At 2020-08-03 20:34:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >On Mon, Aug 03, 2020 at 05:42:13PM +0800, yang_y_yi wrote:
>> >> At 2020-08-03 16:11:39, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
>> >> >> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >> >Hi,
>> >> >> >
>> >> >> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
>> >> >> >> 
>> >> >> >> 
>> >> >> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >> >> >Hi,
>> >> >> >> >
>> >> >> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
>> >> >> >> >> From: Yi Yang <yangyi01@inspur.com>
>> >> >> >> >> 
>> >> >> >> >> In GSO case, segmented mbufs are attached to original
>> >> >> >> >> mbuf which can't be freed when it is external. The issue
>> >> >> >> >> is free_cb doesn't know original mbuf and doesn't free
>> >> >> >> >> it when refcnt of shinfo is 0.
>> >> >> >> >> 
>> >> >> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
>> >> >> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
>> >> >> >> >> cases should have different behaviors. free_cb won't
>> >> >> >> >> explicitly call rte_pktmbuf_free to free original mbuf
>> >> >> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
>> >> >> >> >> has to call rte_pktmbuf_free to free original mbuf if it
>> >> >> >> >> is freed by rte_pktmbuf_free segmented mbufs.
>> >> >> >> >> 
>> >> >> >> >> In order to fix this issue, free_cb interface has to been
>> >> >> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
>> >> >> >> >> mbuf pointer to free_cb, argument opaque can be defined
>> >> >> >> >> as a custom struct by user, it can includes original mbuf
>> >> >> >> >> pointer, user-defined free_cb can compare caller mbuf with
>> >> >> >> >> mbuf in opaque struct, free_cb should free original mbuf
>> >> >> >> >> if they are not same, this corresponds to rte_pktmbuf_free
>> >> >> >> >> segmented mbufs case, otherwise, free_cb won't free original
>> >> >> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
>> >> >> >> >> to free it.
>> >> >> >> >> 
>> >> >> >> >> Here is pseduo code to show two kind of cases.
>> >> >> >> >> 
>> >> >> >> >> case 1. rte_pktmbuf_free segmented mbufs
>> >> >> >> >> 
>> >> >> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
>> >> >> >> >>                       &gso_ctx,
>> >> >> >> >>                       /* segmented mbuf */
>> >> >> >> >>                       (struct rte_mbuf **)&gso_mbufs,
>> >> >> >> >>                       MAX_GSO_MBUFS);
>> >> >> >> >
>> >> >> >> >I'm sorry but it is not very clear to me what operations are done by
>> >> >> >> >rte_gso_segment().
>> >> >> >> >
>> >> >> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
>> >> >> >> >which do not deal with external buffers. Am I missing something?
>> >> >> >> >
>> >> >> >> >Are you able to show the issue only with mbuf functions? It would
>> >> >> >> >be helpful to understand what does not work.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >Thanks,
>> >> >> >> >Olivier
>> >> >> >> >
>> >> >> >> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
>> >> >> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
>> >> >> >
>> >> >> >I'm sill not sure to get your issue. Please, if you have a simple test
>> >> >> >case using only mbufs functions (without virtio, gso, ...), it would be
>> >> >> >very helpful because we will be sure that we are talking about the same
>> >> >> >thing. In case there is an issue, it can easily become a unit test.
>> >> >> 
>> >> >> Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
>> >> >> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
>> >> >> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
>> >> >> will call rte_eth_tx_burst to send them finally.
>> >> >>
>> >> >> >
>> >> >> >That said, I looked at vhost mbuf allocation and gso segmentation, and
>> >> >> >I found some strange things:
>> >> >> >
>> >> >> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
>> >> >> >   ext mbuf.
>> >> >> >
>> >> >> >   a/ The first one stores the shinfo struct in the mbuf, basically
>> >> >> >      like this:
>> >> >> >
>> >> >> >	pkt = rte_pktmbuf_alloc(mp);
>> >> >> >	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
>> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> >> >	shinfo->free_cb = virtio_dev_extbuf_free;
>> >> >> >	shinfo->fcb_opaque = buf;
>> >> >> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
>> >> >> >
>> >> >> >      I don't think it is a good idea, because there is no guarantee that
>> >> >> >      the mbuf won't be freed before the buffer. For instance, doing
>> >> >> >      this will probably fail:
>> >> >> >
>> >> >> >	pkt2 = rte_pktmbuf_alloc(mp);
>> >> >> >	rte_pktmbuf_attach(pkt2, pkt);
>> >> >> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
>> >> >> 
>> >> >> pkt is created by the application I can control, so I can control it where it will be freed, right?
>> >> >
>> >> >This example shows that mbufs allocated like this by the vhost
>> >> >driver are not constructed correctly. If an application attach a new
>> >> >packet (pkt2) to it and frees the original one (pkt), it may result in a
>> >> >memory corruption.
>> >> >
>> >> >Of course, to be tested and confirmed.
>> >> 
>> >> No, attach will increase refcnt of shinfo, free_cb only is called when  refcnt of shinfo is decreased to
>> >> 0, isn't it?
>> >
>> >When pkt will be freed, it will decrement the shinfo refcnt, and
>> >after it will be 1. So the buffer won't be freed. After that, the
>> >mbuf pkt will be detached, and will return to the mbuf pool. It means
>> >it can be reallocated, and the next user can overwrite shinfo which
>> >is still stored in the mbuf data.
>> 
>> I think this is an issue of DPDK itself, if external buffer in shinfo is freed, shinfo should be set to NULL, if user will
>> overwrite it, he/she should use the same way as a new external buffer is attached. 
>
>No, there is no issue in DPDK. The lifetime of shinfo should be at least
>the same the lifetime of the buffer. If shinfo is stored in initial mbuf
>(called "pkt" in the example above), the mbuf and shinfo can be freed while
>the buffer is still in use by another packet.
>
>> >I did a test to show it, see:
>> >http://git.droids-corp.org/?p=dpdk.git;a=commitdiff;h=a617494eeb01ff
>> >
>> >If you run the mbuf autotest, it segfaults.
>> 
>> I think your test is wrong, you're changing shinfo (which is being used) in wrong way, if free_bc is NULL, it will be invalid.
>
>I'm changing the data of a newly allocated mbuf, it is perfectly legal.
>I happens that it points the the shinfo that is still in use.
>
>
>> 
>> static inline void
>> rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
>>         rte_iova_t buf_iova, uint16_t buf_len,
>>         struct rte_mbuf_ext_shared_info *shinfo)
>> {
>>         /* mbuf should not be read-only */
>>         RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
>>         RTE_ASSERT(shinfo->free_cb != NULL);
>> 
>> For any shinfo operation, you should do it by rte_pktmbuf_attach_extbuf, you can't change it at will after that.
>> 
>> >
>> >
>> >> 
>> >> >
>> >> >> 
>> >> >> >
>> >> >> >      To do this properly, the mbuf refcnt should be increased, and
>> >> >> >      the mbuf should be freed in the callback. But I don't think it's
>> >> >> >      worth doing it, given the second path (b/) looks good to me.
>> >> >> >
>> >> >> >   b/ The second path stores the shinfo struct at the end of the
>> >> >> >      allocated buffer, like this:
>> >> >> >
>> >> >> >	pkt = rte_pktmbuf_alloc(mp);
>> >> >> >	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
>> >> >> >	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
>> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> >> >	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
>> >> >> >					      virtio_dev_extbuf_free, buf);
>> >> >> >
>> >> >> >      I think this is correct, because we have the guarantee that shinfo
>> >> >> >      exists as long as the buffer exists.
>> >> >> 
>> >> >> What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
>> >> >> free original mbuf which owns shinfo buffer.
>> >> >
>> >> >I don't get your question.
>> >> >
>> >> >I'm just saying that this code path looks correct, compared to
>> >> >the previous one.
>> >> 
>> >> I think you're challenging principle of external mbuf, that isn't the thing I address.
>> >
>> >I'm not challenging anything, I'm saying there is a bug in this code,
>> >and the unit test above tends to confirm it.
>> 
>> If it is bug, you can post a patch to fix it,  it isn't related with my patches. But in my opinion, you don't
>> use it in correct way, I don't think it is a bug.
>
>I'll submit a patch for this.
>
>The point is you are testing GSO on top of wrongly-constructed mbufs, so
>it would be safer for you to fix this before doing more tests.
>
>
>> >> >> >2/ in rte_gso_segment(), there is a loop like this:
>> >> >> >
>> >> >> >	while (pkt_seg) {
>> >> >> >		rte_mbuf_refcnt_update(pkt_seg, -1);
>> >> >> >		pkt_seg = pkt_seg->next;
>> >> >> >	}
>> >> >> >
>> >> >> >   You change it to take in account the refcnt for ext mbufs.
>> >> >> >
>> >> >> >   I may have missed something but I wonder why it is not simply:
>> >> >> >
>> >> >> >	rte_pktmbuf_free(pkt_seg);
>> >> >> >
>> >> >> >   It will decrease the proper refcnt, and free the mbufs if they
>> >> >> >   are not used.
>> >> >> 
>> >> >> Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
>> >> >> mbuf free will trigger freeing original mbuf (only free_cb can do this).
>> >> >
>> >> >rte_pktmbuf_free() will also decerase the refcnt, and free the resources
>> >> >when the refcnt reaches 0.
>> >> >
>> >> >It has some advantages compared to decrease the reference counter of all
>> >> >segments:
>> >> >
>> >> >- no need to iterate the segments, there is only one function call
>> >> >- no need to have a special case for ext mbufs like you did in your patch
>> >> >- it may be safer, in case some segments have a refcnt == 1, because
>> >> >  resources will be freed.
>> >> 
>> >> For external mbuf, attach only increases refcnt of shinfo, refcnt of mbuf won't be touched. For normal
>> >> mbuf, attach only increase refcnt of mbuf, no shinfo there, no refcnt of shinfo increased.
>> >
>> >I suppose rte_gso_segment() can take any mbuf type as input: standard
>> >mbuf, indirect mbuf, ext mbuf, or even a mbuf chaing containing segments of
>> >different types.
>> >
>> >For instance, if you pass a chain of 2 mbufs:
>> >- the first one is a direct mbuf containing the IP/TCP headers (orig_hdr)
>> >- the second on is a mbuf pointing to an ext buffer (orig_payload)
>> >
>> >I expect that the resulting mbuf after calling gso contains a list of mbufs
>> >like this:
>> >- a first segment containing the IP/TCP headers (new_hdr)
>> >- a payload segment pointing on the same ext buffer
>> >
>> >In theory, there is no reason that orig_hdr should be referenced by
>> >another new mbuf, because it only contains headers (no data). If that's
>> >the case, its refcnt is 1, and decreasing it to 0 without freeing it
>> >is a bug.
>> 
>> For this user scenario, orig_m is owner of external buffer, small segmented mbufs reference
>> it, you shouldn't free orig_m before all attached segmented mbufs are freed, isn't it?
>
>In this case, orig_hdr has to be freed because it is a direct mbuf (not
>shared).
>The buffer pointed by orig_payload will be freed when all newly created
>segments are freed.
>
>
>> >
>> >Anyway, there is maybe no issue in that case, but I was just suggesting
>> >that using rte_pktmbuf_free() is easier to read, and safer than manually
>> >decreasing the refcnt of each segment.
>> >
>> >
>> >> >> >Again, sorry if this is not the issue your are referring to, but
>> >> >> >in this case I really think that having a simple example code that
>> >> >> >shows the issue would help.
>> >> >> 
>> >> >> Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
>> >> >> unit test can show it.
>> >> >
>> >> >I don't see why. The PMDs and the libraries use the mbuf functions, why
>> >> >a unit test couldn't call the same functions?
>> >> >
>> >> >> Let me summarize it here again. For original mbuf, there are two cases freeing
>> >> >> it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
>> >> >> free_cb call which will free original large & extended mbuf.
>> >> >
>> >> >OK
>> >> >
>> >> >> Case two is PMD driver will call free against
>> >> >> original mbuf, that also will call free_cb to free attached extended buffer.
>> >> >
>> >> >OK
>> >> >
>> >> >And what makes that case 1 or case 2 is executed?
>> >> >
>> >> >> In case one free_cb must call
>> >> >> rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
>> >> >> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
>> >> >> must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
>> >> >
>> >> >I think there is no need to change the free_cb API. It should work like
>> >> >this:
>> >> >
>> >> >- virtio creates the original external mbuf (orig_m)
>> >> >- gso will create a new mbuf referencing the external buffer (new_m)
>> >> >
>> >> >At this point, the shinfo has a refcnt of 2. The large buffer will be
>> >> >freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
>> >> >whatever the order.
>> >> >
>> >> >Regards,
>> >> >Olivier
>> >> 
>> >> Oliver, the reason it works is I changed free_cb API, case 1 doesn't know orig_m, how you make it free orig_m in free_cb.
>> >> The intention I change free_cb is to let it know orig_m, I saw OVS DPDK ran out out buffers and orig_m isn't freed, that is why
>> >> I want to bring in this to fix the issue. Again, in case 1, nobody explicitly calls ret_pktmbuf_free(orig_m) except free_cb I changed.
>> >
>> >If nobody calls ret_pktmbuf_free(orig_m), it is a problem.
>> >The free_cb is to free the buffer, not the mbuf.
>> >
>> >To me, it should work like this:
>> >
>> >1- virtio creates a mbuf attached to the ext buffer (the shinfo placement
>> >   bug should be fixed)
>> >2- gso create mbufs that reference the the same ext buf (by attaching the
>> >   new mbuf)
>> >3- gso must free the original mbuf
>> 
>> This is impossible, segmented mbufs are referencing external buffer in original mbuf,
>> how do you free it? As I said rte_gso_segment has no way to free it, please tell me a way if
>> you know how to do this.
>
>As I said above, calling rte_mbuf_free(orig_m) will decrement the reference
>counters on all segments. The segments will be returned to the pool if the
>refcnt reaches 0.
>
>> 
>> >4- the PMD transmits the new mbufs, and frees them
>> >
>> >Whatever 3- or 4- is executed first, at the end we are sure that:
>> >- all mbufs will be returned to the pool
>> >- the linear buffer will be freed when the refcnt reaches 0.
>> >
>> >If this is still unclear, please, write a unit test like I did
>> >above to show your issue.
>> >
>> >Regards,
>> >Olivier
>> >
>> 
>> The issue is in "3- gso must free the original mbuf", rte_pktmbuf_free(segmented_mbus) can't do it,
>> rte_gso_segment is impossible to do it, only feasible point is free_cb, please let me know if you have
>> a better way to free original mbuf and don't impact on segmented mbufs in PMD. My point is you must
>> have a place to call rte_pktmbuf_free(rogin_m) explicitly, otherwise it is impossible  to return it to memory
>> pool, please point out  where it can be called in my user scenario. I don't care how it is done, I just care it can
>> fix my issue, please don't hesitate and send me a patch if you can, thanks a lot.
>
>Sorry, but I don't see how I can be clearer to what I explained
>in my previous answer.
>
>Please, *provide a simple test example* without gso/vhost, and I can help
>to make it work.
>
>
>Regards,
>Olivier
>
>
>> >
>> >
>> >> free_cb must handle case 1 and case 2 in the same code, for case 1, caller_m is segmented new_m, for case 2, caller_m is orig_m.
>> >> 
>> >> loop in rte_gso_segement is handling original mbuf (this mbuf is multi-mbuf and includes multiple mbufs which are linked by next
>> >> pointer), it isn't a problem at all.
>> >> 
>> >> Please show me code how you can fix my issue if you don't change free_cb, thank you.
>> >> 
>> >> struct shinfo_arg {
>> >>        void *buf;
>> >>        struct rte_mbuf *mbuf;
>> >> };
>> >> 
>> >> virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
>> >> {
>> >>        struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
>> >> 
>> >>        rte_free(arg->buf);
>> >>        if (caller_m != arg->mbuf)
>> >>                rte_pktmbuf_free(arg->mbuf);
>> >>        rte_free(arg);
>> >> }
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-10-09  9:51                       ` yang_y_yi
@ 2020-10-09 11:55                         ` Olivier Matz
  2020-10-10  1:49                           ` yang_y_yi
  0 siblings, 1 reply; 21+ messages in thread
From: Olivier Matz @ 2020-10-09 11:55 UTC (permalink / raw)
  To: yang_y_yi; +Cc: dev, jiayu.hu, thomas, yangyi01
Hi,
On Fri, Oct 09, 2020 at 05:51:23PM +0800, yang_y_yi wrote:
> Olivier, thank you so much for your reply, your patch post for vhost help me understand your concern better, I totally agree. For GSO case, let me show you a simple code to explain my issue.
> 
> 
> 
> 
> 
> struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> virtio_dev_extbuf_alloc(pkt, data_len)
> struct rte_mbuf * pkt_seg1 = rte_pktmbuf_alloc(indirect_pool);
> 
> 
> rte_pktmbuf_attach(pkt_seg1, pkt);
> rte_mbuf_ext_refcnt_update(pkt, -1);
> 
> struct rte_mbuf * pkt_seg2 = rte_pktmbuf_alloc(indirect_pool);
> rte_pktmbuf_attach(pkt_seg2, pkt);
> rte_mbuf_ext_refcnt_update(pkt, -1);
> struct rte_mbuf *pkt_segs[2] = {pkt_seg1, pkt_seg2};
> 
> rte_eth_tx_burst(dev->port_id, qid, pkt_segs, 2);
> 
> 
> Is it a simple test you expect? The issue here is nobody explicitly calls
> rte_pktmbuf_free(pkt), rte_pktmbuf_free(pkt_segX) in PMD driver won't free
> "pkt", Is it clear to you now?
Thank you for the small code. Yes, this is what I expected.
The proper way to do this is something like this:
	/* create a mbuf, and attach it to an external buffer */
	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
	virtio_dev_extbuf_alloc(pkt, data_len)
	/* create a new mbuf, attach it to the previous one: the resulting
	 * mbuf is also an "external mbuf" (is has the EXT_ATTACHED_MBUF
	 * flag, and its data is stored in the ext buffer.
	 * See an example here: https://www.droids-corp.org/~zer0/ext-mbuf.svg
	 */
	struct rte_mbuf *pkt_seg1 = rte_pktmbuf_alloc(indirect_pool);
	rte_pktmbuf_attach(pkt_seg1, pkt);
	/* do the same another time */
	struct rte_mbuf *pkt_seg2 = rte_pktmbuf_alloc(indirect_pool);
	rte_pktmbuf_attach(pkt_seg2, pkt);
	/* release the original pkt, we don't need it anymore */
	rte_pktmbuf_free(pkt);
	/* send the new segments, they will be freed by the driver once
	 * Tx is done. When the last packet referencing the external buffer
	 * is freed, the free callback of the buffer will be invoked. */
	struct rte_mbuf *pkt_segs[2] = {pkt_seg1, pkt_seg2};
	rte_eth_tx_burst(dev->port_id, qid, pkt_segs, 2);
Hope this is clearer now.
Regards,
Olivier
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> At 2020-10-07 17:48:21, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >Hi,
> >
> >On Sun, Sep 27, 2020 at 01:55:21PM +0800, yang_y_yi wrote:
> >> Per GSO requirement, this is a must-have change, Jiayu, can you help review
> >> this series?
> >
> >I can't ack this patch until I have a simple and clear test case (only
> >with mbuf functions, without GSO or vhost) showing the issue we have
> >today with current.
> >
> >> Olivier, if you used the old interface, maybe you need to change your code to
> >> adapt this, I don't think we can have a better way to handle GSO case.
> >
> >Sorry, I don't get your point. What do I need to change in which code?
> >
> >(some more comments below)
> >
> >> At 2020-08-04 09:31:19, "yang_y_yi" <yang_y_yi@163.com> wrote:
> >> 
> >> At 2020-08-03 20:34:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >On Mon, Aug 03, 2020 at 05:42:13PM +0800, yang_y_yi wrote:
> >> >> At 2020-08-03 16:11:39, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >> >On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
> >> >> >> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >> >> >Hi,
> >> >> >> >
> >> >> >> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
> >> >> >> >> 
> >> >> >> >> 
> >> >> >> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >> >> >> >Hi,
> >> >> >> >> >
> >> >> >> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
> >> >> >> >> >> From: Yi Yang <yangyi01@inspur.com>
> >> >> >> >> >> 
> >> >> >> >> >> In GSO case, segmented mbufs are attached to original
> >> >> >> >> >> mbuf which can't be freed when it is external. The issue
> >> >> >> >> >> is free_cb doesn't know original mbuf and doesn't free
> >> >> >> >> >> it when refcnt of shinfo is 0.
> >> >> >> >> >> 
> >> >> >> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
> >> >> >> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
> >> >> >> >> >> cases should have different behaviors. free_cb won't
> >> >> >> >> >> explicitly call rte_pktmbuf_free to free original mbuf
> >> >> >> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
> >> >> >> >> >> has to call rte_pktmbuf_free to free original mbuf if it
> >> >> >> >> >> is freed by rte_pktmbuf_free segmented mbufs.
> >> >> >> >> >> 
> >> >> >> >> >> In order to fix this issue, free_cb interface has to been
> >> >> >> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
> >> >> >> >> >> mbuf pointer to free_cb, argument opaque can be defined
> >> >> >> >> >> as a custom struct by user, it can includes original mbuf
> >> >> >> >> >> pointer, user-defined free_cb can compare caller mbuf with
> >> >> >> >> >> mbuf in opaque struct, free_cb should free original mbuf
> >> >> >> >> >> if they are not same, this corresponds to rte_pktmbuf_free
> >> >> >> >> >> segmented mbufs case, otherwise, free_cb won't free original
> >> >> >> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
> >> >> >> >> >> to free it.
> >> >> >> >> >> 
> >> >> >> >> >> Here is pseduo code to show two kind of cases.
> >> >> >> >> >> 
> >> >> >> >> >> case 1. rte_pktmbuf_free segmented mbufs
> >> >> >> >> >> 
> >> >> >> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
> >> >> >> >> >>                       &gso_ctx,
> >> >> >> >> >>                       /* segmented mbuf */
> >> >> >> >> >>                       (struct rte_mbuf **)&gso_mbufs,
> >> >> >> >> >>                       MAX_GSO_MBUFS);
> >> >> >> >> >
> >> >> >> >> >I'm sorry but it is not very clear to me what operations are done by
> >> >> >> >> >rte_gso_segment().
> >> >> >> >> >
> >> >> >> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
> >> >> >> >> >which do not deal with external buffers. Am I missing something?
> >> >> >> >> >
> >> >> >> >> >Are you able to show the issue only with mbuf functions? It would
> >> >> >> >> >be helpful to understand what does not work.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >Thanks,
> >> >> >> >> >Olivier
> >> >> >> >> >
> >> >> >> >> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
> >> >> >> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
> >> >> >> >
> >> >> >> >I'm sill not sure to get your issue. Please, if you have a simple test
> >> >> >> >case using only mbufs functions (without virtio, gso, ...), it would be
> >> >> >> >very helpful because we will be sure that we are talking about the same
> >> >> >> >thing. In case there is an issue, it can easily become a unit test.
> >> >> >> 
> >> >> >> Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
> >> >> >> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
> >> >> >> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
> >> >> >> will call rte_eth_tx_burst to send them finally.
> >> >> >>
> >> >> >> >
> >> >> >> >That said, I looked at vhost mbuf allocation and gso segmentation, and
> >> >> >> >I found some strange things:
> >> >> >> >
> >> >> >> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
> >> >> >> >   ext mbuf.
> >> >> >> >
> >> >> >> >   a/ The first one stores the shinfo struct in the mbuf, basically
> >> >> >> >      like this:
> >> >> >> >
> >> >> >> >	pkt = rte_pktmbuf_alloc(mp);
> >> >> >> >	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> >> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >> >> >> >	shinfo->free_cb = virtio_dev_extbuf_free;
> >> >> >> >	shinfo->fcb_opaque = buf;
> >> >> >> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
> >> >> >> >
> >> >> >> >      I don't think it is a good idea, because there is no guarantee that
> >> >> >> >      the mbuf won't be freed before the buffer. For instance, doing
> >> >> >> >      this will probably fail:
> >> >> >> >
> >> >> >> >	pkt2 = rte_pktmbuf_alloc(mp);
> >> >> >> >	rte_pktmbuf_attach(pkt2, pkt);
> >> >> >> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
> >> >> >> 
> >> >> >> pkt is created by the application I can control, so I can control it where it will be freed, right?
> >> >> >
> >> >> >This example shows that mbufs allocated like this by the vhost
> >> >> >driver are not constructed correctly. If an application attach a new
> >> >> >packet (pkt2) to it and frees the original one (pkt), it may result in a
> >> >> >memory corruption.
> >> >> >
> >> >> >Of course, to be tested and confirmed.
> >> >> 
> >> >> No, attach will increase refcnt of shinfo, free_cb only is called when  refcnt of shinfo is decreased to
> >> >> 0, isn't it?
> >> >
> >> >When pkt will be freed, it will decrement the shinfo refcnt, and
> >> >after it will be 1. So the buffer won't be freed. After that, the
> >> >mbuf pkt will be detached, and will return to the mbuf pool. It means
> >> >it can be reallocated, and the next user can overwrite shinfo which
> >> >is still stored in the mbuf data.
> >> 
> >> I think this is an issue of DPDK itself, if external buffer in shinfo is freed, shinfo should be set to NULL, if user will
> >> overwrite it, he/she should use the same way as a new external buffer is attached. 
> >
> >No, there is no issue in DPDK. The lifetime of shinfo should be at least
> >the same the lifetime of the buffer. If shinfo is stored in initial mbuf
> >(called "pkt" in the example above), the mbuf and shinfo can be freed while
> >the buffer is still in use by another packet.
> >
> >> >I did a test to show it, see:
> >> >http://git.droids-corp.org/?p=dpdk.git;a=commitdiff;h=a617494eeb01ff
> >> >
> >> >If you run the mbuf autotest, it segfaults.
> >> 
> >> I think your test is wrong, you're changing shinfo (which is being used) in wrong way, if free_bc is NULL, it will be invalid.
> >
> >I'm changing the data of a newly allocated mbuf, it is perfectly legal.
> >I happens that it points the the shinfo that is still in use.
> >
> >
> >> 
> >> static inline void
> >> rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> >>         rte_iova_t buf_iova, uint16_t buf_len,
> >>         struct rte_mbuf_ext_shared_info *shinfo)
> >> {
> >>         /* mbuf should not be read-only */
> >>         RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
> >>         RTE_ASSERT(shinfo->free_cb != NULL);
> >> 
> >> For any shinfo operation, you should do it by rte_pktmbuf_attach_extbuf, you can't change it at will after that.
> >> 
> >> >
> >> >
> >> >> 
> >> >> >
> >> >> >> 
> >> >> >> >
> >> >> >> >      To do this properly, the mbuf refcnt should be increased, and
> >> >> >> >      the mbuf should be freed in the callback. But I don't think it's
> >> >> >> >      worth doing it, given the second path (b/) looks good to me.
> >> >> >> >
> >> >> >> >   b/ The second path stores the shinfo struct at the end of the
> >> >> >> >      allocated buffer, like this:
> >> >> >> >
> >> >> >> >	pkt = rte_pktmbuf_alloc(mp);
> >> >> >> >	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
> >> >> >> >	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> >> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >> >> >> >	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> >> >> >> >					      virtio_dev_extbuf_free, buf);
> >> >> >> >
> >> >> >> >      I think this is correct, because we have the guarantee that shinfo
> >> >> >> >      exists as long as the buffer exists.
> >> >> >> 
> >> >> >> What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
> >> >> >> free original mbuf which owns shinfo buffer.
> >> >> >
> >> >> >I don't get your question.
> >> >> >
> >> >> >I'm just saying that this code path looks correct, compared to
> >> >> >the previous one.
> >> >> 
> >> >> I think you're challenging principle of external mbuf, that isn't the thing I address.
> >> >
> >> >I'm not challenging anything, I'm saying there is a bug in this code,
> >> >and the unit test above tends to confirm it.
> >> 
> >> If it is bug, you can post a patch to fix it,  it isn't related with my patches. But in my opinion, you don't
> >> use it in correct way, I don't think it is a bug.
> >
> >I'll submit a patch for this.
> >
> >The point is you are testing GSO on top of wrongly-constructed mbufs, so
> >it would be safer for you to fix this before doing more tests.
> >
> >
> >> >> >> >2/ in rte_gso_segment(), there is a loop like this:
> >> >> >> >
> >> >> >> >	while (pkt_seg) {
> >> >> >> >		rte_mbuf_refcnt_update(pkt_seg, -1);
> >> >> >> >		pkt_seg = pkt_seg->next;
> >> >> >> >	}
> >> >> >> >
> >> >> >> >   You change it to take in account the refcnt for ext mbufs.
> >> >> >> >
> >> >> >> >   I may have missed something but I wonder why it is not simply:
> >> >> >> >
> >> >> >> >	rte_pktmbuf_free(pkt_seg);
> >> >> >> >
> >> >> >> >   It will decrease the proper refcnt, and free the mbufs if they
> >> >> >> >   are not used.
> >> >> >> 
> >> >> >> Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
> >> >> >> mbuf free will trigger freeing original mbuf (only free_cb can do this).
> >> >> >
> >> >> >rte_pktmbuf_free() will also decerase the refcnt, and free the resources
> >> >> >when the refcnt reaches 0.
> >> >> >
> >> >> >It has some advantages compared to decrease the reference counter of all
> >> >> >segments:
> >> >> >
> >> >> >- no need to iterate the segments, there is only one function call
> >> >> >- no need to have a special case for ext mbufs like you did in your patch
> >> >> >- it may be safer, in case some segments have a refcnt == 1, because
> >> >> >  resources will be freed.
> >> >> 
> >> >> For external mbuf, attach only increases refcnt of shinfo, refcnt of mbuf won't be touched. For normal
> >> >> mbuf, attach only increase refcnt of mbuf, no shinfo there, no refcnt of shinfo increased.
> >> >
> >> >I suppose rte_gso_segment() can take any mbuf type as input: standard
> >> >mbuf, indirect mbuf, ext mbuf, or even a mbuf chaing containing segments of
> >> >different types.
> >> >
> >> >For instance, if you pass a chain of 2 mbufs:
> >> >- the first one is a direct mbuf containing the IP/TCP headers (orig_hdr)
> >> >- the second on is a mbuf pointing to an ext buffer (orig_payload)
> >> >
> >> >I expect that the resulting mbuf after calling gso contains a list of mbufs
> >> >like this:
> >> >- a first segment containing the IP/TCP headers (new_hdr)
> >> >- a payload segment pointing on the same ext buffer
> >> >
> >> >In theory, there is no reason that orig_hdr should be referenced by
> >> >another new mbuf, because it only contains headers (no data). If that's
> >> >the case, its refcnt is 1, and decreasing it to 0 without freeing it
> >> >is a bug.
> >> 
> >> For this user scenario, orig_m is owner of external buffer, small segmented mbufs reference
> >> it, you shouldn't free orig_m before all attached segmented mbufs are freed, isn't it?
> >
> >In this case, orig_hdr has to be freed because it is a direct mbuf (not
> >shared).
> >The buffer pointed by orig_payload will be freed when all newly created
> >segments are freed.
> >
> >
> >> >
> >> >Anyway, there is maybe no issue in that case, but I was just suggesting
> >> >that using rte_pktmbuf_free() is easier to read, and safer than manually
> >> >decreasing the refcnt of each segment.
> >> >
> >> >
> >> >> >> >Again, sorry if this is not the issue your are referring to, but
> >> >> >> >in this case I really think that having a simple example code that
> >> >> >> >shows the issue would help.
> >> >> >> 
> >> >> >> Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
> >> >> >> unit test can show it.
> >> >> >
> >> >> >I don't see why. The PMDs and the libraries use the mbuf functions, why
> >> >> >a unit test couldn't call the same functions?
> >> >> >
> >> >> >> Let me summarize it here again. For original mbuf, there are two cases freeing
> >> >> >> it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
> >> >> >> free_cb call which will free original large & extended mbuf.
> >> >> >
> >> >> >OK
> >> >> >
> >> >> >> Case two is PMD driver will call free against
> >> >> >> original mbuf, that also will call free_cb to free attached extended buffer.
> >> >> >
> >> >> >OK
> >> >> >
> >> >> >And what makes that case 1 or case 2 is executed?
> >> >> >
> >> >> >> In case one free_cb must call
> >> >> >> rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
> >> >> >> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
> >> >> >> must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
> >> >> >
> >> >> >I think there is no need to change the free_cb API. It should work like
> >> >> >this:
> >> >> >
> >> >> >- virtio creates the original external mbuf (orig_m)
> >> >> >- gso will create a new mbuf referencing the external buffer (new_m)
> >> >> >
> >> >> >At this point, the shinfo has a refcnt of 2. The large buffer will be
> >> >> >freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
> >> >> >whatever the order.
> >> >> >
> >> >> >Regards,
> >> >> >Olivier
> >> >> 
> >> >> Oliver, the reason it works is I changed free_cb API, case 1 doesn't know orig_m, how you make it free orig_m in free_cb.
> >> >> The intention I change free_cb is to let it know orig_m, I saw OVS DPDK ran out out buffers and orig_m isn't freed, that is why
> >> >> I want to bring in this to fix the issue. Again, in case 1, nobody explicitly calls ret_pktmbuf_free(orig_m) except free_cb I changed.
> >> >
> >> >If nobody calls ret_pktmbuf_free(orig_m), it is a problem.
> >> >The free_cb is to free the buffer, not the mbuf.
> >> >
> >> >To me, it should work like this:
> >> >
> >> >1- virtio creates a mbuf attached to the ext buffer (the shinfo placement
> >> >   bug should be fixed)
> >> >2- gso create mbufs that reference the the same ext buf (by attaching the
> >> >   new mbuf)
> >> >3- gso must free the original mbuf
> >> 
> >> This is impossible, segmented mbufs are referencing external buffer in original mbuf,
> >> how do you free it? As I said rte_gso_segment has no way to free it, please tell me a way if
> >> you know how to do this.
> >
> >As I said above, calling rte_mbuf_free(orig_m) will decrement the reference
> >counters on all segments. The segments will be returned to the pool if the
> >refcnt reaches 0.
> >
> >> 
> >> >4- the PMD transmits the new mbufs, and frees them
> >> >
> >> >Whatever 3- or 4- is executed first, at the end we are sure that:
> >> >- all mbufs will be returned to the pool
> >> >- the linear buffer will be freed when the refcnt reaches 0.
> >> >
> >> >If this is still unclear, please, write a unit test like I did
> >> >above to show your issue.
> >> >
> >> >Regards,
> >> >Olivier
> >> >
> >> 
> >> The issue is in "3- gso must free the original mbuf", rte_pktmbuf_free(segmented_mbus) can't do it,
> >> rte_gso_segment is impossible to do it, only feasible point is free_cb, please let me know if you have
> >> a better way to free original mbuf and don't impact on segmented mbufs in PMD. My point is you must
> >> have a place to call rte_pktmbuf_free(rogin_m) explicitly, otherwise it is impossible  to return it to memory
> >> pool, please point out  where it can be called in my user scenario. I don't care how it is done, I just care it can
> >> fix my issue, please don't hesitate and send me a patch if you can, thanks a lot.
> >
> >Sorry, but I don't see how I can be clearer to what I explained
> >in my previous answer.
> >
> >Please, *provide a simple test example* without gso/vhost, and I can help
> >to make it work.
> >
> >
> >Regards,
> >Olivier
> >
> >
> >> >
> >> >
> >> >> free_cb must handle case 1 and case 2 in the same code, for case 1, caller_m is segmented new_m, for case 2, caller_m is orig_m.
> >> >> 
> >> >> loop in rte_gso_segement is handling original mbuf (this mbuf is multi-mbuf and includes multiple mbufs which are linked by next
> >> >> pointer), it isn't a problem at all.
> >> >> 
> >> >> Please show me code how you can fix my issue if you don't change free_cb, thank you.
> >> >> 
> >> >> struct shinfo_arg {
> >> >>        void *buf;
> >> >>        struct rte_mbuf *mbuf;
> >> >> };
> >> >> 
> >> >> virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
> >> >> {
> >> >>        struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
> >> >> 
> >> >>        rte_free(arg->buf);
> >> >>        if (caller_m != arg->mbuf)
> >> >>                rte_pktmbuf_free(arg->mbuf);
> >> >>        rte_free(arg);
> >> >> }
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-10-09 11:55                         ` Olivier Matz
@ 2020-10-10  1:49                           ` yang_y_yi
  2020-10-14 13:55                             ` Olivier Matz
  0 siblings, 1 reply; 21+ messages in thread
From: yang_y_yi @ 2020-10-10  1:49 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, jiayu.hu, thomas, yangyi01
Olivier, thank you so much for helping figure out this, it does work as the code you changed, so we can fix the issue without this patch series. My question is can it also work for normal mbuf or indirect mbuf? (I mean pkt is direct mbuf or indirect mbuf)
At 2020-10-09 19:55:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>Hi,
>
>On Fri, Oct 09, 2020 at 05:51:23PM +0800, yang_y_yi wrote:
>> Olivier, thank you so much for your reply, your patch post for vhost help me understand your concern better, I totally agree. For GSO case, let me show you a simple code to explain my issue.
>> 
>> 
>> 
>> 
>> 
>> struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
>> virtio_dev_extbuf_alloc(pkt, data_len)
>> struct rte_mbuf * pkt_seg1 = rte_pktmbuf_alloc(indirect_pool);
>> 
>> 
>> rte_pktmbuf_attach(pkt_seg1, pkt);
>> rte_mbuf_ext_refcnt_update(pkt, -1);
>> 
>> struct rte_mbuf * pkt_seg2 = rte_pktmbuf_alloc(indirect_pool);
>> rte_pktmbuf_attach(pkt_seg2, pkt);
>> rte_mbuf_ext_refcnt_update(pkt, -1);
>> struct rte_mbuf *pkt_segs[2] = {pkt_seg1, pkt_seg2};
>> 
>> rte_eth_tx_burst(dev->port_id, qid, pkt_segs, 2);
>> 
>> 
>> Is it a simple test you expect? The issue here is nobody explicitly calls
>> rte_pktmbuf_free(pkt), rte_pktmbuf_free(pkt_segX) in PMD driver won't free
>> "pkt", Is it clear to you now?
>
>
>Thank you for the small code. Yes, this is what I expected.
>
>The proper way to do this is something like this:
>
>	/* create a mbuf, and attach it to an external buffer */
>	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
>	virtio_dev_extbuf_alloc(pkt, data_len)
>
>	/* create a new mbuf, attach it to the previous one: the resulting
>	 * mbuf is also an "external mbuf" (is has the EXT_ATTACHED_MBUF
>	 * flag, and its data is stored in the ext buffer.
>	 * See an example here: https://www.droids-corp.org/~zer0/ext-mbuf.svg
>	 */
>	struct rte_mbuf *pkt_seg1 = rte_pktmbuf_alloc(indirect_pool);
>	rte_pktmbuf_attach(pkt_seg1, pkt);
>
>	/* do the same another time */
>	struct rte_mbuf *pkt_seg2 = rte_pktmbuf_alloc(indirect_pool);
>	rte_pktmbuf_attach(pkt_seg2, pkt);
>
>	/* release the original pkt, we don't need it anymore */
>	rte_pktmbuf_free(pkt);
>
>	/* send the new segments, they will be freed by the driver once
>	 * Tx is done. When the last packet referencing the external buffer
>	 * is freed, the free callback of the buffer will be invoked. */
>	struct rte_mbuf *pkt_segs[2] = {pkt_seg1, pkt_seg2};
>	rte_eth_tx_burst(dev->port_id, qid, pkt_segs, 2);
>
>Hope this is clearer now.
>
>Regards,
>Olivier
>
>
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> At 2020-10-07 17:48:21, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >Hi,
>> >
>> >On Sun, Sep 27, 2020 at 01:55:21PM +0800, yang_y_yi wrote:
>> >> Per GSO requirement, this is a must-have change, Jiayu, can you help review
>> >> this series?
>> >
>> >I can't ack this patch until I have a simple and clear test case (only
>> >with mbuf functions, without GSO or vhost) showing the issue we have
>> >today with current.
>> >
>> >> Olivier, if you used the old interface, maybe you need to change your code to
>> >> adapt this, I don't think we can have a better way to handle GSO case.
>> >
>> >Sorry, I don't get your point. What do I need to change in which code?
>> >
>> >(some more comments below)
>> >
>> >> At 2020-08-04 09:31:19, "yang_y_yi" <yang_y_yi@163.com> wrote:
>> >> 
>> >> At 2020-08-03 20:34:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >On Mon, Aug 03, 2020 at 05:42:13PM +0800, yang_y_yi wrote:
>> >> >> At 2020-08-03 16:11:39, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >> >On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
>> >> >> >> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >> >> >Hi,
>> >> >> >> >
>> >> >> >> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
>> >> >> >> >> 
>> >> >> >> >> 
>> >> >> >> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >> >> >> >Hi,
>> >> >> >> >> >
>> >> >> >> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
>> >> >> >> >> >> From: Yi Yang <yangyi01@inspur.com>
>> >> >> >> >> >> 
>> >> >> >> >> >> In GSO case, segmented mbufs are attached to original
>> >> >> >> >> >> mbuf which can't be freed when it is external. The issue
>> >> >> >> >> >> is free_cb doesn't know original mbuf and doesn't free
>> >> >> >> >> >> it when refcnt of shinfo is 0.
>> >> >> >> >> >> 
>> >> >> >> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
>> >> >> >> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
>> >> >> >> >> >> cases should have different behaviors. free_cb won't
>> >> >> >> >> >> explicitly call rte_pktmbuf_free to free original mbuf
>> >> >> >> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
>> >> >> >> >> >> has to call rte_pktmbuf_free to free original mbuf if it
>> >> >> >> >> >> is freed by rte_pktmbuf_free segmented mbufs.
>> >> >> >> >> >> 
>> >> >> >> >> >> In order to fix this issue, free_cb interface has to been
>> >> >> >> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
>> >> >> >> >> >> mbuf pointer to free_cb, argument opaque can be defined
>> >> >> >> >> >> as a custom struct by user, it can includes original mbuf
>> >> >> >> >> >> pointer, user-defined free_cb can compare caller mbuf with
>> >> >> >> >> >> mbuf in opaque struct, free_cb should free original mbuf
>> >> >> >> >> >> if they are not same, this corresponds to rte_pktmbuf_free
>> >> >> >> >> >> segmented mbufs case, otherwise, free_cb won't free original
>> >> >> >> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
>> >> >> >> >> >> to free it.
>> >> >> >> >> >> 
>> >> >> >> >> >> Here is pseduo code to show two kind of cases.
>> >> >> >> >> >> 
>> >> >> >> >> >> case 1. rte_pktmbuf_free segmented mbufs
>> >> >> >> >> >> 
>> >> >> >> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
>> >> >> >> >> >>                       &gso_ctx,
>> >> >> >> >> >>                       /* segmented mbuf */
>> >> >> >> >> >>                       (struct rte_mbuf **)&gso_mbufs,
>> >> >> >> >> >>                       MAX_GSO_MBUFS);
>> >> >> >> >> >
>> >> >> >> >> >I'm sorry but it is not very clear to me what operations are done by
>> >> >> >> >> >rte_gso_segment().
>> >> >> >> >> >
>> >> >> >> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
>> >> >> >> >> >which do not deal with external buffers. Am I missing something?
>> >> >> >> >> >
>> >> >> >> >> >Are you able to show the issue only with mbuf functions? It would
>> >> >> >> >> >be helpful to understand what does not work.
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >Thanks,
>> >> >> >> >> >Olivier
>> >> >> >> >> >
>> >> >> >> >> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
>> >> >> >> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
>> >> >> >> >
>> >> >> >> >I'm sill not sure to get your issue. Please, if you have a simple test
>> >> >> >> >case using only mbufs functions (without virtio, gso, ...), it would be
>> >> >> >> >very helpful because we will be sure that we are talking about the same
>> >> >> >> >thing. In case there is an issue, it can easily become a unit test.
>> >> >> >> 
>> >> >> >> Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
>> >> >> >> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
>> >> >> >> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
>> >> >> >> will call rte_eth_tx_burst to send them finally.
>> >> >> >>
>> >> >> >> >
>> >> >> >> >That said, I looked at vhost mbuf allocation and gso segmentation, and
>> >> >> >> >I found some strange things:
>> >> >> >> >
>> >> >> >> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
>> >> >> >> >   ext mbuf.
>> >> >> >> >
>> >> >> >> >   a/ The first one stores the shinfo struct in the mbuf, basically
>> >> >> >> >      like this:
>> >> >> >> >
>> >> >> >> >	pkt = rte_pktmbuf_alloc(mp);
>> >> >> >> >	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
>> >> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> >> >> >	shinfo->free_cb = virtio_dev_extbuf_free;
>> >> >> >> >	shinfo->fcb_opaque = buf;
>> >> >> >> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
>> >> >> >> >
>> >> >> >> >      I don't think it is a good idea, because there is no guarantee that
>> >> >> >> >      the mbuf won't be freed before the buffer. For instance, doing
>> >> >> >> >      this will probably fail:
>> >> >> >> >
>> >> >> >> >	pkt2 = rte_pktmbuf_alloc(mp);
>> >> >> >> >	rte_pktmbuf_attach(pkt2, pkt);
>> >> >> >> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
>> >> >> >> 
>> >> >> >> pkt is created by the application I can control, so I can control it where it will be freed, right?
>> >> >> >
>> >> >> >This example shows that mbufs allocated like this by the vhost
>> >> >> >driver are not constructed correctly. If an application attach a new
>> >> >> >packet (pkt2) to it and frees the original one (pkt), it may result in a
>> >> >> >memory corruption.
>> >> >> >
>> >> >> >Of course, to be tested and confirmed.
>> >> >> 
>> >> >> No, attach will increase refcnt of shinfo, free_cb only is called when  refcnt of shinfo is decreased to
>> >> >> 0, isn't it?
>> >> >
>> >> >When pkt will be freed, it will decrement the shinfo refcnt, and
>> >> >after it will be 1. So the buffer won't be freed. After that, the
>> >> >mbuf pkt will be detached, and will return to the mbuf pool. It means
>> >> >it can be reallocated, and the next user can overwrite shinfo which
>> >> >is still stored in the mbuf data.
>> >> 
>> >> I think this is an issue of DPDK itself, if external buffer in shinfo is freed, shinfo should be set to NULL, if user will
>> >> overwrite it, he/she should use the same way as a new external buffer is attached. 
>> >
>> >No, there is no issue in DPDK. The lifetime of shinfo should be at least
>> >the same the lifetime of the buffer. If shinfo is stored in initial mbuf
>> >(called "pkt" in the example above), the mbuf and shinfo can be freed while
>> >the buffer is still in use by another packet.
>> >
>> >> >I did a test to show it, see:
>> >> >http://git.droids-corp.org/?p=dpdk.git;a=commitdiff;h=a617494eeb01ff
>> >> >
>> >> >If you run the mbuf autotest, it segfaults.
>> >> 
>> >> I think your test is wrong, you're changing shinfo (which is being used) in wrong way, if free_bc is NULL, it will be invalid.
>> >
>> >I'm changing the data of a newly allocated mbuf, it is perfectly legal.
>> >I happens that it points the the shinfo that is still in use.
>> >
>> >
>> >> 
>> >> static inline void
>> >> rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
>> >>         rte_iova_t buf_iova, uint16_t buf_len,
>> >>         struct rte_mbuf_ext_shared_info *shinfo)
>> >> {
>> >>         /* mbuf should not be read-only */
>> >>         RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
>> >>         RTE_ASSERT(shinfo->free_cb != NULL);
>> >> 
>> >> For any shinfo operation, you should do it by rte_pktmbuf_attach_extbuf, you can't change it at will after that.
>> >> 
>> >> >
>> >> >
>> >> >> 
>> >> >> >
>> >> >> >> 
>> >> >> >> >
>> >> >> >> >      To do this properly, the mbuf refcnt should be increased, and
>> >> >> >> >      the mbuf should be freed in the callback. But I don't think it's
>> >> >> >> >      worth doing it, given the second path (b/) looks good to me.
>> >> >> >> >
>> >> >> >> >   b/ The second path stores the shinfo struct at the end of the
>> >> >> >> >      allocated buffer, like this:
>> >> >> >> >
>> >> >> >> >	pkt = rte_pktmbuf_alloc(mp);
>> >> >> >> >	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
>> >> >> >> >	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
>> >> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> >> >> >	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
>> >> >> >> >					      virtio_dev_extbuf_free, buf);
>> >> >> >> >
>> >> >> >> >      I think this is correct, because we have the guarantee that shinfo
>> >> >> >> >      exists as long as the buffer exists.
>> >> >> >> 
>> >> >> >> What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
>> >> >> >> free original mbuf which owns shinfo buffer.
>> >> >> >
>> >> >> >I don't get your question.
>> >> >> >
>> >> >> >I'm just saying that this code path looks correct, compared to
>> >> >> >the previous one.
>> >> >> 
>> >> >> I think you're challenging principle of external mbuf, that isn't the thing I address.
>> >> >
>> >> >I'm not challenging anything, I'm saying there is a bug in this code,
>> >> >and the unit test above tends to confirm it.
>> >> 
>> >> If it is bug, you can post a patch to fix it,  it isn't related with my patches. But in my opinion, you don't
>> >> use it in correct way, I don't think it is a bug.
>> >
>> >I'll submit a patch for this.
>> >
>> >The point is you are testing GSO on top of wrongly-constructed mbufs, so
>> >it would be safer for you to fix this before doing more tests.
>> >
>> >
>> >> >> >> >2/ in rte_gso_segment(), there is a loop like this:
>> >> >> >> >
>> >> >> >> >	while (pkt_seg) {
>> >> >> >> >		rte_mbuf_refcnt_update(pkt_seg, -1);
>> >> >> >> >		pkt_seg = pkt_seg->next;
>> >> >> >> >	}
>> >> >> >> >
>> >> >> >> >   You change it to take in account the refcnt for ext mbufs.
>> >> >> >> >
>> >> >> >> >   I may have missed something but I wonder why it is not simply:
>> >> >> >> >
>> >> >> >> >	rte_pktmbuf_free(pkt_seg);
>> >> >> >> >
>> >> >> >> >   It will decrease the proper refcnt, and free the mbufs if they
>> >> >> >> >   are not used.
>> >> >> >> 
>> >> >> >> Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
>> >> >> >> mbuf free will trigger freeing original mbuf (only free_cb can do this).
>> >> >> >
>> >> >> >rte_pktmbuf_free() will also decerase the refcnt, and free the resources
>> >> >> >when the refcnt reaches 0.
>> >> >> >
>> >> >> >It has some advantages compared to decrease the reference counter of all
>> >> >> >segments:
>> >> >> >
>> >> >> >- no need to iterate the segments, there is only one function call
>> >> >> >- no need to have a special case for ext mbufs like you did in your patch
>> >> >> >- it may be safer, in case some segments have a refcnt == 1, because
>> >> >> >  resources will be freed.
>> >> >> 
>> >> >> For external mbuf, attach only increases refcnt of shinfo, refcnt of mbuf won't be touched. For normal
>> >> >> mbuf, attach only increase refcnt of mbuf, no shinfo there, no refcnt of shinfo increased.
>> >> >
>> >> >I suppose rte_gso_segment() can take any mbuf type as input: standard
>> >> >mbuf, indirect mbuf, ext mbuf, or even a mbuf chaing containing segments of
>> >> >different types.
>> >> >
>> >> >For instance, if you pass a chain of 2 mbufs:
>> >> >- the first one is a direct mbuf containing the IP/TCP headers (orig_hdr)
>> >> >- the second on is a mbuf pointing to an ext buffer (orig_payload)
>> >> >
>> >> >I expect that the resulting mbuf after calling gso contains a list of mbufs
>> >> >like this:
>> >> >- a first segment containing the IP/TCP headers (new_hdr)
>> >> >- a payload segment pointing on the same ext buffer
>> >> >
>> >> >In theory, there is no reason that orig_hdr should be referenced by
>> >> >another new mbuf, because it only contains headers (no data). If that's
>> >> >the case, its refcnt is 1, and decreasing it to 0 without freeing it
>> >> >is a bug.
>> >> 
>> >> For this user scenario, orig_m is owner of external buffer, small segmented mbufs reference
>> >> it, you shouldn't free orig_m before all attached segmented mbufs are freed, isn't it?
>> >
>> >In this case, orig_hdr has to be freed because it is a direct mbuf (not
>> >shared).
>> >The buffer pointed by orig_payload will be freed when all newly created
>> >segments are freed.
>> >
>> >
>> >> >
>> >> >Anyway, there is maybe no issue in that case, but I was just suggesting
>> >> >that using rte_pktmbuf_free() is easier to read, and safer than manually
>> >> >decreasing the refcnt of each segment.
>> >> >
>> >> >
>> >> >> >> >Again, sorry if this is not the issue your are referring to, but
>> >> >> >> >in this case I really think that having a simple example code that
>> >> >> >> >shows the issue would help.
>> >> >> >> 
>> >> >> >> Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
>> >> >> >> unit test can show it.
>> >> >> >
>> >> >> >I don't see why. The PMDs and the libraries use the mbuf functions, why
>> >> >> >a unit test couldn't call the same functions?
>> >> >> >
>> >> >> >> Let me summarize it here again. For original mbuf, there are two cases freeing
>> >> >> >> it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
>> >> >> >> free_cb call which will free original large & extended mbuf.
>> >> >> >
>> >> >> >OK
>> >> >> >
>> >> >> >> Case two is PMD driver will call free against
>> >> >> >> original mbuf, that also will call free_cb to free attached extended buffer.
>> >> >> >
>> >> >> >OK
>> >> >> >
>> >> >> >And what makes that case 1 or case 2 is executed?
>> >> >> >
>> >> >> >> In case one free_cb must call
>> >> >> >> rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
>> >> >> >> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
>> >> >> >> must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
>> >> >> >
>> >> >> >I think there is no need to change the free_cb API. It should work like
>> >> >> >this:
>> >> >> >
>> >> >> >- virtio creates the original external mbuf (orig_m)
>> >> >> >- gso will create a new mbuf referencing the external buffer (new_m)
>> >> >> >
>> >> >> >At this point, the shinfo has a refcnt of 2. The large buffer will be
>> >> >> >freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
>> >> >> >whatever the order.
>> >> >> >
>> >> >> >Regards,
>> >> >> >Olivier
>> >> >> 
>> >> >> Oliver, the reason it works is I changed free_cb API, case 1 doesn't know orig_m, how you make it free orig_m in free_cb.
>> >> >> The intention I change free_cb is to let it know orig_m, I saw OVS DPDK ran out out buffers and orig_m isn't freed, that is why
>> >> >> I want to bring in this to fix the issue. Again, in case 1, nobody explicitly calls ret_pktmbuf_free(orig_m) except free_cb I changed.
>> >> >
>> >> >If nobody calls ret_pktmbuf_free(orig_m), it is a problem.
>> >> >The free_cb is to free the buffer, not the mbuf.
>> >> >
>> >> >To me, it should work like this:
>> >> >
>> >> >1- virtio creates a mbuf attached to the ext buffer (the shinfo placement
>> >> >   bug should be fixed)
>> >> >2- gso create mbufs that reference the the same ext buf (by attaching the
>> >> >   new mbuf)
>> >> >3- gso must free the original mbuf
>> >> 
>> >> This is impossible, segmented mbufs are referencing external buffer in original mbuf,
>> >> how do you free it? As I said rte_gso_segment has no way to free it, please tell me a way if
>> >> you know how to do this.
>> >
>> >As I said above, calling rte_mbuf_free(orig_m) will decrement the reference
>> >counters on all segments. The segments will be returned to the pool if the
>> >refcnt reaches 0.
>> >
>> >> 
>> >> >4- the PMD transmits the new mbufs, and frees them
>> >> >
>> >> >Whatever 3- or 4- is executed first, at the end we are sure that:
>> >> >- all mbufs will be returned to the pool
>> >> >- the linear buffer will be freed when the refcnt reaches 0.
>> >> >
>> >> >If this is still unclear, please, write a unit test like I did
>> >> >above to show your issue.
>> >> >
>> >> >Regards,
>> >> >Olivier
>> >> >
>> >> 
>> >> The issue is in "3- gso must free the original mbuf", rte_pktmbuf_free(segmented_mbus) can't do it,
>> >> rte_gso_segment is impossible to do it, only feasible point is free_cb, please let me know if you have
>> >> a better way to free original mbuf and don't impact on segmented mbufs in PMD. My point is you must
>> >> have a place to call rte_pktmbuf_free(rogin_m) explicitly, otherwise it is impossible  to return it to memory
>> >> pool, please point out  where it can be called in my user scenario. I don't care how it is done, I just care it can
>> >> fix my issue, please don't hesitate and send me a patch if you can, thanks a lot.
>> >
>> >Sorry, but I don't see how I can be clearer to what I explained
>> >in my previous answer.
>> >
>> >Please, *provide a simple test example* without gso/vhost, and I can help
>> >to make it work.
>> >
>> >
>> >Regards,
>> >Olivier
>> >
>> >
>> >> >
>> >> >
>> >> >> free_cb must handle case 1 and case 2 in the same code, for case 1, caller_m is segmented new_m, for case 2, caller_m is orig_m.
>> >> >> 
>> >> >> loop in rte_gso_segement is handling original mbuf (this mbuf is multi-mbuf and includes multiple mbufs which are linked by next
>> >> >> pointer), it isn't a problem at all.
>> >> >> 
>> >> >> Please show me code how you can fix my issue if you don't change free_cb, thank you.
>> >> >> 
>> >> >> struct shinfo_arg {
>> >> >>        void *buf;
>> >> >>        struct rte_mbuf *mbuf;
>> >> >> };
>> >> >> 
>> >> >> virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
>> >> >> {
>> >> >>        struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
>> >> >> 
>> >> >>        rte_free(arg->buf);
>> >> >>        if (caller_m != arg->mbuf)
>> >> >>                rte_pktmbuf_free(arg->mbuf);
>> >> >>        rte_free(arg);
>> >> >> }
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-10-10  1:49                           ` yang_y_yi
@ 2020-10-14 13:55                             ` Olivier Matz
  2020-10-15  2:52                               ` yang_y_yi
  0 siblings, 1 reply; 21+ messages in thread
From: Olivier Matz @ 2020-10-14 13:55 UTC (permalink / raw)
  To: yang_y_yi; +Cc: dev, jiayu.hu, thomas, yangyi01
Hi,
On Sat, Oct 10, 2020 at 09:49:35AM +0800, yang_y_yi wrote:
> Olivier, thank you so much for helping figure out this, it does work
> as the code you changed, so we can fix the issue without this patch
> series. My question is can it also work for normal mbuf or indirect
> mbuf? (I mean pkt is direct mbuf or indirect mbuf)
Yes, it works for any mbuf type for pkt (direct, indirect, ext).
Happy to see it solves your issue!
Regards,
Olivier
> At 2020-10-09 19:55:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >Hi,
> >
> >On Fri, Oct 09, 2020 at 05:51:23PM +0800, yang_y_yi wrote:
> >> Olivier, thank you so much for your reply, your patch post for vhost help me understand your concern better, I totally agree. For GSO case, let me show you a simple code to explain my issue.
> >> 
> >> 
> >> 
> >> 
> >> 
> >> struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> >> virtio_dev_extbuf_alloc(pkt, data_len)
> >> struct rte_mbuf * pkt_seg1 = rte_pktmbuf_alloc(indirect_pool);
> >> 
> >> 
> >> rte_pktmbuf_attach(pkt_seg1, pkt);
> >> rte_mbuf_ext_refcnt_update(pkt, -1);
> >> 
> >> struct rte_mbuf * pkt_seg2 = rte_pktmbuf_alloc(indirect_pool);
> >> rte_pktmbuf_attach(pkt_seg2, pkt);
> >> rte_mbuf_ext_refcnt_update(pkt, -1);
> >> struct rte_mbuf *pkt_segs[2] = {pkt_seg1, pkt_seg2};
> >> 
> >> rte_eth_tx_burst(dev->port_id, qid, pkt_segs, 2);
> >> 
> >> 
> >> Is it a simple test you expect? The issue here is nobody explicitly calls
> >> rte_pktmbuf_free(pkt), rte_pktmbuf_free(pkt_segX) in PMD driver won't free
> >> "pkt", Is it clear to you now?
> >
> >
> >Thank you for the small code. Yes, this is what I expected.
> >
> >The proper way to do this is something like this:
> >
> >	/* create a mbuf, and attach it to an external buffer */
> >	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> >	virtio_dev_extbuf_alloc(pkt, data_len)
> >
> >	/* create a new mbuf, attach it to the previous one: the resulting
> >	 * mbuf is also an "external mbuf" (is has the EXT_ATTACHED_MBUF
> >	 * flag, and its data is stored in the ext buffer.
> >	 * See an example here: https://www.droids-corp.org/~zer0/ext-mbuf.svg
> >	 */
> >	struct rte_mbuf *pkt_seg1 = rte_pktmbuf_alloc(indirect_pool);
> >	rte_pktmbuf_attach(pkt_seg1, pkt);
> >
> >	/* do the same another time */
> >	struct rte_mbuf *pkt_seg2 = rte_pktmbuf_alloc(indirect_pool);
> >	rte_pktmbuf_attach(pkt_seg2, pkt);
> >
> >	/* release the original pkt, we don't need it anymore */
> >	rte_pktmbuf_free(pkt);
> >
> >	/* send the new segments, they will be freed by the driver once
> >	 * Tx is done. When the last packet referencing the external buffer
> >	 * is freed, the free callback of the buffer will be invoked. */
> >	struct rte_mbuf *pkt_segs[2] = {pkt_seg1, pkt_seg2};
> >	rte_eth_tx_burst(dev->port_id, qid, pkt_segs, 2);
> >
> >Hope this is clearer now.
> >
> >Regards,
> >Olivier
> >
> >
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> At 2020-10-07 17:48:21, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >Hi,
> >> >
> >> >On Sun, Sep 27, 2020 at 01:55:21PM +0800, yang_y_yi wrote:
> >> >> Per GSO requirement, this is a must-have change, Jiayu, can you help review
> >> >> this series?
> >> >
> >> >I can't ack this patch until I have a simple and clear test case (only
> >> >with mbuf functions, without GSO or vhost) showing the issue we have
> >> >today with current.
> >> >
> >> >> Olivier, if you used the old interface, maybe you need to change your code to
> >> >> adapt this, I don't think we can have a better way to handle GSO case.
> >> >
> >> >Sorry, I don't get your point. What do I need to change in which code?
> >> >
> >> >(some more comments below)
> >> >
> >> >> At 2020-08-04 09:31:19, "yang_y_yi" <yang_y_yi@163.com> wrote:
> >> >> 
> >> >> At 2020-08-03 20:34:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >> >On Mon, Aug 03, 2020 at 05:42:13PM +0800, yang_y_yi wrote:
> >> >> >> At 2020-08-03 16:11:39, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >> >> >On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
> >> >> >> >> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >> >> >> >Hi,
> >> >> >> >> >
> >> >> >> >> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
> >> >> >> >> >> 
> >> >> >> >> >> 
> >> >> >> >> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >> >> >> >> >Hi,
> >> >> >> >> >> >
> >> >> >> >> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
> >> >> >> >> >> >> From: Yi Yang <yangyi01@inspur.com>
> >> >> >> >> >> >> 
> >> >> >> >> >> >> In GSO case, segmented mbufs are attached to original
> >> >> >> >> >> >> mbuf which can't be freed when it is external. The issue
> >> >> >> >> >> >> is free_cb doesn't know original mbuf and doesn't free
> >> >> >> >> >> >> it when refcnt of shinfo is 0.
> >> >> >> >> >> >> 
> >> >> >> >> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
> >> >> >> >> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
> >> >> >> >> >> >> cases should have different behaviors. free_cb won't
> >> >> >> >> >> >> explicitly call rte_pktmbuf_free to free original mbuf
> >> >> >> >> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
> >> >> >> >> >> >> has to call rte_pktmbuf_free to free original mbuf if it
> >> >> >> >> >> >> is freed by rte_pktmbuf_free segmented mbufs.
> >> >> >> >> >> >> 
> >> >> >> >> >> >> In order to fix this issue, free_cb interface has to been
> >> >> >> >> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
> >> >> >> >> >> >> mbuf pointer to free_cb, argument opaque can be defined
> >> >> >> >> >> >> as a custom struct by user, it can includes original mbuf
> >> >> >> >> >> >> pointer, user-defined free_cb can compare caller mbuf with
> >> >> >> >> >> >> mbuf in opaque struct, free_cb should free original mbuf
> >> >> >> >> >> >> if they are not same, this corresponds to rte_pktmbuf_free
> >> >> >> >> >> >> segmented mbufs case, otherwise, free_cb won't free original
> >> >> >> >> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
> >> >> >> >> >> >> to free it.
> >> >> >> >> >> >> 
> >> >> >> >> >> >> Here is pseduo code to show two kind of cases.
> >> >> >> >> >> >> 
> >> >> >> >> >> >> case 1. rte_pktmbuf_free segmented mbufs
> >> >> >> >> >> >> 
> >> >> >> >> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
> >> >> >> >> >> >>                       &gso_ctx,
> >> >> >> >> >> >>                       /* segmented mbuf */
> >> >> >> >> >> >>                       (struct rte_mbuf **)&gso_mbufs,
> >> >> >> >> >> >>                       MAX_GSO_MBUFS);
> >> >> >> >> >> >
> >> >> >> >> >> >I'm sorry but it is not very clear to me what operations are done by
> >> >> >> >> >> >rte_gso_segment().
> >> >> >> >> >> >
> >> >> >> >> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
> >> >> >> >> >> >which do not deal with external buffers. Am I missing something?
> >> >> >> >> >> >
> >> >> >> >> >> >Are you able to show the issue only with mbuf functions? It would
> >> >> >> >> >> >be helpful to understand what does not work.
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >Thanks,
> >> >> >> >> >> >Olivier
> >> >> >> >> >> >
> >> >> >> >> >> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
> >> >> >> >> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
> >> >> >> >> >
> >> >> >> >> >I'm sill not sure to get your issue. Please, if you have a simple test
> >> >> >> >> >case using only mbufs functions (without virtio, gso, ...), it would be
> >> >> >> >> >very helpful because we will be sure that we are talking about the same
> >> >> >> >> >thing. In case there is an issue, it can easily become a unit test.
> >> >> >> >> 
> >> >> >> >> Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
> >> >> >> >> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
> >> >> >> >> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
> >> >> >> >> will call rte_eth_tx_burst to send them finally.
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >That said, I looked at vhost mbuf allocation and gso segmentation, and
> >> >> >> >> >I found some strange things:
> >> >> >> >> >
> >> >> >> >> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
> >> >> >> >> >   ext mbuf.
> >> >> >> >> >
> >> >> >> >> >   a/ The first one stores the shinfo struct in the mbuf, basically
> >> >> >> >> >      like this:
> >> >> >> >> >
> >> >> >> >> >	pkt = rte_pktmbuf_alloc(mp);
> >> >> >> >> >	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> >> >> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >> >> >> >> >	shinfo->free_cb = virtio_dev_extbuf_free;
> >> >> >> >> >	shinfo->fcb_opaque = buf;
> >> >> >> >> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
> >> >> >> >> >
> >> >> >> >> >      I don't think it is a good idea, because there is no guarantee that
> >> >> >> >> >      the mbuf won't be freed before the buffer. For instance, doing
> >> >> >> >> >      this will probably fail:
> >> >> >> >> >
> >> >> >> >> >	pkt2 = rte_pktmbuf_alloc(mp);
> >> >> >> >> >	rte_pktmbuf_attach(pkt2, pkt);
> >> >> >> >> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
> >> >> >> >> 
> >> >> >> >> pkt is created by the application I can control, so I can control it where it will be freed, right?
> >> >> >> >
> >> >> >> >This example shows that mbufs allocated like this by the vhost
> >> >> >> >driver are not constructed correctly. If an application attach a new
> >> >> >> >packet (pkt2) to it and frees the original one (pkt), it may result in a
> >> >> >> >memory corruption.
> >> >> >> >
> >> >> >> >Of course, to be tested and confirmed.
> >> >> >> 
> >> >> >> No, attach will increase refcnt of shinfo, free_cb only is called when  refcnt of shinfo is decreased to
> >> >> >> 0, isn't it?
> >> >> >
> >> >> >When pkt will be freed, it will decrement the shinfo refcnt, and
> >> >> >after it will be 1. So the buffer won't be freed. After that, the
> >> >> >mbuf pkt will be detached, and will return to the mbuf pool. It means
> >> >> >it can be reallocated, and the next user can overwrite shinfo which
> >> >> >is still stored in the mbuf data.
> >> >> 
> >> >> I think this is an issue of DPDK itself, if external buffer in shinfo is freed, shinfo should be set to NULL, if user will
> >> >> overwrite it, he/she should use the same way as a new external buffer is attached. 
> >> >
> >> >No, there is no issue in DPDK. The lifetime of shinfo should be at least
> >> >the same the lifetime of the buffer. If shinfo is stored in initial mbuf
> >> >(called "pkt" in the example above), the mbuf and shinfo can be freed while
> >> >the buffer is still in use by another packet.
> >> >
> >> >> >I did a test to show it, see:
> >> >> >http://git.droids-corp.org/?p=dpdk.git;a=commitdiff;h=a617494eeb01ff
> >> >> >
> >> >> >If you run the mbuf autotest, it segfaults.
> >> >> 
> >> >> I think your test is wrong, you're changing shinfo (which is being used) in wrong way, if free_bc is NULL, it will be invalid.
> >> >
> >> >I'm changing the data of a newly allocated mbuf, it is perfectly legal.
> >> >I happens that it points the the shinfo that is still in use.
> >> >
> >> >
> >> >> 
> >> >> static inline void
> >> >> rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> >> >>         rte_iova_t buf_iova, uint16_t buf_len,
> >> >>         struct rte_mbuf_ext_shared_info *shinfo)
> >> >> {
> >> >>         /* mbuf should not be read-only */
> >> >>         RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
> >> >>         RTE_ASSERT(shinfo->free_cb != NULL);
> >> >> 
> >> >> For any shinfo operation, you should do it by rte_pktmbuf_attach_extbuf, you can't change it at will after that.
> >> >> 
> >> >> >
> >> >> >
> >> >> >> 
> >> >> >> >
> >> >> >> >> 
> >> >> >> >> >
> >> >> >> >> >      To do this properly, the mbuf refcnt should be increased, and
> >> >> >> >> >      the mbuf should be freed in the callback. But I don't think it's
> >> >> >> >> >      worth doing it, given the second path (b/) looks good to me.
> >> >> >> >> >
> >> >> >> >> >   b/ The second path stores the shinfo struct at the end of the
> >> >> >> >> >      allocated buffer, like this:
> >> >> >> >> >
> >> >> >> >> >	pkt = rte_pktmbuf_alloc(mp);
> >> >> >> >> >	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
> >> >> >> >> >	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> >> >> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >> >> >> >> >	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> >> >> >> >> >					      virtio_dev_extbuf_free, buf);
> >> >> >> >> >
> >> >> >> >> >      I think this is correct, because we have the guarantee that shinfo
> >> >> >> >> >      exists as long as the buffer exists.
> >> >> >> >> 
> >> >> >> >> What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
> >> >> >> >> free original mbuf which owns shinfo buffer.
> >> >> >> >
> >> >> >> >I don't get your question.
> >> >> >> >
> >> >> >> >I'm just saying that this code path looks correct, compared to
> >> >> >> >the previous one.
> >> >> >> 
> >> >> >> I think you're challenging principle of external mbuf, that isn't the thing I address.
> >> >> >
> >> >> >I'm not challenging anything, I'm saying there is a bug in this code,
> >> >> >and the unit test above tends to confirm it.
> >> >> 
> >> >> If it is bug, you can post a patch to fix it,  it isn't related with my patches. But in my opinion, you don't
> >> >> use it in correct way, I don't think it is a bug.
> >> >
> >> >I'll submit a patch for this.
> >> >
> >> >The point is you are testing GSO on top of wrongly-constructed mbufs, so
> >> >it would be safer for you to fix this before doing more tests.
> >> >
> >> >
> >> >> >> >> >2/ in rte_gso_segment(), there is a loop like this:
> >> >> >> >> >
> >> >> >> >> >	while (pkt_seg) {
> >> >> >> >> >		rte_mbuf_refcnt_update(pkt_seg, -1);
> >> >> >> >> >		pkt_seg = pkt_seg->next;
> >> >> >> >> >	}
> >> >> >> >> >
> >> >> >> >> >   You change it to take in account the refcnt for ext mbufs.
> >> >> >> >> >
> >> >> >> >> >   I may have missed something but I wonder why it is not simply:
> >> >> >> >> >
> >> >> >> >> >	rte_pktmbuf_free(pkt_seg);
> >> >> >> >> >
> >> >> >> >> >   It will decrease the proper refcnt, and free the mbufs if they
> >> >> >> >> >   are not used.
> >> >> >> >> 
> >> >> >> >> Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
> >> >> >> >> mbuf free will trigger freeing original mbuf (only free_cb can do this).
> >> >> >> >
> >> >> >> >rte_pktmbuf_free() will also decerase the refcnt, and free the resources
> >> >> >> >when the refcnt reaches 0.
> >> >> >> >
> >> >> >> >It has some advantages compared to decrease the reference counter of all
> >> >> >> >segments:
> >> >> >> >
> >> >> >> >- no need to iterate the segments, there is only one function call
> >> >> >> >- no need to have a special case for ext mbufs like you did in your patch
> >> >> >> >- it may be safer, in case some segments have a refcnt == 1, because
> >> >> >> >  resources will be freed.
> >> >> >> 
> >> >> >> For external mbuf, attach only increases refcnt of shinfo, refcnt of mbuf won't be touched. For normal
> >> >> >> mbuf, attach only increase refcnt of mbuf, no shinfo there, no refcnt of shinfo increased.
> >> >> >
> >> >> >I suppose rte_gso_segment() can take any mbuf type as input: standard
> >> >> >mbuf, indirect mbuf, ext mbuf, or even a mbuf chaing containing segments of
> >> >> >different types.
> >> >> >
> >> >> >For instance, if you pass a chain of 2 mbufs:
> >> >> >- the first one is a direct mbuf containing the IP/TCP headers (orig_hdr)
> >> >> >- the second on is a mbuf pointing to an ext buffer (orig_payload)
> >> >> >
> >> >> >I expect that the resulting mbuf after calling gso contains a list of mbufs
> >> >> >like this:
> >> >> >- a first segment containing the IP/TCP headers (new_hdr)
> >> >> >- a payload segment pointing on the same ext buffer
> >> >> >
> >> >> >In theory, there is no reason that orig_hdr should be referenced by
> >> >> >another new mbuf, because it only contains headers (no data). If that's
> >> >> >the case, its refcnt is 1, and decreasing it to 0 without freeing it
> >> >> >is a bug.
> >> >> 
> >> >> For this user scenario, orig_m is owner of external buffer, small segmented mbufs reference
> >> >> it, you shouldn't free orig_m before all attached segmented mbufs are freed, isn't it?
> >> >
> >> >In this case, orig_hdr has to be freed because it is a direct mbuf (not
> >> >shared).
> >> >The buffer pointed by orig_payload will be freed when all newly created
> >> >segments are freed.
> >> >
> >> >
> >> >> >
> >> >> >Anyway, there is maybe no issue in that case, but I was just suggesting
> >> >> >that using rte_pktmbuf_free() is easier to read, and safer than manually
> >> >> >decreasing the refcnt of each segment.
> >> >> >
> >> >> >
> >> >> >> >> >Again, sorry if this is not the issue your are referring to, but
> >> >> >> >> >in this case I really think that having a simple example code that
> >> >> >> >> >shows the issue would help.
> >> >> >> >> 
> >> >> >> >> Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
> >> >> >> >> unit test can show it.
> >> >> >> >
> >> >> >> >I don't see why. The PMDs and the libraries use the mbuf functions, why
> >> >> >> >a unit test couldn't call the same functions?
> >> >> >> >
> >> >> >> >> Let me summarize it here again. For original mbuf, there are two cases freeing
> >> >> >> >> it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
> >> >> >> >> free_cb call which will free original large & extended mbuf.
> >> >> >> >
> >> >> >> >OK
> >> >> >> >
> >> >> >> >> Case two is PMD driver will call free against
> >> >> >> >> original mbuf, that also will call free_cb to free attached extended buffer.
> >> >> >> >
> >> >> >> >OK
> >> >> >> >
> >> >> >> >And what makes that case 1 or case 2 is executed?
> >> >> >> >
> >> >> >> >> In case one free_cb must call
> >> >> >> >> rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
> >> >> >> >> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
> >> >> >> >> must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
> >> >> >> >
> >> >> >> >I think there is no need to change the free_cb API. It should work like
> >> >> >> >this:
> >> >> >> >
> >> >> >> >- virtio creates the original external mbuf (orig_m)
> >> >> >> >- gso will create a new mbuf referencing the external buffer (new_m)
> >> >> >> >
> >> >> >> >At this point, the shinfo has a refcnt of 2. The large buffer will be
> >> >> >> >freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
> >> >> >> >whatever the order.
> >> >> >> >
> >> >> >> >Regards,
> >> >> >> >Olivier
> >> >> >> 
> >> >> >> Oliver, the reason it works is I changed free_cb API, case 1 doesn't know orig_m, how you make it free orig_m in free_cb.
> >> >> >> The intention I change free_cb is to let it know orig_m, I saw OVS DPDK ran out out buffers and orig_m isn't freed, that is why
> >> >> >> I want to bring in this to fix the issue. Again, in case 1, nobody explicitly calls ret_pktmbuf_free(orig_m) except free_cb I changed.
> >> >> >
> >> >> >If nobody calls ret_pktmbuf_free(orig_m), it is a problem.
> >> >> >The free_cb is to free the buffer, not the mbuf.
> >> >> >
> >> >> >To me, it should work like this:
> >> >> >
> >> >> >1- virtio creates a mbuf attached to the ext buffer (the shinfo placement
> >> >> >   bug should be fixed)
> >> >> >2- gso create mbufs that reference the the same ext buf (by attaching the
> >> >> >   new mbuf)
> >> >> >3- gso must free the original mbuf
> >> >> 
> >> >> This is impossible, segmented mbufs are referencing external buffer in original mbuf,
> >> >> how do you free it? As I said rte_gso_segment has no way to free it, please tell me a way if
> >> >> you know how to do this.
> >> >
> >> >As I said above, calling rte_mbuf_free(orig_m) will decrement the reference
> >> >counters on all segments. The segments will be returned to the pool if the
> >> >refcnt reaches 0.
> >> >
> >> >> 
> >> >> >4- the PMD transmits the new mbufs, and frees them
> >> >> >
> >> >> >Whatever 3- or 4- is executed first, at the end we are sure that:
> >> >> >- all mbufs will be returned to the pool
> >> >> >- the linear buffer will be freed when the refcnt reaches 0.
> >> >> >
> >> >> >If this is still unclear, please, write a unit test like I did
> >> >> >above to show your issue.
> >> >> >
> >> >> >Regards,
> >> >> >Olivier
> >> >> >
> >> >> 
> >> >> The issue is in "3- gso must free the original mbuf", rte_pktmbuf_free(segmented_mbus) can't do it,
> >> >> rte_gso_segment is impossible to do it, only feasible point is free_cb, please let me know if you have
> >> >> a better way to free original mbuf and don't impact on segmented mbufs in PMD. My point is you must
> >> >> have a place to call rte_pktmbuf_free(rogin_m) explicitly, otherwise it is impossible  to return it to memory
> >> >> pool, please point out  where it can be called in my user scenario. I don't care how it is done, I just care it can
> >> >> fix my issue, please don't hesitate and send me a patch if you can, thanks a lot.
> >> >
> >> >Sorry, but I don't see how I can be clearer to what I explained
> >> >in my previous answer.
> >> >
> >> >Please, *provide a simple test example* without gso/vhost, and I can help
> >> >to make it work.
> >> >
> >> >
> >> >Regards,
> >> >Olivier
> >> >
> >> >
> >> >> >
> >> >> >
> >> >> >> free_cb must handle case 1 and case 2 in the same code, for case 1, caller_m is segmented new_m, for case 2, caller_m is orig_m.
> >> >> >> 
> >> >> >> loop in rte_gso_segement is handling original mbuf (this mbuf is multi-mbuf and includes multiple mbufs which are linked by next
> >> >> >> pointer), it isn't a problem at all.
> >> >> >> 
> >> >> >> Please show me code how you can fix my issue if you don't change free_cb, thank you.
> >> >> >> 
> >> >> >> struct shinfo_arg {
> >> >> >>        void *buf;
> >> >> >>        struct rte_mbuf *mbuf;
> >> >> >> };
> >> >> >> 
> >> >> >> virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
> >> >> >> {
> >> >> >>        struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
> >> >> >> 
> >> >> >>        rte_free(arg->buf);
> >> >> >>        if (caller_m != arg->mbuf)
> >> >> >>                rte_pktmbuf_free(arg->mbuf);
> >> >> >>        rte_free(arg);
> >> >> >> }
^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
  2020-10-14 13:55                             ` Olivier Matz
@ 2020-10-15  2:52                               ` yang_y_yi
  0 siblings, 0 replies; 21+ messages in thread
From: yang_y_yi @ 2020-10-15  2:52 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, jiayu.hu, thomas, yangyi01
Got it, thanks a lot.
At 2020-10-14 21:55:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>Hi,
>
>On Sat, Oct 10, 2020 at 09:49:35AM +0800, yang_y_yi wrote:
>> Olivier, thank you so much for helping figure out this, it does work
>> as the code you changed, so we can fix the issue without this patch
>> series. My question is can it also work for normal mbuf or indirect
>> mbuf? (I mean pkt is direct mbuf or indirect mbuf)
>
>Yes, it works for any mbuf type for pkt (direct, indirect, ext).
>
>Happy to see it solves your issue!
>
>Regards,
>Olivier
>
>
>
>> At 2020-10-09 19:55:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >Hi,
>> >
>> >On Fri, Oct 09, 2020 at 05:51:23PM +0800, yang_y_yi wrote:
>> >> Olivier, thank you so much for your reply, your patch post for vhost help me understand your concern better, I totally agree. For GSO case, let me show you a simple code to explain my issue.
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
>> >> virtio_dev_extbuf_alloc(pkt, data_len)
>> >> struct rte_mbuf * pkt_seg1 = rte_pktmbuf_alloc(indirect_pool);
>> >> 
>> >> 
>> >> rte_pktmbuf_attach(pkt_seg1, pkt);
>> >> rte_mbuf_ext_refcnt_update(pkt, -1);
>> >> 
>> >> struct rte_mbuf * pkt_seg2 = rte_pktmbuf_alloc(indirect_pool);
>> >> rte_pktmbuf_attach(pkt_seg2, pkt);
>> >> rte_mbuf_ext_refcnt_update(pkt, -1);
>> >> struct rte_mbuf *pkt_segs[2] = {pkt_seg1, pkt_seg2};
>> >> 
>> >> rte_eth_tx_burst(dev->port_id, qid, pkt_segs, 2);
>> >> 
>> >> 
>> >> Is it a simple test you expect? The issue here is nobody explicitly calls
>> >> rte_pktmbuf_free(pkt), rte_pktmbuf_free(pkt_segX) in PMD driver won't free
>> >> "pkt", Is it clear to you now?
>> >
>> >
>> >Thank you for the small code. Yes, this is what I expected.
>> >
>> >The proper way to do this is something like this:
>> >
>> >	/* create a mbuf, and attach it to an external buffer */
>> >	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
>> >	virtio_dev_extbuf_alloc(pkt, data_len)
>> >
>> >	/* create a new mbuf, attach it to the previous one: the resulting
>> >	 * mbuf is also an "external mbuf" (is has the EXT_ATTACHED_MBUF
>> >	 * flag, and its data is stored in the ext buffer.
>> >	 * See an example here: https://www.droids-corp.org/~zer0/ext-mbuf.svg
>> >	 */
>> >	struct rte_mbuf *pkt_seg1 = rte_pktmbuf_alloc(indirect_pool);
>> >	rte_pktmbuf_attach(pkt_seg1, pkt);
>> >
>> >	/* do the same another time */
>> >	struct rte_mbuf *pkt_seg2 = rte_pktmbuf_alloc(indirect_pool);
>> >	rte_pktmbuf_attach(pkt_seg2, pkt);
>> >
>> >	/* release the original pkt, we don't need it anymore */
>> >	rte_pktmbuf_free(pkt);
>> >
>> >	/* send the new segments, they will be freed by the driver once
>> >	 * Tx is done. When the last packet referencing the external buffer
>> >	 * is freed, the free callback of the buffer will be invoked. */
>> >	struct rte_mbuf *pkt_segs[2] = {pkt_seg1, pkt_seg2};
>> >	rte_eth_tx_burst(dev->port_id, qid, pkt_segs, 2);
>> >
>> >Hope this is clearer now.
>> >
>> >Regards,
>> >Olivier
>> >
>> >
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> At 2020-10-07 17:48:21, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >Hi,
>> >> >
>> >> >On Sun, Sep 27, 2020 at 01:55:21PM +0800, yang_y_yi wrote:
>> >> >> Per GSO requirement, this is a must-have change, Jiayu, can you help review
>> >> >> this series?
>> >> >
>> >> >I can't ack this patch until I have a simple and clear test case (only
>> >> >with mbuf functions, without GSO or vhost) showing the issue we have
>> >> >today with current.
>> >> >
>> >> >> Olivier, if you used the old interface, maybe you need to change your code to
>> >> >> adapt this, I don't think we can have a better way to handle GSO case.
>> >> >
>> >> >Sorry, I don't get your point. What do I need to change in which code?
>> >> >
>> >> >(some more comments below)
>> >> >
>> >> >> At 2020-08-04 09:31:19, "yang_y_yi" <yang_y_yi@163.com> wrote:
>> >> >> 
>> >> >> At 2020-08-03 20:34:25, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >> >On Mon, Aug 03, 2020 at 05:42:13PM +0800, yang_y_yi wrote:
>> >> >> >> At 2020-08-03 16:11:39, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >> >> >On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
>> >> >> >> >> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >> >> >> >Hi,
>> >> >> >> >> >
>> >> >> >> >> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
>> >> >> >> >> >> 
>> >> >> >> >> >> 
>> >> >> >> >> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
>> >> >> >> >> >> >Hi,
>> >> >> >> >> >> >
>> >> >> >> >> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
>> >> >> >> >> >> >> From: Yi Yang <yangyi01@inspur.com>
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> In GSO case, segmented mbufs are attached to original
>> >> >> >> >> >> >> mbuf which can't be freed when it is external. The issue
>> >> >> >> >> >> >> is free_cb doesn't know original mbuf and doesn't free
>> >> >> >> >> >> >> it when refcnt of shinfo is 0.
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
>> >> >> >> >> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
>> >> >> >> >> >> >> cases should have different behaviors. free_cb won't
>> >> >> >> >> >> >> explicitly call rte_pktmbuf_free to free original mbuf
>> >> >> >> >> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
>> >> >> >> >> >> >> has to call rte_pktmbuf_free to free original mbuf if it
>> >> >> >> >> >> >> is freed by rte_pktmbuf_free segmented mbufs.
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> In order to fix this issue, free_cb interface has to been
>> >> >> >> >> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
>> >> >> >> >> >> >> mbuf pointer to free_cb, argument opaque can be defined
>> >> >> >> >> >> >> as a custom struct by user, it can includes original mbuf
>> >> >> >> >> >> >> pointer, user-defined free_cb can compare caller mbuf with
>> >> >> >> >> >> >> mbuf in opaque struct, free_cb should free original mbuf
>> >> >> >> >> >> >> if they are not same, this corresponds to rte_pktmbuf_free
>> >> >> >> >> >> >> segmented mbufs case, otherwise, free_cb won't free original
>> >> >> >> >> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
>> >> >> >> >> >> >> to free it.
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> Here is pseduo code to show two kind of cases.
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> case 1. rte_pktmbuf_free segmented mbufs
>> >> >> >> >> >> >> 
>> >> >> >> >> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
>> >> >> >> >> >> >>                       &gso_ctx,
>> >> >> >> >> >> >>                       /* segmented mbuf */
>> >> >> >> >> >> >>                       (struct rte_mbuf **)&gso_mbufs,
>> >> >> >> >> >> >>                       MAX_GSO_MBUFS);
>> >> >> >> >> >> >
>> >> >> >> >> >> >I'm sorry but it is not very clear to me what operations are done by
>> >> >> >> >> >> >rte_gso_segment().
>> >> >> >> >> >> >
>> >> >> >> >> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
>> >> >> >> >> >> >which do not deal with external buffers. Am I missing something?
>> >> >> >> >> >> >
>> >> >> >> >> >> >Are you able to show the issue only with mbuf functions? It would
>> >> >> >> >> >> >be helpful to understand what does not work.
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> >Thanks,
>> >> >> >> >> >> >Olivier
>> >> >> >> >> >> >
>> >> >> >> >> >> Oliver, thank you for comment, let me show you why it doesn't work for my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets whose size is about 64K because we enabled TSO & UFO, these large packets use rte_mbufs allocated by DPDK virtio_net function 
>> >> >> >> >> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Please refer to [PATCH V1 3/3], I changed free_cb as below, these packets use the same allocate function and the same free_cb no matter they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP fragment offload, so OVS DPDK has to do it by software, for UDP case, the original rte_mbuf only can be freed by segmented rte_mbufs which are output packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by free_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement "if (caller_m != arg->mbuf)" is true for this case, this has no problem, but for TCP case, the original mbuf is delivered to rte_eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD driver will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be called, that means original_rte_mbuf will be freed twice, you know what will happen, this is just the issue I'm fixing. I bring in caller_m argument, it can help work around this because caller_m is arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is false, you can't fix it without the change this patch series did.
>> >> >> >> >> >
>> >> >> >> >> >I'm sill not sure to get your issue. Please, if you have a simple test
>> >> >> >> >> >case using only mbufs functions (without virtio, gso, ...), it would be
>> >> >> >> >> >very helpful because we will be sure that we are talking about the same
>> >> >> >> >> >thing. In case there is an issue, it can easily become a unit test.
>> >> >> >> >> 
>> >> >> >> >> Oliver, I think you don't get the point, free operation can't be controlled by the application itself, 
>> >> >> >> >> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown pseudo code,
>> >> >> >> >> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't send them, the application
>> >> >> >> >> will call rte_eth_tx_burst to send them finally.
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >That said, I looked at vhost mbuf allocation and gso segmentation, and
>> >> >> >> >> >I found some strange things:
>> >> >> >> >> >
>> >> >> >> >> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
>> >> >> >> >> >   ext mbuf.
>> >> >> >> >> >
>> >> >> >> >> >   a/ The first one stores the shinfo struct in the mbuf, basically
>> >> >> >> >> >      like this:
>> >> >> >> >> >
>> >> >> >> >> >	pkt = rte_pktmbuf_alloc(mp);
>> >> >> >> >> >	shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
>> >> >> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> >> >> >> >	shinfo->free_cb = virtio_dev_extbuf_free;
>> >> >> >> >> >	shinfo->fcb_opaque = buf;
>> >> >> >> >> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
>> >> >> >> >> >
>> >> >> >> >> >      I don't think it is a good idea, because there is no guarantee that
>> >> >> >> >> >      the mbuf won't be freed before the buffer. For instance, doing
>> >> >> >> >> >      this will probably fail:
>> >> >> >> >> >
>> >> >> >> >> >	pkt2 = rte_pktmbuf_alloc(mp);
>> >> >> >> >> >	rte_pktmbuf_attach(pkt2, pkt);
>> >> >> >> >> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
>> >> >> >> >> 
>> >> >> >> >> pkt is created by the application I can control, so I can control it where it will be freed, right?
>> >> >> >> >
>> >> >> >> >This example shows that mbufs allocated like this by the vhost
>> >> >> >> >driver are not constructed correctly. If an application attach a new
>> >> >> >> >packet (pkt2) to it and frees the original one (pkt), it may result in a
>> >> >> >> >memory corruption.
>> >> >> >> >
>> >> >> >> >Of course, to be tested and confirmed.
>> >> >> >> 
>> >> >> >> No, attach will increase refcnt of shinfo, free_cb only is called when  refcnt of shinfo is decreased to
>> >> >> >> 0, isn't it?
>> >> >> >
>> >> >> >When pkt will be freed, it will decrement the shinfo refcnt, and
>> >> >> >after it will be 1. So the buffer won't be freed. After that, the
>> >> >> >mbuf pkt will be detached, and will return to the mbuf pool. It means
>> >> >> >it can be reallocated, and the next user can overwrite shinfo which
>> >> >> >is still stored in the mbuf data.
>> >> >> 
>> >> >> I think this is an issue of DPDK itself, if external buffer in shinfo is freed, shinfo should be set to NULL, if user will
>> >> >> overwrite it, he/she should use the same way as a new external buffer is attached. 
>> >> >
>> >> >No, there is no issue in DPDK. The lifetime of shinfo should be at least
>> >> >the same the lifetime of the buffer. If shinfo is stored in initial mbuf
>> >> >(called "pkt" in the example above), the mbuf and shinfo can be freed while
>> >> >the buffer is still in use by another packet.
>> >> >
>> >> >> >I did a test to show it, see:
>> >> >> >http://git.droids-corp.org/?p=dpdk.git;a=commitdiff;h=a617494eeb01ff
>> >> >> >
>> >> >> >If you run the mbuf autotest, it segfaults.
>> >> >> 
>> >> >> I think your test is wrong, you're changing shinfo (which is being used) in wrong way, if free_bc is NULL, it will be invalid.
>> >> >
>> >> >I'm changing the data of a newly allocated mbuf, it is perfectly legal.
>> >> >I happens that it points the the shinfo that is still in use.
>> >> >
>> >> >
>> >> >> 
>> >> >> static inline void
>> >> >> rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
>> >> >>         rte_iova_t buf_iova, uint16_t buf_len,
>> >> >>         struct rte_mbuf_ext_shared_info *shinfo)
>> >> >> {
>> >> >>         /* mbuf should not be read-only */
>> >> >>         RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
>> >> >>         RTE_ASSERT(shinfo->free_cb != NULL);
>> >> >> 
>> >> >> For any shinfo operation, you should do it by rte_pktmbuf_attach_extbuf, you can't change it at will after that.
>> >> >> 
>> >> >> >
>> >> >> >
>> >> >> >> 
>> >> >> >> >
>> >> >> >> >> 
>> >> >> >> >> >
>> >> >> >> >> >      To do this properly, the mbuf refcnt should be increased, and
>> >> >> >> >> >      the mbuf should be freed in the callback. But I don't think it's
>> >> >> >> >> >      worth doing it, given the second path (b/) looks good to me.
>> >> >> >> >> >
>> >> >> >> >> >   b/ The second path stores the shinfo struct at the end of the
>> >> >> >> >> >      allocated buffer, like this:
>> >> >> >> >> >
>> >> >> >> >> >	pkt = rte_pktmbuf_alloc(mp);
>> >> >> >> >> >	buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
>> >> >> >> >> >	buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
>> >> >> >> >> >	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> >> >> >> >	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
>> >> >> >> >> >					      virtio_dev_extbuf_free, buf);
>> >> >> >> >> >
>> >> >> >> >> >      I think this is correct, because we have the guarantee that shinfo
>> >> >> >> >> >      exists as long as the buffer exists.
>> >> >> >> >> 
>> >> >> >> >> What buffer does the allocated buffer you're saying here? The issue we're discussing how we can
>> >> >> >> >> free original mbuf which owns shinfo buffer.
>> >> >> >> >
>> >> >> >> >I don't get your question.
>> >> >> >> >
>> >> >> >> >I'm just saying that this code path looks correct, compared to
>> >> >> >> >the previous one.
>> >> >> >> 
>> >> >> >> I think you're challenging principle of external mbuf, that isn't the thing I address.
>> >> >> >
>> >> >> >I'm not challenging anything, I'm saying there is a bug in this code,
>> >> >> >and the unit test above tends to confirm it.
>> >> >> 
>> >> >> If it is bug, you can post a patch to fix it,  it isn't related with my patches. But in my opinion, you don't
>> >> >> use it in correct way, I don't think it is a bug.
>> >> >
>> >> >I'll submit a patch for this.
>> >> >
>> >> >The point is you are testing GSO on top of wrongly-constructed mbufs, so
>> >> >it would be safer for you to fix this before doing more tests.
>> >> >
>> >> >
>> >> >> >> >> >2/ in rte_gso_segment(), there is a loop like this:
>> >> >> >> >> >
>> >> >> >> >> >	while (pkt_seg) {
>> >> >> >> >> >		rte_mbuf_refcnt_update(pkt_seg, -1);
>> >> >> >> >> >		pkt_seg = pkt_seg->next;
>> >> >> >> >> >	}
>> >> >> >> >> >
>> >> >> >> >> >   You change it to take in account the refcnt for ext mbufs.
>> >> >> >> >> >
>> >> >> >> >> >   I may have missed something but I wonder why it is not simply:
>> >> >> >> >> >
>> >> >> >> >> >	rte_pktmbuf_free(pkt_seg);
>> >> >> >> >> >
>> >> >> >> >> >   It will decrease the proper refcnt, and free the mbufs if they
>> >> >> >> >> >   are not used.
>> >> >> >> >> 
>> >> >> >> >> Again, rte_gso_segment just decreases refcnt by one, this will ensure the last segmented 
>> >> >> >> >> mbuf free will trigger freeing original mbuf (only free_cb can do this).
>> >> >> >> >
>> >> >> >> >rte_pktmbuf_free() will also decerase the refcnt, and free the resources
>> >> >> >> >when the refcnt reaches 0.
>> >> >> >> >
>> >> >> >> >It has some advantages compared to decrease the reference counter of all
>> >> >> >> >segments:
>> >> >> >> >
>> >> >> >> >- no need to iterate the segments, there is only one function call
>> >> >> >> >- no need to have a special case for ext mbufs like you did in your patch
>> >> >> >> >- it may be safer, in case some segments have a refcnt == 1, because
>> >> >> >> >  resources will be freed.
>> >> >> >> 
>> >> >> >> For external mbuf, attach only increases refcnt of shinfo, refcnt of mbuf won't be touched. For normal
>> >> >> >> mbuf, attach only increase refcnt of mbuf, no shinfo there, no refcnt of shinfo increased.
>> >> >> >
>> >> >> >I suppose rte_gso_segment() can take any mbuf type as input: standard
>> >> >> >mbuf, indirect mbuf, ext mbuf, or even a mbuf chaing containing segments of
>> >> >> >different types.
>> >> >> >
>> >> >> >For instance, if you pass a chain of 2 mbufs:
>> >> >> >- the first one is a direct mbuf containing the IP/TCP headers (orig_hdr)
>> >> >> >- the second on is a mbuf pointing to an ext buffer (orig_payload)
>> >> >> >
>> >> >> >I expect that the resulting mbuf after calling gso contains a list of mbufs
>> >> >> >like this:
>> >> >> >- a first segment containing the IP/TCP headers (new_hdr)
>> >> >> >- a payload segment pointing on the same ext buffer
>> >> >> >
>> >> >> >In theory, there is no reason that orig_hdr should be referenced by
>> >> >> >another new mbuf, because it only contains headers (no data). If that's
>> >> >> >the case, its refcnt is 1, and decreasing it to 0 without freeing it
>> >> >> >is a bug.
>> >> >> 
>> >> >> For this user scenario, orig_m is owner of external buffer, small segmented mbufs reference
>> >> >> it, you shouldn't free orig_m before all attached segmented mbufs are freed, isn't it?
>> >> >
>> >> >In this case, orig_hdr has to be freed because it is a direct mbuf (not
>> >> >shared).
>> >> >The buffer pointed by orig_payload will be freed when all newly created
>> >> >segments are freed.
>> >> >
>> >> >
>> >> >> >
>> >> >> >Anyway, there is maybe no issue in that case, but I was just suggesting
>> >> >> >that using rte_pktmbuf_free() is easier to read, and safer than manually
>> >> >> >decreasing the refcnt of each segment.
>> >> >> >
>> >> >> >
>> >> >> >> >> >Again, sorry if this is not the issue your are referring to, but
>> >> >> >> >> >in this case I really think that having a simple example code that
>> >> >> >> >> >shows the issue would help.
>> >> >> >> >> 
>> >> >> >> >> Oliver, my statement in the patch I sent out has pseudo code to show this.  I don't think a simple
>> >> >> >> >> unit test can show it.
>> >> >> >> >
>> >> >> >> >I don't see why. The PMDs and the libraries use the mbuf functions, why
>> >> >> >> >a unit test couldn't call the same functions?
>> >> >> >> >
>> >> >> >> >> Let me summarize it here again. For original mbuf, there are two cases freeing
>> >> >> >> >> it, case one is PMD driver calls free against segmented mbufs, last segmented mbuf free will trigger
>> >> >> >> >> free_cb call which will free original large & extended mbuf.
>> >> >> >> >
>> >> >> >> >OK
>> >> >> >> >
>> >> >> >> >> Case two is PMD driver will call free against
>> >> >> >> >> original mbuf, that also will call free_cb to free attached extended buffer.
>> >> >> >> >
>> >> >> >> >OK
>> >> >> >> >
>> >> >> >> >And what makes that case 1 or case 2 is executed?
>> >> >> >> >
>> >> >> >> >> In case one free_cb must call
>> >> >> >> >> rte_pktmbuf_free otherwise nobody will free original large & extended mbuf, in case two free_cb can't 
>> >> >> >> >> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_free we need. That is to say, you
>> >> >> >> >> must use the same free_cb to handle these two cases, this is my issue and the point you don't get.
>> >> >> >> >
>> >> >> >> >I think there is no need to change the free_cb API. It should work like
>> >> >> >> >this:
>> >> >> >> >
>> >> >> >> >- virtio creates the original external mbuf (orig_m)
>> >> >> >> >- gso will create a new mbuf referencing the external buffer (new_m)
>> >> >> >> >
>> >> >> >> >At this point, the shinfo has a refcnt of 2. The large buffer will be
>> >> >> >> >freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
>> >> >> >> >whatever the order.
>> >> >> >> >
>> >> >> >> >Regards,
>> >> >> >> >Olivier
>> >> >> >> 
>> >> >> >> Oliver, the reason it works is I changed free_cb API, case 1 doesn't know orig_m, how you make it free orig_m in free_cb.
>> >> >> >> The intention I change free_cb is to let it know orig_m, I saw OVS DPDK ran out out buffers and orig_m isn't freed, that is why
>> >> >> >> I want to bring in this to fix the issue. Again, in case 1, nobody explicitly calls ret_pktmbuf_free(orig_m) except free_cb I changed.
>> >> >> >
>> >> >> >If nobody calls ret_pktmbuf_free(orig_m), it is a problem.
>> >> >> >The free_cb is to free the buffer, not the mbuf.
>> >> >> >
>> >> >> >To me, it should work like this:
>> >> >> >
>> >> >> >1- virtio creates a mbuf attached to the ext buffer (the shinfo placement
>> >> >> >   bug should be fixed)
>> >> >> >2- gso create mbufs that reference the the same ext buf (by attaching the
>> >> >> >   new mbuf)
>> >> >> >3- gso must free the original mbuf
>> >> >> 
>> >> >> This is impossible, segmented mbufs are referencing external buffer in original mbuf,
>> >> >> how do you free it? As I said rte_gso_segment has no way to free it, please tell me a way if
>> >> >> you know how to do this.
>> >> >
>> >> >As I said above, calling rte_mbuf_free(orig_m) will decrement the reference
>> >> >counters on all segments. The segments will be returned to the pool if the
>> >> >refcnt reaches 0.
>> >> >
>> >> >> 
>> >> >> >4- the PMD transmits the new mbufs, and frees them
>> >> >> >
>> >> >> >Whatever 3- or 4- is executed first, at the end we are sure that:
>> >> >> >- all mbufs will be returned to the pool
>> >> >> >- the linear buffer will be freed when the refcnt reaches 0.
>> >> >> >
>> >> >> >If this is still unclear, please, write a unit test like I did
>> >> >> >above to show your issue.
>> >> >> >
>> >> >> >Regards,
>> >> >> >Olivier
>> >> >> >
>> >> >> 
>> >> >> The issue is in "3- gso must free the original mbuf", rte_pktmbuf_free(segmented_mbus) can't do it,
>> >> >> rte_gso_segment is impossible to do it, only feasible point is free_cb, please let me know if you have
>> >> >> a better way to free original mbuf and don't impact on segmented mbufs in PMD. My point is you must
>> >> >> have a place to call rte_pktmbuf_free(rogin_m) explicitly, otherwise it is impossible  to return it to memory
>> >> >> pool, please point out  where it can be called in my user scenario. I don't care how it is done, I just care it can
>> >> >> fix my issue, please don't hesitate and send me a patch if you can, thanks a lot.
>> >> >
>> >> >Sorry, but I don't see how I can be clearer to what I explained
>> >> >in my previous answer.
>> >> >
>> >> >Please, *provide a simple test example* without gso/vhost, and I can help
>> >> >to make it work.
>> >> >
>> >> >
>> >> >Regards,
>> >> >Olivier
>> >> >
>> >> >
>> >> >> >
>> >> >> >
>> >> >> >> free_cb must handle case 1 and case 2 in the same code, for case 1, caller_m is segmented new_m, for case 2, caller_m is orig_m.
>> >> >> >> 
>> >> >> >> loop in rte_gso_segement is handling original mbuf (this mbuf is multi-mbuf and includes multiple mbufs which are linked by next
>> >> >> >> pointer), it isn't a problem at all.
>> >> >> >> 
>> >> >> >> Please show me code how you can fix my issue if you don't change free_cb, thank you.
>> >> >> >> 
>> >> >> >> struct shinfo_arg {
>> >> >> >>        void *buf;
>> >> >> >>        struct rte_mbuf *mbuf;
>> >> >> >> };
>> >> >> >> 
>> >> >> >> virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
>> >> >> >> {
>> >> >> >>        struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
>> >> >> >> 
>> >> >> >>        rte_free(arg->buf);
>> >> >> >>        if (caller_m != arg->mbuf)
>> >> >> >>                rte_pktmbuf_free(arg->mbuf);
>> >> >> >>        rte_free(arg);
>> >> >> >> }
^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2020-10-15  2:52 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-30 12:08 [dpdk-dev] [PATCH V1 0/3] Fix external mbuf free issue in GSO case yang_y_yi
2020-07-30 12:08 ` [dpdk-dev] [PATCH V1 1/3] gso: fix refcnt update issue for external mbuf yang_y_yi
2020-07-30 12:08 ` [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case yang_y_yi
2020-07-31 15:15   ` Olivier Matz
2020-08-01 23:12     ` yang_y_yi
2020-08-02 20:29       ` Olivier Matz
2020-08-02 20:45         ` Olivier Matz
2020-08-03  1:32           ` yang_y_yi
2020-08-03  1:26         ` yang_y_yi
2020-08-03  8:11           ` Olivier Matz
2020-08-03  9:42             ` yang_y_yi
2020-08-03 12:34               ` Olivier Matz
2020-08-04  1:31                 ` yang_y_yi
2020-09-27  5:55                   ` yang_y_yi
2020-10-07  9:48                     ` Olivier Matz
2020-10-09  9:51                       ` yang_y_yi
2020-10-09 11:55                         ` Olivier Matz
2020-10-10  1:49                           ` yang_y_yi
2020-10-14 13:55                             ` Olivier Matz
2020-10-15  2:52                               ` yang_y_yi
2020-07-30 12:09 ` [dpdk-dev] [PATCH V1 3/3] vhost: use new free_cb interface to fix mbuf free issue yang_y_yi
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).