From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from dpdk.org (dpdk.org [92.243.14.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 8187DA0518;
	Mon,  3 Aug 2020 10:11:43 +0200 (CEST)
Received: from [92.243.14.124] (localhost [127.0.0.1])
	by dpdk.org (Postfix) with ESMTP id C23DA2BD8;
	Mon,  3 Aug 2020 10:11:42 +0200 (CEST)
Received: from mail-wm1-f66.google.com (mail-wm1-f66.google.com
 [209.85.128.66]) by dpdk.org (Postfix) with ESMTP id 4726E2952
 for <dev@dpdk.org>; Mon,  3 Aug 2020 10:11:41 +0200 (CEST)
Received: by mail-wm1-f66.google.com with SMTP id c80so13479604wme.0
 for <dev@dpdk.org>; Mon, 03 Aug 2020 01:11:41 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=6wind.com; s=google;
 h=date:from:to:cc:subject:message-id:references:mime-version
 :content-disposition:content-transfer-encoding:in-reply-to
 :user-agent; bh=w1RfcVgCynuuPOcrNpKNKXWVHDC+AhzrYBBX9TQHtQg=;
 b=XytGQyLO11ArTzpaPFy8AmuH33M2YHOgP6fPIp1OcN5idOvaMMdNbDUl/5qans2CZa
 hhtfYU6svOURFfUopKDB6CF1zgl9aHuZY072VgtVEEDR/UWnUy1zdU+OZqeEalAZN63S
 u7Rqy8pA840jxHrZK83HfGWJfkN3FtUfopemQhaQ24/JYV0TCFGMF/iViZmZsiJmTYJp
 3wPYesBlTTyUgRIDiumorrnjJUjLdKJMLlP4IIcA4pr3PVLYRmQzJEa7r8/uewn9ik1c
 GNtSn49B1yoB3sjNyY5sdOIGapQPWHvHVfDW+5dYkfCvDcc691k5CbKBeK1Qycck7YkB
 82/A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:date:from:to:cc:subject:message-id:references
 :mime-version:content-disposition:content-transfer-encoding
 :in-reply-to:user-agent;
 bh=w1RfcVgCynuuPOcrNpKNKXWVHDC+AhzrYBBX9TQHtQg=;
 b=NqXm4DxpGN//xGCaMXTGAbnNOxHlfP9Vx6ddzKnXZ7RFJsPK8syvQR1Sc9qvfsje44
 5fbMXe42LG/mTz9ld0BEQwkhx9dM4zHgoj0G1O6LYFIeJ6tCY4xzaAM1k858rLh7uDFR
 jKS6U4XQHyYlQkOjL/WqxCUl4lChzUJgoAcwkR70q5TycIgGY2ButW1dWtQA1evSSggK
 UkSA6PGYviyTPwxeH231+Jwa+VpMYUAt8x/kgE4H6j4wG77ChQRMDrNDCkiTuW3Z77Gz
 eP1m0NDWRDTFWNRVlVnJZJQwGXOb3jreieTP7bcP2Dutu+AmloOiJW2sOEq9b6Gv1NnR
 FDpQ==
X-Gm-Message-State: AOAM533ZeUm7RUBHNwt1JGxvkw61UVDgToGnjpaktra4xERrY4ltfO0+
 RjwYgOqROIixYvpzCqkOkoDpFg==
X-Google-Smtp-Source: ABdhPJzRhx2k3wpaIIrsLKAciV1UFffARO4XyzsA7xst+p3y5/oz2bmmVQfR/Z2/4l5s6XERe342aQ==
X-Received: by 2002:a05:600c:2146:: with SMTP id
 v6mr14685096wml.87.1596442300740; 
 Mon, 03 Aug 2020 01:11:40 -0700 (PDT)
Received: from 6wind.com (2a01cb0c0005a600345636f7e65ed1a0.ipv6.abo.wanadoo.fr.
 [2a01:cb0c:5:a600:3456:36f7:e65e:d1a0])
 by smtp.gmail.com with ESMTPSA id q7sm24237255wra.56.2020.08.03.01.11.39
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Mon, 03 Aug 2020 01:11:40 -0700 (PDT)
Date: Mon, 3 Aug 2020 10:11:39 +0200
From: Olivier Matz <olivier.matz@6wind.com>
To: yang_y_yi <yang_y_yi@163.com>
Cc: dev@dpdk.org, jiayu.hu@intel.com, thomas@monjalon.net, yangyi01@inspur.com
Message-ID: <20200803081139.GK5869@platinum>
References: <20200730120900.108232-1-yang_y_yi@163.com>
 <20200730120900.108232-3-yang_y_yi@163.com>
 <20200731151543.GH5869@platinum>
 <2a50e80c.44f.173ac4c6d20.Coremail.yang_y_yi@163.com>
 <20200802202907.GJ5869@platinum>
 <3cf82e61.145c.173b1ed87ce.Coremail.yang_y_yi@163.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <3cf82e61.145c.173b1ed87ce.Coremail.yang_y_yi@163.com>
User-Agent: Mutt/1.10.1 (2018-07-13)
Subject: Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to
 adapt to GSO case
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >Hi,
> >
> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
> >>=20
> >>=20
> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.matz@6wind.com> wrote:
> >> >Hi,
> >> >
> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
> >> >> From: Yi Yang <yangyi01@inspur.com>
> >> >>=20
> >> >> In GSO case, segmented mbufs are attached to original
> >> >> mbuf which can't be freed when it is external. The issue
> >> >> is free_cb doesn't know original mbuf and doesn't free
> >> >> it when refcnt of shinfo is 0.
> >> >>=20
> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
> >> >> cases should have different behaviors. free_cb won't
> >> >> explicitly call rte_pktmbuf_free to free original mbuf
> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
> >> >> has to call rte_pktmbuf_free to free original mbuf if it
> >> >> is freed by rte_pktmbuf_free segmented mbufs.
> >> >>=20
> >> >> In order to fix this issue, free_cb interface has to been
> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
> >> >> mbuf pointer to free_cb, argument opaque can be defined
> >> >> as a custom struct by user, it can includes original mbuf
> >> >> pointer, user-defined free_cb can compare caller mbuf with
> >> >> mbuf in opaque struct, free_cb should free original mbuf
> >> >> if they are not same, this corresponds to rte_pktmbuf_free
> >> >> segmented mbufs case, otherwise, free_cb won't free original
> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
> >> >> to free it.
> >> >>=20
> >> >> Here is pseduo code to show two kind of cases.
> >> >>=20
> >> >> case 1. rte_pktmbuf_free segmented mbufs
> >> >>=20
> >> >> nb_tx =3D rte_gso_segment(original_mbuf, /* original mbuf */
> >> >>                       &gso_ctx,
> >> >>                       /* segmented mbuf */
> >> >>                       (struct rte_mbuf **)&gso_mbufs,
> >> >>                       MAX_GSO_MBUFS);
> >> >
> >> >I'm sorry but it is not very clear to me what operations are done by
> >> >rte_gso_segment().
> >> >
> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
> >> >which do not deal with external buffers. Am I missing something?
> >> >
> >> >Are you able to show the issue only with mbuf functions? It would
> >> >be helpful to understand what does not work.
> >> >
> >> >
> >> >Thanks,
> >> >Olivier
> >> >
> >> Oliver, thank you for comment, let me show you why it doesn't work for=
 my use case.  In OVS DPDK, VM uses vhostuserclient to send large packets w=
hose size is about 64K because we enabled TSO & UFO, these large packets us=
e rte_mbufs allocated by DPDK virtio_net function=20
> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. Ple=
ase refer to [PATCH V1 3/3], I changed free_cb as below, these packets use =
the same allocate function and the same free_cb no matter they are TCP pack=
et or UDP packets, in case of VXLAN TSO, most NICs can't support inner UDP =
fragment offload, so OVS DPDK has to do it by software, for UDP case, the o=
riginal rte_mbuf only can be freed by segmented rte_mbufs which are output =
packets of rte_gso_segment, i.e. the original rte_mbuf only can freed by fr=
ee_cb, you can see, it explicitly called rte_pktmbuf_free(arg->mbuf), the c=
ondition statement "if (caller_m !=3D arg->mbuf)" is true for this case, th=
is has no problem, but for TCP case, the original mbuf is delivered to rte_=
eth_tx_burst() but not segmented rte_mbufs output by rte_gso_segment, PMD d=
river will call rte_pktmbuf_free(original_rte_mbuf) but not rte_pktmbuf_fre=
e(segmented_rte_mbufs), the same free_cb will be called, that means origina=
l_rte_mbuf will be freed twice, you know what will happen, this is just the=
 issue I'm fixing. I bring in caller_m argument, it can help work around th=
is because caller_m is arg->mbuf and the condition statement "if (caller_m =
!=3D arg->mbuf)" is false, you can't fix it without the change this patch s=
eries did.
> >
> >I'm sill not sure to get your issue. Please, if you have a simple test
> >case using only mbufs functions (without virtio, gso, ...), it would be
> >very helpful because we will be sure that we are talking about the same
> >thing. In case there is an issue, it can easily become a unit test.
>=20
> Oliver, I think you don't get the point, free operation can't be controll=
ed by the application itself,=20
> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown =
pseudo code,
> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't se=
nd them, the application
> will call rte_eth_tx_burst to send them finally.
>
> >
> >That said, I looked at vhost mbuf allocation and gso segmentation, and
> >I found some strange things:
> >
> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
> >   ext mbuf.
> >
> >   a/ The first one stores the shinfo struct in the mbuf, basically
> >      like this:
> >
> >	pkt =3D rte_pktmbuf_alloc(mp);
> >	shinfo =3D rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> >	buf =3D rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >	shinfo->free_cb =3D virtio_dev_extbuf_free;
> >	shinfo->fcb_opaque =3D buf;
> >	rte_mbuf_ext_refcnt_set(shinfo, 1);
> >
> >      I don't think it is a good idea, because there is no guarantee that
> >      the mbuf won't be freed before the buffer. For instance, doing
> >      this will probably fail:
> >
> >	pkt2 =3D rte_pktmbuf_alloc(mp);
> >	rte_pktmbuf_attach(pkt2, pkt);
> >	rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
>=20
> pkt is created by the application I can control, so I can control it wher=
e it will be freed, right?

This example shows that mbufs allocated like this by the vhost
driver are not constructed correctly. If an application attach a new
packet (pkt2) to it and frees the original one (pkt), it may result in a
memory corruption.

Of course, to be tested and confirmed.

>=20
> >
> >      To do this properly, the mbuf refcnt should be increased, and
> >      the mbuf should be freed in the callback. But I don't think it's
> >      worth doing it, given the second path (b/) looks good to me.
> >
> >   b/ The second path stores the shinfo struct at the end of the
> >      allocated buffer, like this:
> >
> >	pkt =3D rte_pktmbuf_alloc(mp);
> >	buf_len +=3D sizeof(*shinfo) + sizeof(uintptr_t);
> >	buf_len =3D RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> >	buf =3D rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >	shinfo =3D rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> >					      virtio_dev_extbuf_free, buf);
> >
> >      I think this is correct, because we have the guarantee that shinfo
> >      exists as long as the buffer exists.
>=20
> What buffer does the allocated buffer you're saying here? The issue we're=
 discussing how we can
> free original mbuf which owns shinfo buffer.

I don't get your question.

I'm just saying that this code path looks correct, compared to
the previous one.

>=20
> >
> >2/ in rte_gso_segment(), there is a loop like this:
> >
> >	while (pkt_seg) {
> >		rte_mbuf_refcnt_update(pkt_seg, -1);
> >		pkt_seg =3D pkt_seg->next;
> >	}
> >
> >   You change it to take in account the refcnt for ext mbufs.
> >
> >   I may have missed something but I wonder why it is not simply:
> >
> >	rte_pktmbuf_free(pkt_seg);
> >
> >   It will decrease the proper refcnt, and free the mbufs if they
> >   are not used.
>=20
> Again, rte_gso_segment just decreases refcnt by one, this will ensure the=
 last segmented=20
> mbuf free will trigger freeing original mbuf (only free_cb can do this).

rte_pktmbuf_free() will also decerase the refcnt, and free the resources
when the refcnt reaches 0.

It has some advantages compared to decrease the reference counter of all
segments:

- no need to iterate the segments, there is only one function call
- no need to have a special case for ext mbufs like you did in your patch
- it may be safer, in case some segments have a refcnt =3D=3D 1, because
  resources will be freed.

> >Again, sorry if this is not the issue your are referring to, but
> >in this case I really think that having a simple example code that
> >shows the issue would help.
>=20
> Oliver, my statement in the patch I sent out has pseudo code to show this=
=2E  I don't think a simple
> unit test can show it.

I don't see why. The PMDs and the libraries use the mbuf functions, why
a unit test couldn't call the same functions?

> Let me summarize it here again. For original mbuf, there are two cases fr=
eeing
> it, case one is PMD driver calls free against segmented mbufs, last segme=
nted mbuf free will trigger
> free_cb call which will free original large & extended mbuf.

OK

> Case two is PMD driver will call free against
> original mbuf, that also will call free_cb to free attached extended buff=
er.

OK

And what makes that case 1 or case 2 is executed?

> In case one free_cb must call
> rte_pktmbuf_free otherwise nobody will free original large & extended mbu=
f, in case two free_cb can't=20
> call rte_pktmbuf_free because the caller calling it is just rte_pktmbuf_f=
ree we need. That is to say, you
> must use the same free_cb to handle these two cases, this is my issue and=
 the point you don't get.

I think there is no need to change the free_cb API. It should work like
this:

- virtio creates the original external mbuf (orig_m)
- gso will create a new mbuf referencing the external buffer (new_m)

At this point, the shinfo has a refcnt of 2. The large buffer will be
freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
whatever the order.

Regards,
Olivier