From: Olivier Matz
To: yang_y_yi
Cc: dev@dpdk.org, jiayu.hu@intel.com, thomas@monjalon.net, yangyi01@inspur.com
Date: Mon, 3 Aug 2020 14:34:25 +0200
Subject: Re: [dpdk-dev] [PATCH V1 2/3] mbuf: change free_cb interface to adapt to GSO case
Message-ID: <20200803123425.GM5869@platinum>
In-Reply-To: <7584d005.5305.173b3b3389f.Coremail.yang_y_yi@163.com>
References: <20200730120900.108232-1-yang_y_yi@163.com>
 <20200730120900.108232-3-yang_y_yi@163.com>
 <20200731151543.GH5869@platinum>
 <2a50e80c.44f.173ac4c6d20.Coremail.yang_y_yi@163.com>
 <20200802202907.GJ5869@platinum>
 <3cf82e61.145c.173b1ed87ce.Coremail.yang_y_yi@163.com>
 <20200803081139.GK5869@platinum>
 <7584d005.5305.173b3b3389f.Coremail.yang_y_yi@163.com>
User-Agent: Mutt/1.10.1 (2018-07-13)
List-Id: DPDK patches and discussions

On Mon, Aug 03, 2020 at 05:42:13PM +0800, yang_y_yi wrote:
> At 2020-08-03 16:11:39, "Olivier Matz" wrote:
> >On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
> >> At 2020-08-03 04:29:07, "Olivier Matz" wrote:
> >> >Hi,
> >> >
> >> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
> >> >>
> >> >>
> >> >> At 2020-07-31 23:15:43, "Olivier Matz" wrote:
> >> >> >Hi,
> >> >> >
> >> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y_yi@163.com wrote:
> >> >> >> From: Yi Yang
> >> >> >>
> >> >> >> In GSO case, segmented mbufs are attached to original
> >> >> >> mbuf which can't be freed when it is external. The issue
> >> >> >> is free_cb doesn't know original mbuf and doesn't free
> >> >> >> it when refcnt of shinfo is 0.
> >> >> >>
> >> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
> >> >> >> mbufs or by rte_pktmbuf_free original mbuf. The two
> >> >> >> cases should have different behaviors. free_cb won't
> >> >> >> explicitly call rte_pktmbuf_free to free original mbuf
> >> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
> >> >> >> has to call rte_pktmbuf_free to free original mbuf if it
> >> >> >> is freed by rte_pktmbuf_free segmented mbufs.
> >> >> >>
> >> >> >> In order to fix this issue, free_cb interface has to be
> >> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
> >> >> >> mbuf pointer to free_cb, argument opaque can be defined
> >> >> >> as a custom struct by user, it can include original mbuf
> >> >> >> pointer, user-defined free_cb can compare caller mbuf with
> >> >> >> mbuf in opaque struct, free_cb should free original mbuf
> >> >> >> if they are not the same, this corresponds to rte_pktmbuf_free
> >> >> >> segmented mbufs case, otherwise, free_cb won't free original
> >> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
> >> >> >> to free it.
> >> >> >>
> >> >> >> Here is pseudo code to show the two kinds of cases.
> >> >> >>
> >> >> >> case 1. rte_pktmbuf_free segmented mbufs
> >> >> >>
> >> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
> >> >> >>                       &gso_ctx,
> >> >> >>                       /* segmented mbuf */
> >> >> >>                       (struct rte_mbuf **)&gso_mbufs,
> >> >> >>                       MAX_GSO_MBUFS);
> >> >> >
> >> >> >I'm sorry but it is not very clear to me what operations are done by
> >> >> >rte_gso_segment().
> >> >> >
> >> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
> >> >> >which do not deal with external buffers. Am I missing something?
> >> >> >
> >> >> >Are you able to show the issue only with mbuf functions? It would
> >> >> >be helpful to understand what does not work.
> >> >> >
> >> >> >
> >> >> >Thanks,
> >> >> >Olivier
> >> >> >
> >> >> Olivier, thank you for the comment. Let me show you why it doesn't
> >> >> work for my use case. In OVS DPDK, the VM uses vhostuserclient to
> >> >> send large packets whose size is about 64K because we enabled TSO &
> >> >> UFO. These large packets use rte_mbufs allocated by the DPDK
> >> >> virtio_net function virtio_dev_pktmbuf_alloc() (in file
> >> >> lib/librte_vhost/virtio_net.c). Please refer to [PATCH V1 3/3]: I
> >> >> changed free_cb as below. These packets use the same allocation
> >> >> function and the same free_cb whether they are TCP or UDP packets.
> >> >> In case of VXLAN TSO, most NICs can't support inner UDP fragment
> >> >> offload, so OVS DPDK has to do it in software.
> >> >>
> >> >> For the UDP case, the original rte_mbuf can only be freed through
> >> >> the segmented rte_mbufs which are the output packets of
> >> >> rte_gso_segment, i.e. the original rte_mbuf can only be freed by
> >> >> free_cb. As you can see, it explicitly calls
> >> >> rte_pktmbuf_free(arg->mbuf); the condition
> >> >> "if (caller_m != arg->mbuf)" is true for this case, so this is no
> >> >> problem.
> >> >>
> >> >> But for the TCP case, the original mbuf is delivered to
> >> >> rte_eth_tx_burst(), not the segmented rte_mbufs output by
> >> >> rte_gso_segment. The PMD driver will call
> >> >> rte_pktmbuf_free(original_rte_mbuf), not
> >> >> rte_pktmbuf_free(segmented_rte_mbufs), and the same free_cb will be
> >> >> called. That means original_rte_mbuf will be freed twice; you know
> >> >> what will happen, and this is just the issue I'm fixing. I bring in
> >> >> the caller_m argument, which helps work around this, because
> >> >> caller_m is arg->mbuf and the condition
> >> >> "if (caller_m != arg->mbuf)" is false. You can't fix it without the
> >> >> change this patch series makes.
> >> >
> >> >I'm still not sure I get your issue. Please, if you have a simple test
> >> >case using only mbuf functions (without virtio, gso, ...), it would be
> >> >very helpful because we will be sure that we are talking about the same
> >> >thing. In case there is an issue, it can easily become a unit test.
> >>
> >> Olivier, I think you don't get the point: the free operation can't be
> >> controlled by the application itself, it is done by the PMD driver and
> >> triggered by rte_eth_tx_burst. I have shown pseudo code.
> >> rte_gso_segment just segments a large mbuf into multiple mbufs, it
> >> won't send them; the application will finally call rte_eth_tx_burst
> >> to send them.
> >>
> >> >
> >> >That said, I looked at vhost mbuf allocation and gso segmentation, and
> >> >I found some strange things:
> >> >
> >> >1/ In virtio_dev_extbuf_alloc(), there are 2 paths to create the
> >> >   ext mbuf.
> >> >
> >> >   a/ The first one stores the shinfo struct in the mbuf, basically
> >> >      like this:
> >> >
> >> >        pkt = rte_pktmbuf_alloc(mp);
> >> >        shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> >> >        buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >> >        shinfo->free_cb = virtio_dev_extbuf_free;
> >> >        shinfo->fcb_opaque = buf;
> >> >        rte_mbuf_ext_refcnt_set(shinfo, 1);
> >> >
> >> >      I don't think it is a good idea, because there is no guarantee that
> >> >      the mbuf won't be freed before the buffer. For instance, doing
> >> >      this will probably fail:
> >> >
> >> >        pkt2 = rte_pktmbuf_alloc(mp);
> >> >        rte_pktmbuf_attach(pkt2, pkt);
> >> >        rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
> >>
> >> pkt is created by the application I can control, so I can control
> >> where it will be freed, right?
> >
> >This example shows that mbufs allocated like this by the vhost
> >driver are not constructed correctly.
> >If an application attaches a new
> >packet (pkt2) to it and frees the original one (pkt), it may result in
> >memory corruption.
> >
> >Of course, to be tested and confirmed.
>
> No, attach will increase refcnt of shinfo, free_cb is only called when
> refcnt of shinfo is decreased to 0, isn't it?

When pkt is freed, the shinfo refcnt is decremented, and afterwards it
is 1. So the buffer won't be freed. After that, the mbuf pkt is
detached and returns to the mbuf pool. It means it can be reallocated,
and the next user can overwrite shinfo, which is still stored in the
mbuf data.

I did a test to show it, see:
http://git.droids-corp.org/?p=dpdk.git;a=commitdiff;h=a617494eeb01ff

If you run the mbuf autotest, it segfaults.

>
> >
> >>
> >> >
> >> >      To do this properly, the mbuf refcnt should be increased, and
> >> >      the mbuf should be freed in the callback. But I don't think it's
> >> >      worth doing it, given the second path (b/) looks good to me.
> >> >
> >> >   b/ The second path stores the shinfo struct at the end of the
> >> >      allocated buffer, like this:
> >> >
> >> >        pkt = rte_pktmbuf_alloc(mp);
> >> >        buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
> >> >        buf_len = RTE_ALIGN_CEIL(buf_len, sizeof(uintptr_t));
> >> >        buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> >> >        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> >> >                                              virtio_dev_extbuf_free, buf);
> >> >
> >> >      I think this is correct, because we have the guarantee that shinfo
> >> >      exists as long as the buffer exists.
> >>
> >> Which buffer do you mean by "the allocated buffer" here? The issue
> >> we're discussing is how we can free the original mbuf which owns the
> >> shinfo buffer.
> >
> >I don't get your question.
> >
> >I'm just saying that this code path looks correct, compared to
> >the previous one.
>
> I think you're challenging the principle of external mbuf, that isn't
> the thing I address.
I'm not challenging anything, I'm saying there is a bug in this code,
and the unit test above tends to confirm it.

>
> >
> >>
> >> >
> >> >2/ in rte_gso_segment(), there is a loop like this:
> >> >
> >> >        while (pkt_seg) {
> >> >                rte_mbuf_refcnt_update(pkt_seg, -1);
> >> >                pkt_seg = pkt_seg->next;
> >> >        }
> >> >
> >> >   You change it to take into account the refcnt for ext mbufs.
> >> >
> >> >   I may have missed something but I wonder why it is not simply:
> >> >
> >> >        rte_pktmbuf_free(pkt_seg);
> >> >
> >> >   It will decrease the proper refcnt, and free the mbufs if they
> >> >   are not used.
> >>
> >> Again, rte_gso_segment just decreases refcnt by one, this will ensure
> >> that freeing the last segmented mbuf triggers freeing the original
> >> mbuf (only free_cb can do this).
> >
> >rte_pktmbuf_free() will also decrease the refcnt, and free the resources
> >when the refcnt reaches 0.
> >
> >It has some advantages compared to decreasing the reference counter of
> >all segments:
> >
> >- no need to iterate the segments, there is only one function call
> >- no need to have a special case for ext mbufs like you did in your patch
> >- it may be safer, in case some segments have a refcnt == 1, because
> >  resources will be freed.
>
> For external mbuf, attach only increases refcnt of shinfo, refcnt of
> mbuf won't be touched. For normal mbuf, attach only increases refcnt of
> mbuf, there is no shinfo, so no shinfo refcnt is increased.

I suppose rte_gso_segment() can take any mbuf type as input: standard
mbuf, indirect mbuf, ext mbuf, or even a mbuf chain containing segments
of different types.

For instance, if you pass a chain of 2 mbufs:
- the first one is a direct mbuf containing the IP/TCP headers (orig_hdr)
- the second one is a mbuf pointing to an ext buffer (orig_payload)

I expect that the resulting mbuf after calling gso contains a list of
mbufs like this:
- a first segment containing the IP/TCP headers (new_hdr)
- a payload segment pointing to the same ext buffer

In theory, there is no reason that orig_hdr should be referenced by
another new mbuf, because it only contains headers (no data). If that's
the case, its refcnt is 1, and decreasing it to 0 without freeing it is
a bug.

Anyway, there is maybe no issue in that case, but I was just suggesting
that using rte_pktmbuf_free() is easier to read, and safer than manually
decreasing the refcnt of each segment.

> >> >Again, sorry if this is not the issue you are referring to, but
> >> >in this case I really think that having a simple example code that
> >> >shows the issue would help.
> >>
> >> Olivier, my statement in the patch I sent out has pseudo code to show
> >> this. I don't think a simple unit test can show it.
> >
> >I don't see why. The PMDs and the libraries use the mbuf functions, why
> >couldn't a unit test call the same functions?
> >
> >> Let me summarize it here again. For the original mbuf, there are two
> >> cases freeing it. Case one is the PMD driver calls free against
> >> segmented mbufs; the last segmented mbuf free will trigger the free_cb
> >> call which will free the original large & extended mbuf.
> >
> >OK
> >
> >> Case two is the PMD driver will call free against the
> >> original mbuf, that also will call free_cb to free the attached
> >> extended buffer.
> >
> >OK
> >
> >And what determines whether case 1 or case 2 is executed?
> >
> >> In case one free_cb must call
> >> rte_pktmbuf_free, otherwise nobody will free the original large &
> >> extended mbuf; in case two free_cb can't call rte_pktmbuf_free because
> >> the caller calling it is just the rte_pktmbuf_free we need.
> >> That is to say, you
> >> must use the same free_cb to handle these two cases, this is my issue
> >> and the point you don't get.
> >
> >I think there is no need to change the free_cb API. It should work like
> >this:
> >
> >- virtio creates the original external mbuf (orig_m)
> >- gso will create a new mbuf referencing the external buffer (new_m)
> >
> >At this point, the shinfo has a refcnt of 2. The large buffer will be
> >freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
> >whatever the order.
> >
> >Regards,
> >Olivier
>
> Olivier, the reason it works is that I changed the free_cb API. Case 1
> doesn't know orig_m, so how do you make it free orig_m in free_cb?
> The intention of my free_cb change is to let it know orig_m. I saw OVS
> DPDK run out of buffers because orig_m isn't freed, that is why I want
> to bring this in to fix the issue. Again, in case 1, nobody explicitly
> calls rte_pktmbuf_free(orig_m) except the free_cb I changed.

If nobody calls rte_pktmbuf_free(orig_m), it is a problem. The free_cb
is to free the buffer, not the mbuf.

To me, it should work like this:

1- virtio creates a mbuf attached to the ext buffer (the shinfo placement
   bug should be fixed)
2- gso creates mbufs that reference the same ext buf (by attaching the
   new mbuf)
3- gso must free the original mbuf
4- the PMD transmits the new mbufs, and frees them

Whichever of 3- or 4- is executed first, at the end we are sure that:
- all mbufs will be returned to the pool
- the linear buffer will be freed when the refcnt reaches 0.

If this is still unclear, please, write a unit test like I did
above to show your issue.

Regards,
Olivier

> free_cb must handle case 1 and case 2 in the same code: for case 1,
> caller_m is the segmented new_m, for case 2, caller_m is orig_m.
>
> The loop in rte_gso_segment is handling the original mbuf (this mbuf is
> multi-mbuf and includes multiple mbufs which are linked by the next
> pointer), it isn't a problem at all.
>
> Please show me code how you can fix my issue if you don't change
> free_cb, thank you.
>
> struct shinfo_arg {
>        void *buf;
>        struct rte_mbuf *mbuf;
> };
>
> static void
> virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
> {
>        struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
>
>        rte_free(arg->buf);
>        if (caller_m != arg->mbuf)
>                rte_pktmbuf_free(arg->mbuf);
>        rte_free(arg);
> }
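
The flow in steps 1- to 4- above can be sketched with a toy model in
plain C. This is only an illustration under stated assumptions: every
toy_* name is hypothetical, none of them are DPDK APIs, and the model
only mimics the reference counting (shinfo stored at the end of the
buffer as in path b/, a free_cb that releases the buffer and never an
mbuf). It is not the unit test linked above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for illustration only; these are NOT the DPDK types. */
struct toy_shinfo {
	int refcnt;                          /* mbufs attached to the buffer */
	void (*free_cb)(void *addr, void *opaque);
	void *opaque;
};

struct toy_mbuf {
	int in_pool;                         /* 1 once returned to the pool */
	void *buf_addr;
	struct toy_shinfo *shinfo;
};

static int toy_buffers_freed;            /* counts free_cb invocations */

/* The callback releases the external buffer only, never an mbuf. */
static void toy_extbuf_free(void *addr, void *opaque)
{
	(void)addr;
	free(opaque);
	toy_buffers_freed++;
}

/* Step 1: allocate the buffer with shinfo placed at its end (path b/). */
static void toy_extbuf_alloc(struct toy_mbuf *m, size_t len)
{
	char *buf;

	assert(len % _Alignof(struct toy_shinfo) == 0);
	buf = malloc(len + sizeof(struct toy_shinfo));
	assert(buf != NULL);
	m->shinfo = (struct toy_shinfo *)(buf + len);
	m->shinfo->refcnt = 1;
	m->shinfo->free_cb = toy_extbuf_free;
	m->shinfo->opaque = buf;
	m->buf_addr = buf;
	m->in_pool = 0;
}

/* Step 2: a new mbuf references the same buffer; only shinfo refcnt moves. */
static void toy_attach(struct toy_mbuf *seg, const struct toy_mbuf *orig)
{
	seg->buf_addr = orig->buf_addr;
	seg->shinfo = orig->shinfo;
	seg->in_pool = 0;
	seg->shinfo->refcnt++;
}

/* Steps 3 and 4: return the mbuf to the pool, drop one buffer reference. */
static void toy_mbuf_free(struct toy_mbuf *m)
{
	struct toy_shinfo *shinfo = m->shinfo;
	void *addr = m->buf_addr;

	m->shinfo = NULL;
	m->buf_addr = NULL;
	m->in_pool = 1;
	if (--shinfo->refcnt == 0)
		shinfo->free_cb(addr, shinfo->opaque); /* shinfo is gone after this */
}

/* Returns 1 if every mbuf is back in the pool and the buffer was freed once. */
int toy_run_flow(int pmd_frees_first)
{
	struct toy_mbuf orig, seg[2];
	int i;

	toy_buffers_freed = 0;
	toy_extbuf_alloc(&orig, 65536);      /* 1- virtio creates the mbuf */
	for (i = 0; i < 2; i++)
		toy_attach(&seg[i], &orig);      /* 2- gso attaches new mbufs */

	if (pmd_frees_first) {
		for (i = 0; i < 2; i++)
			toy_mbuf_free(&seg[i]);      /* 4- PMD frees after TX */
		toy_mbuf_free(&orig);            /* 3- gso frees the original */
	} else {
		toy_mbuf_free(&orig);            /* 3- gso frees the original */
		for (i = 0; i < 2; i++)
			toy_mbuf_free(&seg[i]);      /* 4- PMD frees after TX */
	}
	return orig.in_pool && seg[0].in_pool && seg[1].in_pool &&
	       toy_buffers_freed == 1;
}
```

In this model, toy_run_flow(0) and toy_run_flow(1) both return 1: the
callback never needs to know which mbuf triggered it, as long as every
mbuf that references the buffer is freed by whoever owns it.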