From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id 40FCFA0524; Fri, 6 Nov 2020 09:20:59 +0100 (CET) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 22161377E; Fri, 6 Nov 2020 09:20:57 +0100 (CET) Received: from mail-wm1-f68.google.com (mail-wm1-f68.google.com [209.85.128.68]) by dpdk.org (Postfix) with ESMTP id 668FB3772 for ; Fri, 6 Nov 2020 09:20:56 +0100 (CET) Received: by mail-wm1-f68.google.com with SMTP id 23so419970wmg.1 for ; Fri, 06 Nov 2020 00:20:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=6wind.com; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to :user-agent; bh=EXBobbRQOn+2MwAxGKigSYYwHLvKYc4T23Puic6i+To=; b=iLJD/j23Zp/T88NIynVDTXn+Do7ZVrgxUZoFFbPqjECMFoljx3Ljzoan4WPEeqI6aa LMTWdWqSM0SHj+H2wv0Huq5grywbjuu+DdhQGVjWDs4Y9sDs1KdtkIT/Yt8fNBpwfxCN nFTLW8ClHNYCJ8QgK/RyIvDW09PFCmJJmcgKtEBAV6r+G0EtQTqt+UiBAUHbzaiL8mh+ 80wCks8ktII6jBXrPMhaDdxdsV7xknWBiX1tdREnC1LIt2JJ0vrSZtEyNoARZokSrTbW uUbrjNr8a870QoFHgMC0BUCR9S958apQJYPvrhVQaEEa+qUSSL3A34JZkqXVNhdosokK DNvg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to:user-agent; bh=EXBobbRQOn+2MwAxGKigSYYwHLvKYc4T23Puic6i+To=; b=WaMZPAKuQrLe7pgbbkZ5NmoWUeWcgSAOiDg09rKKS/FQFOKcdYo4AJVG/SjZ4dZrc7 9Et9VavRABui2/LWkeLeNzRIgiDdN7aZ20lMnXHLtsm955dGEPH7kc8GOHi748DQxSfW fCVgSEQWfY+/D6qeb3ByWPPHdIy/qLE+WEvcVz/CIgXVwexspi+j1asX1UFV0wXZgI+s z6ZZRoMgZRZysXpR4p+8wArWxPk6Ef2YP2W8GSErfEpcu+2TuQuzhuAiHUlTQJ/RyyIi N6aDUpx1tsCRoZkVmkwTBN5PgnTfPav663/zw79AjPhsMJQ0JzCG18G5AnY1LvU8Fllc gi4A== X-Gm-Message-State: AOAM5330fl+swOqC/qW8Bz3AGPyu7wgk49jiS/rkd/4ut9YOreu6htEo 3lcnKLm0zLKlwIy456Jg0cpYGA== X-Google-Smtp-Source: ABdhPJz7r8DhxiKvdqegR8/rH53m6XaaFEHFopZMBbHXKBaFsQgmuBATXsUmcVGo8PyoSkiHJxrP6g== X-Received: by 2002:a1c:790b:: with SMTP id l11mr1145235wme.53.1604650854926; Fri, 06 Nov 2020 00:20:54 -0800 (PST) Received: from 6wind.com (2a01cb0c0005a600345636f7e65ed1a0.ipv6.abo.wanadoo.fr. [2a01:cb0c:5:a600:3456:36f7:e65e:d1a0]) by smtp.gmail.com with ESMTPSA id n17sm925505wrw.69.2020.11.06.00.20.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 06 Nov 2020 00:20:54 -0800 (PST) Date: Fri, 6 Nov 2020 09:20:53 +0100 From: Olivier Matz To: Morten =?iso-8859-1?Q?Br=F8rup?= Cc: "Ananyev, Konstantin" , Andrew Rybchenko , dev@dpdk.org Message-ID: <20201106082053.GZ1898@platinum> References: <10b675b5-c376-746f-9ae2-210685b0b05d@oktetlabs.ru> <20201105091022.GO1898@platinum> <20201105123150.GP1898@platinum> <20201105132438.GS1898@platinum> <98CBD80474FA8B44BF855DF32C47DC35C613F7@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35C613F8@smartserver.smartshare.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35C613F8@smartserver.smartshare.dk> User-Agent: Mutt/1.10.1 (2018-07-13) Subject: Re: [dpdk-dev] [PATCH] mbuf: fix reset on mbuf free X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" On Fri, Nov 06, 2020 at 08:52:58AM +0100, Morten Brørup wrote: > > From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com] > > Sent: Friday, November 6, 2020 12:55 AM > > > > > > > > > > > > >> Hi Olivier, > > > > > > > > > > >> > > > > > > > > > > >>> m->nb_seg must be reset on mbuf free whatever the > > value > > > > of m->next, > > > > > > > > > > >>> because it can happen that m->nb_seg is != 1. For > > > > instance in this > > > > > > > > > > >>> case: > > > > > > > > > > >>> > > > > > > > > > > >>> m1 = rte_pktmbuf_alloc(mp); > > > > > > > > > > >>> rte_pktmbuf_append(m1, 500); > > > > > > > > > > >>> m2 = rte_pktmbuf_alloc(mp); > > > > > > > > > > >>> rte_pktmbuf_append(m2, 500); > > > > > > > > > > >>> rte_pktmbuf_chain(m1, m2); > > > > > > > > > > >>> m0 = rte_pktmbuf_alloc(mp); > > > > > > > > > > >>> rte_pktmbuf_append(m0, 500); > > > > > > > > > > >>> rte_pktmbuf_chain(m0, m1); > > > > > > > > > > >>> > > > > > > > > > > >>> As rte_pktmbuf_chain() does not reset nb_seg in the > > > > initial m1 > > > > > > > > > > >>> segment (this is not required), after this code the > > > > mbuf chain > > > > > > > > > > >>> have 3 segments: > > > > > > > > > > >>> - m0: next=m1, nb_seg=3 > > > > > > > > > > >>> - m1: next=m2, nb_seg=2 > > > > > > > > > > >>> - m2: next=NULL, nb_seg=1 > > > > > > > > > > >>> > > > > > > > > > > >>> Freeing this mbuf chain will not restore nb_seg=1 > > in > > > > the second > > > > > > > > > > >>> segment. > > > > > > > > > > >> > > > > > > > > > > >> Hmm, not sure why is that? > > > > > > > > > > >> You are talking about freeing m1, right? > > > > > > > > > > >> rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > > > > > > > > > > >> { > > > > > > > > > > >> ... > > > > > > > > > > >> if (m->next != NULL) { > > > > > > > > > > >> m->next = NULL; > > > > > > > > > > >> m->nb_segs = 1; > > > > > > > > > > >> } > > > > > > > > > > >> > > > > > > > > > > >> m1->next != NULL, so it will enter the if() block, > > > > > > > > > > >> and will reset both next and nb_segs. > > > > > > > > > > >> What I am missing here? > > > > > > > > > > >> Thinking in more generic way, that change: > > > > > > > > > > >> - if (m->next != NULL) { > > > > > > > > > > >> - m->next = NULL; > > > > > > > > > > >> - m->nb_segs = 1; > > > > > > > > > > >> - } > > > > > > > > > > >> + m->next = NULL; > > > > > > > > > > >> + m->nb_segs = 1; > > > > > > > > > > > > > > > > > > > > > > Ah, sorry. I oversimplified the example and now it > > does > > > > not > > > > > > > > > > > show the issue... > > > > > > > > > > > > > > > > > > > > > > The full example also adds a split() to break the > > mbuf > > > > chain > > > > > > > > > > > between m1 and m2. The kind of thing that would be > > done > > > > for > > > > > > > > > > > software TCP segmentation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If so, may be the right solution is to care about > > nb_segs > > > > > > > > > > when next is set to NULL on split? Any place when next > > is > > > > set > > > > > > > > > > to NULL. Just to keep the optimization in a more > > generic > > > > place. > > > > > > > > > > > > > > > > > > > > > > > > > The problem with that approach is that there are already > > > > several > > > > > > > > > existing split() or trim() implementations in different > > DPDK- > > > > based > > > > > > > > > applications. For instance, we have some in 6WINDGate. If > > we > > > > force > > > > > > > > > applications to set nb_seg to 1 when resetting next, it > > has > > > > to be > > > > > > > > > documented because it is not straightforward. > > > > > > > > > > > > > > > > I think it is better to go that way. > > > > > > > > From my perspective it seems natural to reset nb_seg at > > same > > > > time > > > > > > > > we reset next, otherwise inconsistency will occur. > > > > > > > > > > > > > > While it is not explicitly stated for nb_segs, to me it was > > clear > > > > that > > > > > > > nb_segs is only valid in the first segment, like for many > > fields > > > > (port, > > > > > > > ol_flags, vlan, rss, ...). > > > > > > > > > > > > > > If we say that nb_segs has to be valid in any segments, it > > means > > > > that > > > > > > > chain() or split() will have to update it in all segments, > > which > > > > is not > > > > > > > efficient. > > > > > > > > > > > > Why in all? > > > > > > We can state that nb_segs on non-first segment should always > > equal > > > > 1. > > > > > > As I understand in that case, both split() and chain() have to > > > > update nb_segs > > > > > > only for head mbufs, rest ones will remain untouched. > > > > > > > > > > Well, anyway, I think it's strange to have a constraint on m- > > >nb_segs > > > > for > > > > > non-first segment. We don't have that kind of constraints for > > other > > > > fields. > > > > > > > > True, we don't. But this is one of the fields we consider critical > > > > for proper work of mbuf alloc/free mechanism. > > > > > > > > > > I am not sure that requiring m->nb_segs == 1 on non-first segments > > will provide any benefits. > > > > It would make this patch unneeded. > > So, for direct, non-segmented mbufs pktmbuf_free() will remain write- > > free. > > I see. Then I agree with Konstantin that alternative solutions should be considered. > > The benefit regarding free()'ing non-segmented mbufs - which is a very common operation - certainly outweighs the cost of requiring split()/chain() operations to set the new head mbuf's nb_segs = 1. > > Nonetheless, the bug needs to be fixed somehow. > > If we can't come up with a better solution that doesn't break the ABI, we are forced to accept the patch. > > Unless the techboard accepts to break the ABI in order to avoid the performance cost of this patch. Did someone notice a performance drop with this patch? On my side, I don't see any regression on a L3 use case. Let's sumarize: splitting a mbuf chain and freeing it causes subsequent mbuf allocation to return a mbuf which is not correctly initialized. There are 2 options to fix it: 1/ change the mbuf free function (this patch) - m->nb_segs would behave like many other field: valid in the first segment, ignored in other segments - may impact performance (suspected) 2/ change all places where a mbuf chain is split, or trimmed - m->nb_segs would have a specific behavior: count the number of segments in the first mbuf, should be 1 in the last segment, ignored in other ones. - no code change in mbuf library, so no performance impact - need to patch all places where we do a mbuf split or trim. From afar, I see at least mbuf_cut_seg_ofs() in DPDK. Some external applications may have to be patched (for instance, I already found 3 places in 6WIND code base without a deep search). In my opinion, 1/ is better, except we notice a significant performance, because the (implicit) behavior is unchanged. Whatever the solution, some documentation has to be added. Olivier > > > > > > > > > E.g. the second segment of a three-segment chain will still have m- > > >next != NULL, so it cannot be used as a gate to prevent accessing m- > > > >next. > > > > > > I might have overlooked something, though. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Saying that nb_segs has to be valid for the first and last > > > > segment seems > > > > > > > really odd to me. What is the logic behind that except > > keeping > > > > this test > > > > > > > as is? > > > > > > > > > > > > > > In any case, it has to be better documented. > > > > > > > > > > > > > > > > > > > > > Olivier > > > > > > > > > > > > > > > > > > > > > > > I think the approach from > > > > > > > > > this patch is safer. > > > > > > > > > > > > > > > > It might be easier from perspective that changes in less > > places > > > > are required, > > > > > > > > Though I think that this patch will introduce some > > performance > > > > drop. > > > > > > > > As now each mbuf_prefree_seg() will cause update of 2 cache > > > > lines unconditionally. > > > > > > > > > > > > > > > > > By the way, for 21.11, if we are able to do some > > > > optimizations and have > > > > > > > > > both pool (index?) and next in the first cache line, we > > may > > > > reconsider > > > > > > > > > the fact that next and nb_segs are already set for new > > > > allocated mbufs, > > > > > > > > > because it is not straightforward either. > > > > > > > > > > > > > > > > My suggestion - let's put future optimization discussion > > aside > > > > for now, > > > > > > > > and concentrate on that particular patch. > > > > > > > > > > > > > > > > > > > > > > > > > > > > After this operation, we have 2 mbuf chain: > > > > > > > > > > > - m0 with 2 segments, the last one has next=NULL but > > > > nb_seg=2 > > > > > > > > > > > - new_m with 1 segment > > > > > > > > > > > > > > > > > > > > > > Freeing m0 will not restore nb_seg=1 in the second > > > > segment. > > > > > > > > > > > > > > > > > > > > > >> Assumes that it is ok to have an mbuf with > > > > > > > > > > >> nb_seg > 1 and next == NULL. > > > > > > > > > > >> Which seems wrong to me. > > > > > > > > > > > > > > > > > > > > > > I don't think it is wrong: nb_seg is just ignored > > when > > > > not in the first > > > > > > > > > > > segment, and there is nothing saying it should be set > > to > > > > 1. Typically, > > > > > > > > > > > rte_pktmbuf_chain() does not change it, and I guess > > it's > > > > the same for > > > > > > > > > > > many similar functions in applications. > > > > > > > > > > > > > > > > > > > > > > Olivier > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >>> This is expected that mbufs stored in pool have > > their > > > > > > > > > > >>> nb_seg field set to 1. > > > > > > > > > > >>> > > > > > > > > > > >>> Fixes: 8f094a9ac5d7 ("mbuf: set mbuf fields while > > in > > > > pool") > > > > > > > > > > >>> Cc: stable@dpdk.org > > > > > > > > > > >>> > > > > > > > > > > >>> Signed-off-by: Olivier Matz > > > > > > > > > > > > >>> --- > > > > > > > > > > >>> lib/librte_mbuf/rte_mbuf.c | 6 ++---- > > > > > > > > > > >>> lib/librte_mbuf/rte_mbuf.h | 12 ++++-------- > > > > > > > > > > >>> 2 files changed, 6 insertions(+), 12 deletions(-) > > > > > > > > > > >>> > > > > > > > > > > >>> diff --git a/lib/librte_mbuf/rte_mbuf.c > > > > b/lib/librte_mbuf/rte_mbuf.c > > > > > > > > > > >>> index 8a456e5e64..e632071c23 100644 > > > > > > > > > > >>> --- a/lib/librte_mbuf/rte_mbuf.c > > > > > > > > > > >>> +++ b/lib/librte_mbuf/rte_mbuf.c > > > > > > > > > > >>> @@ -129,10 +129,8 @@ > > > > rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque) > > > > > > > > > > >>> > > > > > > > > > > >>> rte_mbuf_ext_refcnt_set(m->shinfo, 1); > > > > > > > > > > >>> m->ol_flags = EXT_ATTACHED_MBUF; > > > > > > > > > > >>> - if (m->next != NULL) { > > > > > > > > > > >>> - m->next = NULL; > > > > > > > > > > >>> - m->nb_segs = 1; > > > > > > > > > > >>> - } > > > > > > > > > > >>> + m->next = NULL; > > > > > > > > > > >>> + m->nb_segs = 1; > > > > > > > > > > >>> rte_mbuf_raw_free(m); > > > > > > > > > > >>> } > > > > > > > > > > >>> > > > > > > > > > > >>> diff --git a/lib/librte_mbuf/rte_mbuf.h > > > > b/lib/librte_mbuf/rte_mbuf.h > > > > > > > > > > >>> index a1414ed7cd..ef5800c8ef 100644 > > > > > > > > > > >>> --- a/lib/librte_mbuf/rte_mbuf.h > > > > > > > > > > >>> +++ b/lib/librte_mbuf/rte_mbuf.h > > > > > > > > > > >>> @@ -1329,10 +1329,8 @@ > > rte_pktmbuf_prefree_seg(struct > > > > rte_mbuf *m) > > > > > > > > > > >>> return NULL; > > > > > > > > > > >>> } > > > > > > > > > > >>> > > > > > > > > > > >>> - if (m->next != NULL) { > > > > > > > > > > >>> - m->next = NULL; > > > > > > > > > > >>> - m->nb_segs = 1; > > > > > > > > > > >>> - } > > > > > > > > > > >>> + m->next = NULL; > > > > > > > > > > >>> + m->nb_segs = 1; > > > > > > > > > > >>> > > > > > > > > > > >>> return m; > > > > > > > > > > >>> > > > > > > > > > > >>> @@ -1346,10 +1344,8 @@ > > rte_pktmbuf_prefree_seg(struct > > > > rte_mbuf *m) > > > > > > > > > > >>> return NULL; > > > > > > > > > > >>> } > > > > > > > > > > >>> > > > > > > > > > > >>> - if (m->next != NULL) { > > > > > > > > > > >>> - m->next = NULL; > > > > > > > > > > >>> - m->nb_segs = 1; > > > > > > > > > > >>> - } > > > > > > > > > > >>> + m->next = NULL; > > > > > > > > > > >>> + m->nb_segs = 1; > > > > > > > > > > >>> rte_mbuf_refcnt_set(m, 1); > > > > > > > > > > >>> > > > > > > > > > > >>> return m; > > > > > > > > > > >>> -- > > > > > > > > > > >>> 2.25.1 > > > > > > > > > > >> > > > > > > > > > > >