From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id ECF80A04DD; Tue, 10 Nov 2020 17:26:52 +0100 (CET) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id C91002B9D; Tue, 10 Nov 2020 17:26:51 +0100 (CET) Received: from mail-wr1-f67.google.com (mail-wr1-f67.google.com [209.85.221.67]) by dpdk.org (Postfix) with ESMTP id 0018BF64 for ; Tue, 10 Nov 2020 17:26:49 +0100 (CET) Received: by mail-wr1-f67.google.com with SMTP id c17so13324454wrc.11 for ; Tue, 10 Nov 2020 08:26:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=6wind.com; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to :user-agent; bh=Sdtde22QVMyB7iFo76sSn78yO38NkAOHZfU7lqSOjng=; b=JLdFFIYi6TEUEE+OHuA3Y3GLh3NSwUshdRoEaYH42p76UBrSCeTolOIu78jgWUodyg GXBbjRdY/Zl0ZIHrprsfnyoHN6tvSAqWr5uPtHCEnIw7dC9ppu7XuFxGVhaZNOMai/mY CLfHGdGKrBwKMeoMlCFa0fVy660HQreeJJZk8jBIAfN3A89ao5katPXv8uBD9V8fZUv8 coQbUpVUA79RZ2CrvzHLyytZTOXg8lKyyrnY1109+/b2JRdTf2Jd+H9kfiNxbe8s28EJ JPaQyFy4+H1vUoZKJ31AqF+hqf7Q5SLj7CIresQKKjuVkPOoRvSPunwkVnUkVonB/RDJ va0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to:user-agent; bh=Sdtde22QVMyB7iFo76sSn78yO38NkAOHZfU7lqSOjng=; b=V5RWkEXVI2QDlzAas744rjafxWJO/UbfINoLeVAHsxO6xmQUl6ahk5C5vdJp6LDoci NW2ZlktTHVXnv6VClVaUGnH1toYEiI0iWXSnLrjzIrfP4PpG5NbLMYKrTPBygjLHrSlz u7lxkl5lmpcp2+elnRm0513l+uOrJ66ot/wf3kUS8gQwfmpVT860Ad17jR5W/4dJvRlL aBWYJtG60J8HVpl1GL1vH7Zya84KoGY1XkfjcVTqsOnPEUlb9lXDoH+YaZC0UTPlf367 6MJkAk7qY6ZGQQl+n1kjsJo6TOG/2dzgEjPsvbZDUXmvboUPtTOx6t2Ho/bTvDMnxizb xrtQ== X-Gm-Message-State: AOAM531zLxPzuBy8jNhEWMLyKO51meTxP/oIlBA0OVxbFEpIk+JZe1LD 5t6a5/gG+uXwWAB0Fj9srmwQcRktnud8Og== X-Google-Smtp-Source: ABdhPJzI6gOuC4iz3lySfMo6Hezv4ulevd8XCqcPlabmPvu2Y/R8AlErlGHfJZZOc7gfFiQvSvCTow== X-Received: by 2002:adf:f608:: with SMTP id t8mr13750988wrp.72.1605025608664; Tue, 10 Nov 2020 08:26:48 -0800 (PST) Received: from 6wind.com (2a01cb0c0005a600345636f7e65ed1a0.ipv6.abo.wanadoo.fr. [2a01:cb0c:5:a600:3456:36f7:e65e:d1a0]) by smtp.gmail.com with ESMTPSA id w15sm1944353wrp.52.2020.11.10.08.26.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 10 Nov 2020 08:26:47 -0800 (PST) Date: Tue, 10 Nov 2020 17:26:47 +0100 From: Olivier Matz To: "Ananyev, Konstantin" Cc: Andrew Rybchenko , Morten =?iso-8859-1?Q?Br=F8rup?= , "dev@dpdk.org" Message-ID: <20201110162647.GO1898@platinum> References: <98CBD80474FA8B44BF855DF32C47DC35C613F8@smartserver.smartshare.dk> <20201106082053.GZ1898@platinum> <98CBD80474FA8B44BF855DF32C47DC35C613FA@smartserver.smartshare.dk> <20201106100437.GA1898@platinum> <98CBD80474FA8B44BF855DF32C47DC35C613FF@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35C61400@smartserver.smartshare.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) Subject: Re: [dpdk-dev] [PATCH] mbuf: fix reset on mbuf free X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" On Sun, Nov 08, 2020 at 02:19:55PM +0000, Ananyev, Konstantin wrote: > > > > >> > > >>>>>>>>>>>>>>>>>> Hi Olivier, > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> m->nb_seg must be reset on mbuf free > > >>>> whatever > > >>>>>> the > > >>>>>>>> value > > >>>>>>>>>> of m->next, > > >>>>>>>>>>>>>>>>>>> because it can happen that m->nb_seg is > > >> != > > >>>> 1. > > >>>>>> For > > >>>>>>>>>> instance in this > > >>>>>>>>>>>>>>>>>>> case: > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> m1 = rte_pktmbuf_alloc(mp); > > >>>>>>>>>>>>>>>>>>> rte_pktmbuf_append(m1, 500); > > >>>>>>>>>>>>>>>>>>> m2 = rte_pktmbuf_alloc(mp); > > >>>>>>>>>>>>>>>>>>> rte_pktmbuf_append(m2, 500); > > >>>>>>>>>>>>>>>>>>> rte_pktmbuf_chain(m1, m2); > > >>>>>>>>>>>>>>>>>>> m0 = rte_pktmbuf_alloc(mp); > > >>>>>>>>>>>>>>>>>>> rte_pktmbuf_append(m0, 500); > > >>>>>>>>>>>>>>>>>>> rte_pktmbuf_chain(m0, m1); > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> As rte_pktmbuf_chain() does not reset > > >>>> nb_seg in > > >>>>>> the > > >>>>>>>>>> initial m1 > > >>>>>>>>>>>>>>>>>>> segment (this is not required), after > > >> this > > >>>> code > > >>>>>> the > > >>>>>>>>>> mbuf chain > > >>>>>>>>>>>>>>>>>>> have 3 segments: > > >>>>>>>>>>>>>>>>>>> - m0: next=m1, nb_seg=3 > > >>>>>>>>>>>>>>>>>>> - m1: next=m2, nb_seg=2 > > >>>>>>>>>>>>>>>>>>> - m2: next=NULL, nb_seg=1 > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Freeing this mbuf chain will not > > >> restore > > >>>>>> nb_seg=1 > > >>>>>>>> in > > >>>>>>>>>> the second > > >>>>>>>>>>>>>>>>>>> segment. > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> Hmm, not sure why is that? > > >>>>>>>>>>>>>>>>>> You are talking about freeing m1, right? > > >>>>>>>>>>>>>>>>>> rte_pktmbuf_prefree_seg(struct rte_mbuf > > >> *m) > > >>>>>>>>>>>>>>>>>> { > > >>>>>>>>>>>>>>>>>> ... > > >>>>>>>>>>>>>>>>>> if (m->next != NULL) { > > >>>>>>>>>>>>>>>>>> m->next = NULL; > > >>>>>>>>>>>>>>>>>> m->nb_segs = 1; > > >>>>>>>>>>>>>>>>>> } > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> m1->next != NULL, so it will enter the > > >> if() > > >>>>>> block, > > >>>>>>>>>>>>>>>>>> and will reset both next and nb_segs. > > >>>>>>>>>>>>>>>>>> What I am missing here? > > >>>>>>>>>>>>>>>>>> Thinking in more generic way, that > > >> change: > > >>>>>>>>>>>>>>>>>> - if (m->next != NULL) { > > >>>>>>>>>>>>>>>>>> - m->next = NULL; > > >>>>>>>>>>>>>>>>>> - m->nb_segs = 1; > > >>>>>>>>>>>>>>>>>> - } > > >>>>>>>>>>>>>>>>>> + m->next = NULL; > > >>>>>>>>>>>>>>>>>> + m->nb_segs = 1; > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Ah, sorry. I oversimplified the example > > >> and > > >>>> now > > >>>>>> it > > >>>>>>>> does > > >>>>>>>>>> not > > >>>>>>>>>>>>>>>>> show the issue... > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> The full example also adds a split() to > > >> break > > >>>> the > > >>>>>>>> mbuf > > >>>>>>>>>> chain > > >>>>>>>>>>>>>>>>> between m1 and m2. The kind of thing that > > >>>> would > > >>>>>> be > > >>>>>>>> done > > >>>>>>>>>> for > > >>>>>>>>>>>>>>>>> software TCP segmentation. > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> If so, may be the right solution is to care > > >>>> about > > >>>>>>>> nb_segs > > >>>>>>>>>>>>>>>> when next is set to NULL on split? Any > > >> place > > >>>> when > > >>>>>> next > > >>>>>>>> is > > >>>>>>>>>> set > > >>>>>>>>>>>>>>>> to NULL. Just to keep the optimization in a > > >>>> more > > >>>>>>>> generic > > >>>>>>>>>> place. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> The problem with that approach is that there > > >> are > > >>>>>> already > > >>>>>>>>>> several > > >>>>>>>>>>>>>>> existing split() or trim() implementations in > > >>>>>> different > > >>>>>>>> DPDK- > > >>>>>>>>>> based > > >>>>>>>>>>>>>>> applications. For instance, we have some in > > >>>>>> 6WINDGate. If > > >>>>>>>> we > > >>>>>>>>>> force > > >>>>>>>>>>>>>>> applications to set nb_seg to 1 when > > >> resetting > > >>>> next, > > >>>>>> it > > >>>>>>>> has > > >>>>>>>>>> to be > > >>>>>>>>>>>>>>> documented because it is not straightforward. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> I think it is better to go that way. > > >>>>>>>>>>>>>> From my perspective it seems natural to reset > > >>>> nb_seg at > > >>>>>>>> same > > >>>>>>>>>> time > > >>>>>>>>>>>>>> we reset next, otherwise inconsistency will > > >> occur. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> While it is not explicitly stated for nb_segs, to > > >> me > > >>>> it > > >>>>>> was > > >>>>>>>> clear > > >>>>>>>>>> that > > >>>>>>>>>>>>> nb_segs is only valid in the first segment, like > > >> for > > >>>> many > > >>>>>>>> fields > > >>>>>>>>>> (port, > > >>>>>>>>>>>>> ol_flags, vlan, rss, ...). > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> If we say that nb_segs has to be valid in any > > >>>> segments, > > >>>>>> it > > >>>>>>>> means > > >>>>>>>>>> that > > >>>>>>>>>>>>> chain() or split() will have to update it in all > > >>>>>> segments, > > >>>>>>>> which > > >>>>>>>>>> is not > > >>>>>>>>>>>>> efficient. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Why in all? > > >>>>>>>>>>>> We can state that nb_segs on non-first segment > > >> should > > >>>>>> always > > >>>>>>>> equal > > >>>>>>>>>> 1. > > >>>>>>>>>>>> As I understand in that case, both split() and > > >> chain() > > >>>> have > > >>>>>> to > > >>>>>>>>>> update nb_segs > > >>>>>>>>>>>> only for head mbufs, rest ones will remain > > >> untouched. > > >>>>>>>>>>> > > >>>>>>>>>>> Well, anyway, I think it's strange to have a > > >> constraint > > >>>> on m- > > >>>>>>>>> nb_segs > > >>>>>>>>>> for > > >>>>>>>>>>> non-first segment. We don't have that kind of > > >> constraints > > >>>> for > > >>>>>>>> other > > >>>>>>>>>> fields. > > >>>>>>>>>> > > >>>>>>>>>> True, we don't. But this is one of the fields we > > >> consider > > >>>>>> critical > > >>>>>>>>>> for proper work of mbuf alloc/free mechanism. > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> I am not sure that requiring m->nb_segs == 1 on non-first > > >>>>>> segments > > >>>>>>>> will provide any benefits. > > >>>>>>>> > > >>>>>>>> It would make this patch unneeded. > > >>>>>>>> So, for direct, non-segmented mbufs pktmbuf_free() will > > >> remain > > >>>>>> write- > > >>>>>>>> free. > > >>>>>>> > > >>>>>>> I see. Then I agree with Konstantin that alternative > > >> solutions > > >>>> should > > >>>>>> be considered. > > >>>>>>> > > >>>>>>> The benefit regarding free()'ing non-segmented mbufs - which > > >> is a > > >>>>>> very common operation - certainly outweighs the cost of > > >> requiring > > >>>>>> split()/chain() operations to set the new head mbuf's nb_segs = > > >> 1. > > >>>>>>> > > >>>>>>> Nonetheless, the bug needs to be fixed somehow. > > >>>>>>> > > >>>>>>> If we can't come up with a better solution that doesn't break > > >> the > > >>>>>> ABI, we are forced to accept the patch. > > >>>>>>> > > >>>>>>> Unless the techboard accepts to break the ABI in order to > > >> avoid > > >>>> the > > >>>>>> performance cost of this patch. > > >>>>>> > > >>>>>> Did someone notice a performance drop with this patch? > > >>>>>> On my side, I don't see any regression on a L3 use case. > > >>>>> > > >>>>> I am afraid that the DPDK performance regression tests are based > > >> on > > >>>> TX immediately following RX, so cache misses in TX may go by > > >> unnoticed > > >>>> because RX warmed up the cache for TX already. And similarly for RX > > >>>> reusing mbufs that have been warmed up by the preceding free() at > > >> TX. > > >>>>> > > >>>>> Please consider testing the performance difference with the mbuf > > >>>> being completely cold at TX, and going completely cold again before > > >>>> being reused for RX. > > >>>>> > > >>>>>> > > >>>>>> Let's sumarize: splitting a mbuf chain and freeing it causes > > >>>> subsequent > > >>>>>> mbuf > > >>>>>> allocation to return a mbuf which is not correctly initialized. > > >>>> There > > >>>>>> are 2 > > >>>>>> options to fix it: > > >>>>>> > > >>>>>> 1/ change the mbuf free function (this patch) > > >>>>>> > > >>>>>> - m->nb_segs would behave like many other field: valid in > > >> the > > >>>> first > > >>>>>> segment, ignored in other segments > > >>>>>> - may impact performance (suspected) > > >>>>>> > > >>>>>> 2/ change all places where a mbuf chain is split, or trimmed > > >>>>>> > > >>>>>> - m->nb_segs would have a specific behavior: count the > > >> number of > > >>>>>> segments in the first mbuf, should be 1 in the last > > >> segment, > > >>>>>> ignored in other ones. > > >>>>>> - no code change in mbuf library, so no performance impact > > >>>>>> - need to patch all places where we do a mbuf split or trim. > > >>>> From > > >>>>>> afar, > > >>>>>> I see at least mbuf_cut_seg_ofs() in DPDK. Some external > > >>>>>> applications > > >>>>>> may have to be patched (for instance, I already found 3 > > >> places > > >>>> in > > >>>>>> 6WIND code base without a deep search). > > >>>>>> > > >>>>>> In my opinion, 1/ is better, except we notice a significant > > >>>>>> performance, > > >>>>>> because the (implicit) behavior is unchanged. > > >>>>>> > > >>>>>> Whatever the solution, some documentation has to be added. > > >>>>>> > > >>>>>> Olivier > > >>>>>> > > >>>>> > > >>>>> Unfortunately, I don't think that anything but the first option > > >> will > > >>>> go into 20.11 and stable releases of older versions, so I stand by > > >> my > > >>>> acknowledgment of the patch. > > >>>> > > >>>> If we are affraid about 20.11 performance (it is legitimate, few > > >> days > > >>>> before the release), we can target 21.02. After all, everybody > > >> lives > > >>>> with this bug since 2017, so there is no urgency. If accepted and > > >> well > > >>>> tested, it can be backported in stable branches. > > >>> > > >>> +1 > > >>> > > >>> Good thinking, Olivier! > > >> > > >> Looking at the changes once again, it probably can be reworked a bit: > > >> > > >> - if (m->next != NULL) { > > >> - m->next = NULL; > > >> - m->nb_segs = 1; > > >> - } > > >> > > >> + if (m->next != NULL) > > >> + m->next = NULL; > > >> + if (m->nb_segs != 1) > > >> + m->nb_segs = 1; > > >> > > >> That way we add one more condition checking, but I suppose it > > >> shouldn't be that perf critical. > > >> That way for direct,non-segmented mbuf it still should be write-free. > > >> Except cases as you described above: chain(), then split(). > > >> > > >> Of-course we still need to do perf testing for that approach too. > > >> So if your preference it to postpone it till 21.02 - that's ok for me. > > >> Konstantin > > > > > > With this suggestion, I cannot imagine any performance drop for direct, non-segmented mbufs: It now reads m->nb_segs, residing in the > > mbuf's first cache line, but the function already reads m->refcnt in the first cache line; so no cache misses are introduced. > > > > +1 > > I don't expect perf drop with that approach either. > But some perf testing still needs to be done, just in case 😊 I also agree with your suggestion Konstantin. Let's postpone it right after 20.11 so we have more time to test. I'll send a v2. Olivier