From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id 6B92AA04E6; Sun, 8 Nov 2020 15:16:37 +0100 (CET) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id BBDEFF90; Sun, 8 Nov 2020 15:16:34 +0100 (CET) Received: from shelob.oktetlabs.ru (shelob.oktetlabs.ru [91.220.146.113]) by dpdk.org (Postfix) with ESMTP id 17C03F64 for ; Sun, 8 Nov 2020 15:16:32 +0100 (CET) Received: from [192.168.1.192] (unknown [188.242.181.57]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by shelob.oktetlabs.ru (Postfix) with ESMTPSA id 323EB7F529; Sun, 8 Nov 2020 17:16:30 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 shelob.oktetlabs.ru 323EB7F529 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=oktetlabs.ru; s=default; t=1604844990; bh=sukgMpCw7Thj118YwUVF/eXmwfQ5t1kRHJtBvK+DuPA=; h=Subject:To:Cc:References:From:Date:In-Reply-To; b=wU16GCaZsxu6EoP+H+DO7tyk2oS/6PM5Wz+nT7UK4N3PGD5zWZp+t6hxDjDFQyacu ngyJl9DycPh18e7y9S1NaiKgQmr32bUUcPdcqMWMwfHfiTAqTm0uZbAPOre4Qk5ExA ReeS0GqRhgLKMkr/2bn/wpDLt7OhviVoHTo7zYjU= To: =?UTF-8?Q?Morten_Br=c3=b8rup?= , "Ananyev, Konstantin" , Olivier Matz Cc: dev@dpdk.org References: <20201105123150.GP1898@platinum> <20201105132438.GS1898@platinum> <98CBD80474FA8B44BF855DF32C47DC35C613F7@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35C613F8@smartserver.smartshare.dk> <20201106082053.GZ1898@platinum> <98CBD80474FA8B44BF855DF32C47DC35C613FA@smartserver.smartshare.dk> <20201106100437.GA1898@platinum> <98CBD80474FA8B44BF855DF32C47DC35C613FF@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35C61400@smartserver.smartshare.dk> From: Andrew Rybchenko Message-ID: Date: Sun, 8 Nov 2020 17:16:20 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.4.0 MIME-Version: 1.0 In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35C61400@smartserver.smartshare.dk> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit Subject: Re: [dpdk-dev] [PATCH] mbuf: fix reset on mbuf free X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" On 11/6/20 3:23 PM, Morten Brørup wrote: >> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com] >> Sent: Friday, November 6, 2020 12:54 PM >> >>>>>>>>>>>>>>>>>> Hi Olivier, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> m->nb_seg must be reset on mbuf free >>>> whatever >>>>>> the >>>>>>>> value >>>>>>>>>> of m->next, >>>>>>>>>>>>>>>>>>> because it can happen that m->nb_seg is >> != >>>> 1. >>>>>> For >>>>>>>>>> instance in this >>>>>>>>>>>>>>>>>>> case: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> m1 = rte_pktmbuf_alloc(mp); >>>>>>>>>>>>>>>>>>> rte_pktmbuf_append(m1, 500); >>>>>>>>>>>>>>>>>>> m2 = rte_pktmbuf_alloc(mp); >>>>>>>>>>>>>>>>>>> rte_pktmbuf_append(m2, 500); >>>>>>>>>>>>>>>>>>> rte_pktmbuf_chain(m1, m2); >>>>>>>>>>>>>>>>>>> m0 = rte_pktmbuf_alloc(mp); >>>>>>>>>>>>>>>>>>> rte_pktmbuf_append(m0, 500); >>>>>>>>>>>>>>>>>>> rte_pktmbuf_chain(m0, m1); >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> As rte_pktmbuf_chain() does not reset >>>> nb_seg in >>>>>> the >>>>>>>>>> initial m1 >>>>>>>>>>>>>>>>>>> segment (this is not required), after >> this >>>> code >>>>>> the >>>>>>>>>> mbuf chain >>>>>>>>>>>>>>>>>>> have 3 segments: >>>>>>>>>>>>>>>>>>> - m0: next=m1, nb_seg=3 >>>>>>>>>>>>>>>>>>> - m1: next=m2, nb_seg=2 >>>>>>>>>>>>>>>>>>> - m2: next=NULL, nb_seg=1 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Freeing this mbuf chain will not >> restore >>>>>> nb_seg=1 >>>>>>>> in >>>>>>>>>> the second >>>>>>>>>>>>>>>>>>> segment. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hmm, not sure why is that? >>>>>>>>>>>>>>>>>> You are talking about freeing m1, right? >>>>>>>>>>>>>>>>>> rte_pktmbuf_prefree_seg(struct rte_mbuf >> *m) >>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>>>> if (m->next != NULL) { >>>>>>>>>>>>>>>>>> m->next = NULL; >>>>>>>>>>>>>>>>>> m->nb_segs = 1; >>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> m1->next != NULL, so it will enter the >> if() >>>>>> block, >>>>>>>>>>>>>>>>>> and will reset both next and nb_segs. >>>>>>>>>>>>>>>>>> What I am missing here? >>>>>>>>>>>>>>>>>> Thinking in more generic way, that >> change: >>>>>>>>>>>>>>>>>> - if (m->next != NULL) { >>>>>>>>>>>>>>>>>> - m->next = NULL; >>>>>>>>>>>>>>>>>> - m->nb_segs = 1; >>>>>>>>>>>>>>>>>> - } >>>>>>>>>>>>>>>>>> + m->next = NULL; >>>>>>>>>>>>>>>>>> + m->nb_segs = 1; >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ah, sorry. I oversimplified the example >> and >>>> now >>>>>> it >>>>>>>> does >>>>>>>>>> not >>>>>>>>>>>>>>>>> show the issue... >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The full example also adds a split() to >> break >>>> the >>>>>>>> mbuf >>>>>>>>>> chain >>>>>>>>>>>>>>>>> between m1 and m2. The kind of thing that >>>> would >>>>>> be >>>>>>>> done >>>>>>>>>> for >>>>>>>>>>>>>>>>> software TCP segmentation. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> If so, may be the right solution is to care >>>> about >>>>>>>> nb_segs >>>>>>>>>>>>>>>> when next is set to NULL on split? Any >> place >>>> when >>>>>> next >>>>>>>> is >>>>>>>>>> set >>>>>>>>>>>>>>>> to NULL. Just to keep the optimization in a >>>> more >>>>>>>> generic >>>>>>>>>> place. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> The problem with that approach is that there >> are >>>>>> already >>>>>>>>>> several >>>>>>>>>>>>>>> existing split() or trim() implementations in >>>>>> different >>>>>>>> DPDK- >>>>>>>>>> based >>>>>>>>>>>>>>> applications. For instance, we have some in >>>>>> 6WINDGate. If >>>>>>>> we >>>>>>>>>> force >>>>>>>>>>>>>>> applications to set nb_seg to 1 when >> resetting >>>> next, >>>>>> it >>>>>>>> has >>>>>>>>>> to be >>>>>>>>>>>>>>> documented because it is not straightforward. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I think it is better to go that way. >>>>>>>>>>>>>> From my perspective it seems natural to reset >>>> nb_seg at >>>>>>>> same >>>>>>>>>> time >>>>>>>>>>>>>> we reset next, otherwise inconsistency will >> occur. >>>>>>>>>>>>> >>>>>>>>>>>>> While it is not explicitly stated for nb_segs, to >> me >>>> it >>>>>> was >>>>>>>> clear >>>>>>>>>> that >>>>>>>>>>>>> nb_segs is only valid in the first segment, like >> for >>>> many >>>>>>>> fields >>>>>>>>>> (port, >>>>>>>>>>>>> ol_flags, vlan, rss, ...). >>>>>>>>>>>>> >>>>>>>>>>>>> If we say that nb_segs has to be valid in any >>>> segments, >>>>>> it >>>>>>>> means >>>>>>>>>> that >>>>>>>>>>>>> chain() or split() will have to update it in all >>>>>> segments, >>>>>>>> which >>>>>>>>>> is not >>>>>>>>>>>>> efficient. >>>>>>>>>>>> >>>>>>>>>>>> Why in all? >>>>>>>>>>>> We can state that nb_segs on non-first segment >> should >>>>>> always >>>>>>>> equal >>>>>>>>>> 1. >>>>>>>>>>>> As I understand in that case, both split() and >> chain() >>>> have >>>>>> to >>>>>>>>>> update nb_segs >>>>>>>>>>>> only for head mbufs, rest ones will remain >> untouched. >>>>>>>>>>> >>>>>>>>>>> Well, anyway, I think it's strange to have a >> constraint >>>> on m- >>>>>>>>> nb_segs >>>>>>>>>> for >>>>>>>>>>> non-first segment. We don't have that kind of >> constraints >>>> for >>>>>>>> other >>>>>>>>>> fields. >>>>>>>>>> >>>>>>>>>> True, we don't. But this is one of the fields we >> consider >>>>>> critical >>>>>>>>>> for proper work of mbuf alloc/free mechanism. >>>>>>>>>> >>>>>>>>> >>>>>>>>> I am not sure that requiring m->nb_segs == 1 on non-first >>>>>> segments >>>>>>>> will provide any benefits. >>>>>>>> >>>>>>>> It would make this patch unneeded. >>>>>>>> So, for direct, non-segmented mbufs pktmbuf_free() will >> remain >>>>>> write- >>>>>>>> free. >>>>>>> >>>>>>> I see. Then I agree with Konstantin that alternative >> solutions >>>> should >>>>>> be considered. >>>>>>> >>>>>>> The benefit regarding free()'ing non-segmented mbufs - which >> is a >>>>>> very common operation - certainly outweighs the cost of >> requiring >>>>>> split()/chain() operations to set the new head mbuf's nb_segs = >> 1. >>>>>>> >>>>>>> Nonetheless, the bug needs to be fixed somehow. >>>>>>> >>>>>>> If we can't come up with a better solution that doesn't break >> the >>>>>> ABI, we are forced to accept the patch. >>>>>>> >>>>>>> Unless the techboard accepts to break the ABI in order to >> avoid >>>> the >>>>>> performance cost of this patch. >>>>>> >>>>>> Did someone notice a performance drop with this patch? >>>>>> On my side, I don't see any regression on a L3 use case. >>>>> >>>>> I am afraid that the DPDK performance regression tests are based >> on >>>> TX immediately following RX, so cache misses in TX may go by >> unnoticed >>>> because RX warmed up the cache for TX already. And similarly for RX >>>> reusing mbufs that have been warmed up by the preceding free() at >> TX. >>>>> >>>>> Please consider testing the performance difference with the mbuf >>>> being completely cold at TX, and going completely cold again before >>>> being reused for RX. >>>>> >>>>>> >>>>>> Let's sumarize: splitting a mbuf chain and freeing it causes >>>> subsequent >>>>>> mbuf >>>>>> allocation to return a mbuf which is not correctly initialized. >>>> There >>>>>> are 2 >>>>>> options to fix it: >>>>>> >>>>>> 1/ change the mbuf free function (this patch) >>>>>> >>>>>> - m->nb_segs would behave like many other field: valid in >> the >>>> first >>>>>> segment, ignored in other segments >>>>>> - may impact performance (suspected) >>>>>> >>>>>> 2/ change all places where a mbuf chain is split, or trimmed >>>>>> >>>>>> - m->nb_segs would have a specific behavior: count the >> number of >>>>>> segments in the first mbuf, should be 1 in the last >> segment, >>>>>> ignored in other ones. >>>>>> - no code change in mbuf library, so no performance impact >>>>>> - need to patch all places where we do a mbuf split or trim. >>>> From >>>>>> afar, >>>>>> I see at least mbuf_cut_seg_ofs() in DPDK. Some external >>>>>> applications >>>>>> may have to be patched (for instance, I already found 3 >> places >>>> in >>>>>> 6WIND code base without a deep search). >>>>>> >>>>>> In my opinion, 1/ is better, except we notice a significant >>>>>> performance, >>>>>> because the (implicit) behavior is unchanged. >>>>>> >>>>>> Whatever the solution, some documentation has to be added. >>>>>> >>>>>> Olivier >>>>>> >>>>> >>>>> Unfortunately, I don't think that anything but the first option >> will >>>> go into 20.11 and stable releases of older versions, so I stand by >> my >>>> acknowledgment of the patch. >>>> >>>> If we are affraid about 20.11 performance (it is legitimate, few >> days >>>> before the release), we can target 21.02. After all, everybody >> lives >>>> with this bug since 2017, so there is no urgency. If accepted and >> well >>>> tested, it can be backported in stable branches. >>> >>> +1 >>> >>> Good thinking, Olivier! >> >> Looking at the changes once again, it probably can be reworked a bit: >> >> - if (m->next != NULL) { >> - m->next = NULL; >> - m->nb_segs = 1; >> - } >> >> + if (m->next != NULL) >> + m->next = NULL; >> + if (m->nb_segs != 1) >> + m->nb_segs = 1; >> >> That way we add one more condition checking, but I suppose it >> shouldn't be that perf critical. >> That way for direct,non-segmented mbuf it still should be write-free. >> Except cases as you described above: chain(), then split(). >> >> Of-course we still need to do perf testing for that approach too. >> So if your preference it to postpone it till 21.02 - that's ok for me. >> Konstantin > > With this suggestion, I cannot imagine any performance drop for direct, non-segmented mbufs: It now reads m->nb_segs, residing in the mbuf's first cache line, but the function already reads m->refcnt in the first cache line; so no cache misses are introduced. +1