From: Olivier Matz <olivier.matz@6wind.com>
To: "Morten Brørup" <mb@smartsharesystems.com>
Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>,
dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH] mbuf: fix reset on mbuf free
Date: Fri, 6 Nov 2020 09:20:53 +0100 [thread overview]
Message-ID: <20201106082053.GZ1898@platinum> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35C613F8@smartserver.smartshare.dk>
On Fri, Nov 06, 2020 at 08:52:58AM +0100, Morten Brørup wrote:
> > From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> > Sent: Friday, November 6, 2020 12:55 AM
> >
> > > > > > > > > > >> Hi Olivier,
> > > > > > > > > > >>
> > > > > > > > > > >>> m->nb_seg must be reset on mbuf free whatever the
> > value
> > > > of m->next,
> > > > > > > > > > >>> because it can happen that m->nb_seg is != 1. For
> > > > instance in this
> > > > > > > > > > >>> case:
> > > > > > > > > > >>>
> > > > > > > > > > >>> m1 = rte_pktmbuf_alloc(mp);
> > > > > > > > > > >>> rte_pktmbuf_append(m1, 500);
> > > > > > > > > > >>> m2 = rte_pktmbuf_alloc(mp);
> > > > > > > > > > >>> rte_pktmbuf_append(m2, 500);
> > > > > > > > > > >>> rte_pktmbuf_chain(m1, m2);
> > > > > > > > > > >>> m0 = rte_pktmbuf_alloc(mp);
> > > > > > > > > > >>> rte_pktmbuf_append(m0, 500);
> > > > > > > > > > >>> rte_pktmbuf_chain(m0, m1);
> > > > > > > > > > >>>
> > > > > > > > > > >>> As rte_pktmbuf_chain() does not reset nb_seg in the
> > > > initial m1
> > > > > > > > > > >>> segment (this is not required), after this code the
> > > > mbuf chain
> > > > > > > > > > >>> have 3 segments:
> > > > > > > > > > >>> - m0: next=m1, nb_seg=3
> > > > > > > > > > >>> - m1: next=m2, nb_seg=2
> > > > > > > > > > >>> - m2: next=NULL, nb_seg=1
> > > > > > > > > > >>>
> > > > > > > > > > >>> Freeing this mbuf chain will not restore nb_seg=1
> > in
> > > > the second
> > > > > > > > > > >>> segment.
> > > > > > > > > > >>
> > > > > > > > > > >> Hmm, not sure why is that?
> > > > > > > > > > >> You are talking about freeing m1, right?
> > > > > > > > > > >> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > > > > > > > > > >> {
> > > > > > > > > > >> ...
> > > > > > > > > > >> if (m->next != NULL) {
> > > > > > > > > > >> m->next = NULL;
> > > > > > > > > > >> m->nb_segs = 1;
> > > > > > > > > > >> }
> > > > > > > > > > >>
> > > > > > > > > > >> m1->next != NULL, so it will enter the if() block,
> > > > > > > > > > >> and will reset both next and nb_segs.
> > > > > > > > > > >> What I am missing here?
> > > > > > > > > > >> Thinking in more generic way, that change:
> > > > > > > > > > >> - if (m->next != NULL) {
> > > > > > > > > > >> - m->next = NULL;
> > > > > > > > > > >> - m->nb_segs = 1;
> > > > > > > > > > >> - }
> > > > > > > > > > >> + m->next = NULL;
> > > > > > > > > > >> + m->nb_segs = 1;
> > > > > > > > > > >
> > > > > > > > > > > Ah, sorry. I oversimplified the example and now it
> > does
> > > > not
> > > > > > > > > > > show the issue...
> > > > > > > > > > >
> > > > > > > > > > > The full example also adds a split() to break the
> > mbuf
> > > > chain
> > > > > > > > > > > between m1 and m2. The kind of thing that would be
> > done
> > > > for
> > > > > > > > > > > software TCP segmentation.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > If so, may be the right solution is to care about
> > nb_segs
> > > > > > > > > > when next is set to NULL on split? Any place when next
> > is
> > > > set
> > > > > > > > > > to NULL. Just to keep the optimization in a more
> > generic
> > > > place.
> > > > > > > >
> > > > > > > >
> > > > > > > > > The problem with that approach is that there are already
> > > > several
> > > > > > > > > existing split() or trim() implementations in different
> > DPDK-
> > > > based
> > > > > > > > > applications. For instance, we have some in 6WINDGate. If
> > we
> > > > force
> > > > > > > > > applications to set nb_seg to 1 when resetting next, it
> > has
> > > > to be
> > > > > > > > > documented because it is not straightforward.
> > > > > > > >
> > > > > > > > I think it is better to go that way.
> > > > > > > > From my perspective it seems natural to reset nb_seg at
> > same
> > > > time
> > > > > > > > we reset next, otherwise inconsistency will occur.
> > > > > > >
> > > > > > > While it is not explicitly stated for nb_segs, to me it was
> > clear
> > > > that
> > > > > > > nb_segs is only valid in the first segment, like for many
> > fields
> > > > (port,
> > > > > > > ol_flags, vlan, rss, ...).
> > > > > > >
> > > > > > > If we say that nb_segs has to be valid in any segments, it
> > means
> > > > that
> > > > > > > chain() or split() will have to update it in all segments,
> > which
> > > > is not
> > > > > > > efficient.
> > > > > >
> > > > > > Why in all?
> > > > > > We can state that nb_segs on non-first segment should always
> > equal
> > > > 1.
> > > > > > As I understand in that case, both split() and chain() have to
> > > > update nb_segs
> > > > > > only for head mbufs, rest ones will remain untouched.
> > > > >
> > > > > Well, anyway, I think it's strange to have a constraint on m-
> > >nb_segs
> > > > for
> > > > > non-first segment. We don't have that kind of constraints for
> > other
> > > > fields.
> > > >
> > > > True, we don't. But this is one of the fields we consider critical
> > > > for proper work of mbuf alloc/free mechanism.
> > > >
> > >
> > > I am not sure that requiring m->nb_segs == 1 on non-first segments
> > will provide any benefits.
> >
> > It would make this patch unneeded.
> > So, for direct, non-segmented mbufs pktmbuf_free() will remain write-
> > free.
>
> I see. Then I agree with Konstantin that alternative solutions should be considered.
>
> The benefit regarding free()'ing non-segmented mbufs - which is a very common operation - certainly outweighs the cost of requiring split()/chain() operations to set the new head mbuf's nb_segs = 1.
>
> Nonetheless, the bug needs to be fixed somehow.
>
> If we can't come up with a better solution that doesn't break the ABI, we are forced to accept the patch.
>
> Unless the techboard accepts to break the ABI in order to avoid the performance cost of this patch.
Did someone notice a performance drop with this patch?
On my side, I don't see any regression on a L3 use case.
Let's sumarize: splitting a mbuf chain and freeing it causes subsequent mbuf
allocation to return a mbuf which is not correctly initialized. There are 2
options to fix it:
1/ change the mbuf free function (this patch)
- m->nb_segs would behave like many other field: valid in the first
segment, ignored in other segments
- may impact performance (suspected)
2/ change all places where a mbuf chain is split, or trimmed
- m->nb_segs would have a specific behavior: count the number of
segments in the first mbuf, should be 1 in the last segment,
ignored in other ones.
- no code change in mbuf library, so no performance impact
- need to patch all places where we do a mbuf split or trim. From afar,
I see at least mbuf_cut_seg_ofs() in DPDK. Some external applications
may have to be patched (for instance, I already found 3 places in
6WIND code base without a deep search).
In my opinion, 1/ is better, except we notice a significant performance,
because the (implicit) behavior is unchanged.
Whatever the solution, some documentation has to be added.
Olivier
>
> >
> > >
> > > E.g. the second segment of a three-segment chain will still have m-
> > >next != NULL, so it cannot be used as a gate to prevent accessing m-
> > > >next.
> > >
> > > I might have overlooked something, though.
> > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Saying that nb_segs has to be valid for the first and last
> > > > segment seems
> > > > > > > really odd to me. What is the logic behind that except
> > keeping
> > > > this test
> > > > > > > as is?
> > > > > > >
> > > > > > > In any case, it has to be better documented.
> > > > > > >
> > > > > > >
> > > > > > > Olivier
> > > > > > >
> > > > > > >
> > > > > > > > > I think the approach from
> > > > > > > > > this patch is safer.
> > > > > > > >
> > > > > > > > It might be easier from perspective that changes in less
> > places
> > > > are required,
> > > > > > > > Though I think that this patch will introduce some
> > performance
> > > > drop.
> > > > > > > > As now each mbuf_prefree_seg() will cause update of 2 cache
> > > > lines unconditionally.
> > > > > > > >
> > > > > > > > > By the way, for 21.11, if we are able to do some
> > > > optimizations and have
> > > > > > > > > both pool (index?) and next in the first cache line, we
> > may
> > > > reconsider
> > > > > > > > > the fact that next and nb_segs are already set for new
> > > > allocated mbufs,
> > > > > > > > > because it is not straightforward either.
> > > > > > > >
> > > > > > > > My suggestion - let's put future optimization discussion
> > aside
> > > > for now,
> > > > > > > > and concentrate on that particular patch.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > > After this operation, we have 2 mbuf chain:
> > > > > > > > > > > - m0 with 2 segments, the last one has next=NULL but
> > > > nb_seg=2
> > > > > > > > > > > - new_m with 1 segment
> > > > > > > > > > >
> > > > > > > > > > > Freeing m0 will not restore nb_seg=1 in the second
> > > > segment.
> > > > > > > > > > >
> > > > > > > > > > >> Assumes that it is ok to have an mbuf with
> > > > > > > > > > >> nb_seg > 1 and next == NULL.
> > > > > > > > > > >> Which seems wrong to me.
> > > > > > > > > > >
> > > > > > > > > > > I don't think it is wrong: nb_seg is just ignored
> > when
> > > > not in the first
> > > > > > > > > > > segment, and there is nothing saying it should be set
> > to
> > > > 1. Typically,
> > > > > > > > > > > rte_pktmbuf_chain() does not change it, and I guess
> > it's
> > > > the same for
> > > > > > > > > > > many similar functions in applications.
> > > > > > > > > > >
> > > > > > > > > > > Olivier
> > > > > > > > > > >
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>> This is expected that mbufs stored in pool have
> > their
> > > > > > > > > > >>> nb_seg field set to 1.
> > > > > > > > > > >>>
> > > > > > > > > > >>> Fixes: 8f094a9ac5d7 ("mbuf: set mbuf fields while
> > in
> > > > pool")
> > > > > > > > > > >>> Cc: stable@dpdk.org
> > > > > > > > > > >>>
> > > > > > > > > > >>> Signed-off-by: Olivier Matz
> > <olivier.matz@6wind.com>
> > > > > > > > > > >>> ---
> > > > > > > > > > >>> lib/librte_mbuf/rte_mbuf.c | 6 ++----
> > > > > > > > > > >>> lib/librte_mbuf/rte_mbuf.h | 12 ++++--------
> > > > > > > > > > >>> 2 files changed, 6 insertions(+), 12 deletions(-)
> > > > > > > > > > >>>
> > > > > > > > > > >>> diff --git a/lib/librte_mbuf/rte_mbuf.c
> > > > b/lib/librte_mbuf/rte_mbuf.c
> > > > > > > > > > >>> index 8a456e5e64..e632071c23 100644
> > > > > > > > > > >>> --- a/lib/librte_mbuf/rte_mbuf.c
> > > > > > > > > > >>> +++ b/lib/librte_mbuf/rte_mbuf.c
> > > > > > > > > > >>> @@ -129,10 +129,8 @@
> > > > rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
> > > > > > > > > > >>>
> > > > > > > > > > >>> rte_mbuf_ext_refcnt_set(m->shinfo, 1);
> > > > > > > > > > >>> m->ol_flags = EXT_ATTACHED_MBUF;
> > > > > > > > > > >>> - if (m->next != NULL) {
> > > > > > > > > > >>> - m->next = NULL;
> > > > > > > > > > >>> - m->nb_segs = 1;
> > > > > > > > > > >>> - }
> > > > > > > > > > >>> + m->next = NULL;
> > > > > > > > > > >>> + m->nb_segs = 1;
> > > > > > > > > > >>> rte_mbuf_raw_free(m);
> > > > > > > > > > >>> }
> > > > > > > > > > >>>
> > > > > > > > > > >>> diff --git a/lib/librte_mbuf/rte_mbuf.h
> > > > b/lib/librte_mbuf/rte_mbuf.h
> > > > > > > > > > >>> index a1414ed7cd..ef5800c8ef 100644
> > > > > > > > > > >>> --- a/lib/librte_mbuf/rte_mbuf.h
> > > > > > > > > > >>> +++ b/lib/librte_mbuf/rte_mbuf.h
> > > > > > > > > > >>> @@ -1329,10 +1329,8 @@
> > rte_pktmbuf_prefree_seg(struct
> > > > rte_mbuf *m)
> > > > > > > > > > >>> return NULL;
> > > > > > > > > > >>> }
> > > > > > > > > > >>>
> > > > > > > > > > >>> - if (m->next != NULL) {
> > > > > > > > > > >>> - m->next = NULL;
> > > > > > > > > > >>> - m->nb_segs = 1;
> > > > > > > > > > >>> - }
> > > > > > > > > > >>> + m->next = NULL;
> > > > > > > > > > >>> + m->nb_segs = 1;
> > > > > > > > > > >>>
> > > > > > > > > > >>> return m;
> > > > > > > > > > >>>
> > > > > > > > > > >>> @@ -1346,10 +1344,8 @@
> > rte_pktmbuf_prefree_seg(struct
> > > > rte_mbuf *m)
> > > > > > > > > > >>> return NULL;
> > > > > > > > > > >>> }
> > > > > > > > > > >>>
> > > > > > > > > > >>> - if (m->next != NULL) {
> > > > > > > > > > >>> - m->next = NULL;
> > > > > > > > > > >>> - m->nb_segs = 1;
> > > > > > > > > > >>> - }
> > > > > > > > > > >>> + m->next = NULL;
> > > > > > > > > > >>> + m->nb_segs = 1;
> > > > > > > > > > >>> rte_mbuf_refcnt_set(m, 1);
> > > > > > > > > > >>>
> > > > > > > > > > >>> return m;
> > > > > > > > > > >>> --
> > > > > > > > > > >>> 2.25.1
> > > > > > > > > > >>
> > > > > > > > > >
>
next prev parent reply other threads:[~2020-11-06 8:20 UTC|newest]
Thread overview: 74+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-11-04 17:00 Olivier Matz
2020-11-05 0:15 ` Ananyev, Konstantin
2020-11-05 7:46 ` Olivier Matz
2020-11-05 8:26 ` Andrew Rybchenko
2020-11-05 9:10 ` Olivier Matz
2020-11-05 11:34 ` Ananyev, Konstantin
2020-11-05 12:31 ` Olivier Matz
2020-11-05 13:14 ` Ananyev, Konstantin
2020-11-05 13:24 ` Olivier Matz
2020-11-05 13:55 ` Ananyev, Konstantin
2020-11-05 16:30 ` Morten Brørup
2020-11-05 23:55 ` Ananyev, Konstantin
2020-11-06 7:52 ` Morten Brørup
2020-11-06 8:20 ` Olivier Matz [this message]
2020-11-06 8:50 ` Morten Brørup
2020-11-06 10:04 ` Olivier Matz
2020-11-06 10:07 ` Morten Brørup
2020-11-06 11:53 ` Ananyev, Konstantin
2020-11-06 12:23 ` Morten Brørup
2020-11-08 14:16 ` Andrew Rybchenko
2020-11-08 14:19 ` Ananyev, Konstantin
2020-11-10 16:26 ` Olivier Matz
2020-11-05 8:33 ` Morten Brørup
2020-11-05 9:03 ` Olivier Matz
2020-11-05 9:09 ` Andrew Rybchenko
2020-11-08 7:25 ` Ali Alnubani
2020-12-18 12:52 ` [dpdk-dev] [PATCH v2] " Olivier Matz
2020-12-18 13:18 ` Morten Brørup
2020-12-18 23:33 ` Ajit Khaparde
2021-01-06 13:33 ` [dpdk-dev] [PATCH v3] " Olivier Matz
2021-01-10 9:28 ` Ali Alnubani
2021-01-11 13:14 ` Ananyev, Konstantin
2021-01-13 13:27 ` [dpdk-dev] [PATCH v4] " Olivier Matz
2021-01-15 13:59 ` [dpdk-dev] [dpdk-stable] " David Marchand
2021-01-15 18:39 ` Ali Alnubani
2021-01-18 17:52 ` Ali Alnubani
2021-01-19 8:32 ` Olivier Matz
2021-01-19 8:53 ` Morten Brørup
2021-01-19 12:00 ` Ferruh Yigit
2021-01-19 12:27 ` Morten Brørup
2021-01-19 14:03 ` Ferruh Yigit
2021-01-19 14:21 ` Morten Brørup
2021-01-21 9:15 ` Ferruh Yigit
2021-01-19 14:04 ` Slava Ovsiienko
2021-07-24 8:47 ` Thomas Monjalon
2021-07-30 12:36 ` Olivier Matz
2021-07-30 14:35 ` Morten Brørup
2021-07-30 14:54 ` Thomas Monjalon
2021-07-30 15:14 ` Olivier Matz
2021-07-30 15:23 ` Morten Brørup
2021-08-04 13:29 ` [dpdk-dev] [PATCH] doc: add known issue with mbuf segment Thomas Monjalon
2021-08-04 14:25 ` Ajit Khaparde
2021-08-05 6:08 ` Morten Brørup
2021-08-06 14:21 ` Thomas Monjalon
2021-08-06 14:24 ` Morten Brørup
2021-09-28 8:28 ` [dpdk-dev] [dpdk-stable] [PATCH v4] mbuf: fix reset on mbuf free Thomas Monjalon
2021-09-28 9:00 ` Slava Ovsiienko
2021-09-28 9:25 ` Ananyev, Konstantin
2021-09-28 9:39 ` Morten Brørup
2021-09-29 8:03 ` Ali Alnubani
2021-09-29 21:39 ` Olivier Matz
2021-09-30 13:29 ` Ali Alnubani
2021-10-21 8:26 ` Thomas Monjalon
2021-01-21 9:19 ` Ferruh Yigit
2021-01-21 9:29 ` Morten Brørup
2021-01-21 16:35 ` [dpdk-dev] [dpdklab] " Lincoln Lavoie
2021-01-23 8:57 ` Morten Brørup
2021-01-25 17:00 ` Brandon Lo
2021-01-25 18:42 ` Ferruh Yigit
2021-06-15 13:56 ` [dpdk-dev] " Morten Brørup
2021-09-29 21:37 ` [dpdk-dev] [PATCH v5] " Olivier Matz
2021-09-30 13:27 ` Ali Alnubani
2021-10-21 9:18 ` David Marchand
2022-07-28 14:06 ` CI performance test results might be misleading Morten Brørup
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20201106082053.GZ1898@platinum \
--to=olivier.matz@6wind.com \
--cc=andrew.rybchenko@oktetlabs.ru \
--cc=dev@dpdk.org \
--cc=konstantin.ananyev@intel.com \
--cc=mb@smartsharesystems.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).