* [dpdk-dev] [PATCH] mbuf: optimize memory loads during mbuf freeing
From: Alexander Kozyrev @ 2020-03-16 18:31 UTC
To: dev; +Cc: olivier.matz, viacheslavo, matan, thomas, stable

Introduction of pinned external buffers doubled memory loads in the
rte_pktmbuf_prefree_seg() function. Analysis of the generated assembly
code shows unnecessary load of the pool field of the rte_mbuf structure.
Here is the snippet of the assembly for "if (!RTE_MBUF_DIRECT(m))":
Before the change the code was:
    movq 0x18(%rbx), %rax // load the ol_flags field
    test %r13, %rax       // check if ol_flags equals to 0x60...0
    jz 0x9a8718 <Block 2> // jump out to "if (m->next != NULL)"
After the change the code becomed:
    movq 0x18(%rbx), %rax  // load ol_flags
    test %r14, %rax        // check if ol_flags equals to 0x60...0
    jnz 0x9bea38 <Block 2> // jump in to "if (!RTE_MBUF_HAS_EXTBUF(m)"
    movq 0x48(%rbx), %rax  // load the pool field
    jmp 0x9bea78 <Block 7> // jump out to "if (m->next != NULL)"
Look like this absolutely unneeded memory load of the pool field is an
optimization for the external buffer case in GCC (4.8.5), since Clang
generates the same assembly for both before and after the chenge versions.
Plus, GCC favors the extrnal buffer case over the simple case.
This assembly code layout causes the performance degradation because the
rte_pktmbuf_prefree_seg() function is a part of a very hot path.
Workaround this compilation issue by moving the check for pinned buffer
apart from the check for external buffer and restore the initial code
flow that favors the direct mbuf case over the external one.

Fixes: 6ef1107ad4c6 ("mbuf: detach mbuf with pinned external buffer")
Cc: stable@dpdk.org

Signed-off-by: Alexander Kozyrev <akozyrev@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 lib/librte_mbuf/rte_mbuf.h | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 34679e0..ab9d3f5 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1335,10 +1335,9 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
 		if (!RTE_MBUF_DIRECT(m)) {
-			if (!RTE_MBUF_HAS_EXTBUF(m) ||
-			    !RTE_MBUF_HAS_PINNED_EXTBUF(m))
-				rte_pktmbuf_detach(m);
-			else if (__rte_pktmbuf_pinned_extbuf_decref(m))
+			rte_pktmbuf_detach(m);
+			if (RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+			    __rte_pktmbuf_pinned_extbuf_decref(m))
 				return NULL;
 		}
 
@@ -1352,10 +1351,9 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
 		if (!RTE_MBUF_DIRECT(m)) {
-			if (!RTE_MBUF_HAS_EXTBUF(m) ||
-			    !RTE_MBUF_HAS_PINNED_EXTBUF(m))
-				rte_pktmbuf_detach(m);
-			else if (__rte_pktmbuf_pinned_extbuf_decref(m))
+			rte_pktmbuf_detach(m);
+			if (RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+			    __rte_pktmbuf_pinned_extbuf_decref(m))
 				return NULL;
 		}
 
-- 
1.8.3.1
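For readers following the assembly in the patch description, the 0x60...0 constant in the comments is consistent with a mask made of two high bits of ol_flags. The following is only a minimal sketch under that assumption: the TOY_ names and bit positions are illustrative stand-ins chosen to match the constant, not identifiers taken from the patch. Note that a `test` instruction performs a bitwise AND, so the branch appears to check whether either bit is set rather than comparing for equality.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, chosen so that the mask matches 0x60...0. */
#define TOY_EXT_ATTACHED (1ULL << 61) /* external buffer attached */
#define TOY_IND_ATTACHED (1ULL << 62) /* indirect mbuf */
#define TOY_NOT_DIRECT_MASK (TOY_EXT_ATTACHED | TOY_IND_ATTACHED)

/*
 * Models the movq/test/jz sequence: AND the flags with the mask and
 * treat a zero result as "direct mbuf".
 */
static bool toy_is_direct(uint64_t ol_flags)
{
	return (ol_flags & TOY_NOT_DIRECT_MASK) == 0;
}
```

In this model any unrelated flag bit leaves the mbuf "direct"; only the two attach bits take the slower branch.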
* Re: [dpdk-dev] [PATCH] mbuf: optimize memory loads during mbuf freeing
From: Olivier Matz @ 2020-03-19 9:30 UTC
To: Alexander Kozyrev; +Cc: dev, viacheslavo, matan, thomas, stable

Hi,

On Mon, Mar 16, 2020 at 06:31:40PM +0000, Alexander Kozyrev wrote:
> Introduction of pinned external buffers doubled memory loads in the
> rte_pktmbuf_prefree_seg() function. Analysis of the generated assembly
> code shows unnecessary load of the pool field of the rte_mbuf structure.
> Here is the snippet of the assembly for "if (!RTE_MBUF_DIRECT(m))":
> Before the change the code was:
>     movq 0x18(%rbx), %rax // load the ol_flags field
>     test %r13, %rax       // check if ol_flags equals to 0x60...0
>     jz 0x9a8718 <Block 2> // jump out to "if (m->next != NULL)"
> After the change the code becomed:
>     movq 0x18(%rbx), %rax  // load ol_flags
>     test %r14, %rax        // check if ol_flags equals to 0x60...0
>     jnz 0x9bea38 <Block 2> // jump in to "if (!RTE_MBUF_HAS_EXTBUF(m)"
>     movq 0x48(%rbx), %rax  // load the pool field
>     jmp 0x9bea78 <Block 7> // jump out to "if (m->next != NULL)"
> Look like this absolutely unneeded memory load of the pool field is an
> optimization for the external buffer case in GCC (4.8.5), since Clang
> generates the same assembly for both before and after the chenge versions.
> Plus, GCC favors the extrnal buffer case over the simple case.
> This assembly code layout causes the performance degradation because the
> rte_pktmbuf_prefree_seg() function is a part of a very hot path.
> Workaround this compilation issue by moving the check for pinned buffer
> apart from the check for external buffer and restore the initial code
> flow that favors the direct mbuf case over the external one.
>
> Fixes: 6ef1107ad4c6 ("mbuf: detach mbuf with pinned external buffer")
> Cc: stable@dpdk.org
>
> Signed-off-by: Alexander Kozyrev <akozyrev@mellanox.com>
> Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> ---
>  lib/librte_mbuf/rte_mbuf.h | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 34679e0..ab9d3f5 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -1335,10 +1335,9 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
>  	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
>
>  		if (!RTE_MBUF_DIRECT(m)) {
> -			if (!RTE_MBUF_HAS_EXTBUF(m) ||
> -			    !RTE_MBUF_HAS_PINNED_EXTBUF(m))
> -				rte_pktmbuf_detach(m);
> -			else if (__rte_pktmbuf_pinned_extbuf_decref(m))
> +			rte_pktmbuf_detach(m);
> +			if (RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> +			    __rte_pktmbuf_pinned_extbuf_decref(m))
>  				return NULL;
>  		}
>

[...]

Reading the previous code again, it was correct but not easy to
understand, especially the:

	if (!RTE_MBUF_HAS_EXTBUF(m) ||
	    !RTE_MBUF_HAS_PINNED_EXTBUF(m))

Knowing we already checked it is not a direct mbuf, it is equivalent to:

	if (!RTE_MBUF_HAS_PINNED_EXTBUF(m))

I think the objective was to avoid an access to the pool flags if
not necessary.

Completely removing the test as you did is also functionally OK,
because rte_pktmbuf_detach() also does the check, and the code is
even clearer.

I wonder however if doing this wouldn't avoid an access to the pool
flags for mbufs which have the IND_ATTACHED flags:

	if (!RTE_MBUF_DIRECT(m)) {
		rte_pktmbuf_detach(m);
		if (RTE_MBUF_HAS_EXTBUF(m) &&
		    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
		    __rte_pktmbuf_pinned_extbuf_decref(m))
			return NULL;
	}

What do you think?

Nit: if you wish to send a v2, there are few english fixes that could
be done (becomed, chenge, extrnal)

Thanks
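The point about avoiding a pool flags access can be sketched with a toy model that counts how often the pinned flag is read. This is an illustration under stated assumptions, not DPDK code: the toy_ names, the pool_reads counter, and the single pool_pinned boolean standing in for the pool's private flags are all hypothetical. It also assumes that an mbuf from a pinned pool always carries the external-buffer flag, which is what makes the two checks agree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_EXT_ATTACHED (1ULL << 61) /* assumed bit position */
#define TOY_IND_ATTACHED (1ULL << 62) /* assumed bit position */

struct toy_mbuf {
	uint64_t ol_flags;
	bool pool_pinned;        /* stand-in for the pool's private flags */
	unsigned int pool_reads; /* counts reads of pool_pinned */
};

static bool toy_has_extbuf(const struct toy_mbuf *m)
{
	return (m->ol_flags & TOY_EXT_ATTACHED) != 0;
}

static bool toy_has_pinned(struct toy_mbuf *m)
{
	m->pool_reads++; /* each call models one access to the pool flags */
	return m->pool_pinned;
}

/* Check as written in v1 of the patch: always consults the pool. */
static bool toy_check_v1(struct toy_mbuf *m)
{
	return toy_has_pinned(m);
}

/* Check as suggested in the review: ol_flags short-circuits first. */
static bool toy_check_suggested(struct toy_mbuf *m)
{
	return toy_has_extbuf(m) && toy_has_pinned(m);
}
```

For an mbuf without the external-buffer flag, the suggested form returns the same answer in this model while leaving pool_reads at zero; for a pinned external buffer both forms read the pool once.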
* Re: [dpdk-dev] [PATCH] mbuf: optimize memory loads during mbuf freeing
From: Alexander Kozyrev @ 2020-03-20 15:35 UTC
To: Olivier Matz; +Cc: dev, Slava Ovsiienko, Matan Azrad, Thomas Monjalon, stable

You are right, Olivier, thanks for your suggestion - it looks even better.
I've tested this version and the performance is great - will send a v2 shortly.

Regards,
Alex

> -----Original Message-----
> From: Olivier Matz <olivier.matz@6wind.com>
> Sent: Thursday, March 19, 2020 5:30
> To: Alexander Kozyrev <akozyrev@mellanox.com>
> Cc: dev@dpdk.org; Slava Ovsiienko <viacheslavo@mellanox.com>; Matan
> Azrad <matan@mellanox.com>; Thomas Monjalon
> <thomas@monjalon.net>; stable@dpdk.org
> Subject: Re: [PATCH] mbuf: optimize memory loads during mbuf freeing
>
> [...]
* [dpdk-dev] [PATCH v2] mbuf: optimize memory loads during mbuf freeing
From: Alexander Kozyrev @ 2020-03-20 15:55 UTC
To: dev; +Cc: olivier.matz, viacheslavo, matan, thomas, stable

Introduction of pinned external buffers doubled memory loads in the
rte_pktmbuf_prefree_seg() function. Analysis of the generated assembly
code shows an unnecessary load of the pool field of the rte_mbuf structure.
Here is the snippet of the assembly for "if (!RTE_MBUF_DIRECT(m))":
Before the change the code was:
    movq 0x18(%rbx), %rax // load the ol_flags field
    test %r13, %rax       // check ol_flags against the 0x60...0 mask
    jz 0x9a8718 <Block 2> // jump out to "if (m->next != NULL)"
After the change the code became:
    movq 0x18(%rbx), %rax  // load ol_flags
    test %r14, %rax        // check ol_flags against the 0x60...0 mask
    jnz 0x9bea38 <Block 2> // jump in to "if (!RTE_MBUF_HAS_EXTBUF(m))"
    movq 0x48(%rbx), %rax  // load the pool field
    jmp 0x9bea78 <Block 7> // jump out to "if (m->next != NULL)"
It looks like this absolutely unneeded memory load of the pool field is
an optimization for the external buffer case in GCC (4.8.5), since Clang
generates the same assembly for both the before and after versions of
the change. Plus, GCC favors the external buffer case over the simple case.
This assembly code layout causes a performance degradation because the
rte_pktmbuf_prefree_seg() function is part of a very hot path.
Work around this compilation issue by moving the check for pinned buffer
apart from the check for external buffer and restore the initial code
flow that favors the direct mbuf case over the external one.

Fixes: 6ef1107ad4c6 ("mbuf: detach mbuf with pinned external buffer")
Cc: stable@dpdk.org

Signed-off-by: Alexander Kozyrev <akozyrev@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 lib/librte_mbuf/rte_mbuf.h | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 34679e0..f8e492e 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1335,10 +1335,10 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
 		if (!RTE_MBUF_DIRECT(m)) {
-			if (!RTE_MBUF_HAS_EXTBUF(m) ||
-			    !RTE_MBUF_HAS_PINNED_EXTBUF(m))
-				rte_pktmbuf_detach(m);
-			else if (__rte_pktmbuf_pinned_extbuf_decref(m))
+			rte_pktmbuf_detach(m);
+			if (RTE_MBUF_HAS_EXTBUF(m) &&
+			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+			    __rte_pktmbuf_pinned_extbuf_decref(m))
 				return NULL;
 		}
 
@@ -1352,10 +1352,10 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
 		if (!RTE_MBUF_DIRECT(m)) {
-			if (!RTE_MBUF_HAS_EXTBUF(m) ||
-			    !RTE_MBUF_HAS_PINNED_EXTBUF(m))
-				rte_pktmbuf_detach(m);
-			else if (__rte_pktmbuf_pinned_extbuf_decref(m))
+			rte_pktmbuf_detach(m);
+			if (RTE_MBUF_HAS_EXTBUF(m) &&
+			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+			    __rte_pktmbuf_pinned_extbuf_decref(m))
 				return NULL;
 		}
 
-- 
1.8.3.1
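The claim that the v2 flow is functionally the same as the pre-patch flow can be checked exhaustively on a small model. What follows is a sketch under stated assumptions rather than the library's implementation: every toy_ name is hypothetical, still_shared stands in for a non-zero result from the pinned decref helper, and toy_detach() assumes, as the review notes, that detach leaves a pinned external buffer attached and otherwise clears the attach flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_EXT_ATTACHED (1ULL << 61) /* assumed bit position */
#define TOY_IND_ATTACHED (1ULL << 62) /* assumed bit position */

struct toy_mbuf {
	uint64_t ol_flags;
	bool pool_pinned;  /* stand-in for the pinned-pool check */
	bool still_shared; /* stand-in for a non-zero pinned decref result */
};

static bool toy_direct(const struct toy_mbuf *m)
{
	return !(m->ol_flags & (TOY_EXT_ATTACHED | TOY_IND_ATTACHED));
}

static bool toy_pinned_ext(const struct toy_mbuf *m)
{
	return (m->ol_flags & TOY_EXT_ATTACHED) && m->pool_pinned;
}

/* Assumed detach behavior: pinned buffers stay attached, the rest are
 * detached and have their attach flags cleared. */
static void toy_detach(struct toy_mbuf *m)
{
	if (toy_pinned_ext(m))
		return;
	m->ol_flags = 0;
}

/* Pre-patch flow; returning true models the "return NULL" path. */
static bool toy_flow_old(struct toy_mbuf *m)
{
	if (!toy_direct(m)) {
		if (!(m->ol_flags & TOY_EXT_ATTACHED) || !m->pool_pinned)
			toy_detach(m);
		else if (m->still_shared)
			return true;
	}
	return false;
}

/* v2 flow: detach unconditionally, then check for a pinned buffer. */
static bool toy_flow_v2(struct toy_mbuf *m)
{
	if (!toy_direct(m)) {
		toy_detach(m);
		if (toy_pinned_ext(m) && m->still_shared)
			return true;
	}
	return false;
}

/* Returns true if both flows agree on every modeled input state. */
static bool toy_flows_agree(void)
{
	const uint64_t flags[] = { 0, TOY_EXT_ATTACHED, TOY_IND_ATTACHED };

	for (size_t f = 0; f < sizeof(flags) / sizeof(flags[0]); f++)
		for (int pinned = 0; pinned <= 1; pinned++)
			for (int shared = 0; shared <= 1; shared++) {
				struct toy_mbuf a = { flags[f], pinned != 0,
						      shared != 0 };
				struct toy_mbuf b = a;

				if (toy_flow_old(&a) != toy_flow_v2(&b) ||
				    a.ol_flags != b.ol_flags)
					return false;
			}
	return true;
}
```

Calling toy_flows_agree() compares both the return path and the resulting flags for each of the twelve modeled states. The model says nothing about generated code or performance, which is the actual motivation of the patch.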
* Re: [dpdk-dev] [PATCH v2] mbuf: optimize memory loads during mbuf freeing
From: Olivier Matz @ 2020-03-27 8:13 UTC
To: Alexander Kozyrev; +Cc: dev, viacheslavo, matan, thomas, stable

Hi,

On Fri, Mar 20, 2020 at 03:55:15PM +0000, Alexander Kozyrev wrote:
> [...]
>
> Fixes: 6ef1107ad4c6 ("mbuf: detach mbuf with pinned external buffer")
> Cc: stable@dpdk.org
>
> Signed-off-by: Alexander Kozyrev <akozyrev@mellanox.com>
> Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>

Acked-by: Olivier Matz <olivier.matz@6wind.com>

Thanks!
* Re: [dpdk-dev] [dpdk-stable] [PATCH v2] mbuf: optimize memory loads during mbuf freeing
From: Thomas Monjalon @ 2020-03-31 1:46 UTC
To: Alexander Kozyrev; +Cc: stable, dev, viacheslavo, matan, stable, Olivier Matz

27/03/2020 09:13, Olivier Matz:
> On Fri, Mar 20, 2020 at 03:55:15PM +0000, Alexander Kozyrev wrote:
> > [...]
> >
> > Fixes: 6ef1107ad4c6 ("mbuf: detach mbuf with pinned external buffer")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Alexander Kozyrev <akozyrev@mellanox.com>
> > Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
>
> Acked-by: Olivier Matz <olivier.matz@6wind.com>
>
> Thanks!

Applied, thanks