DPDK patches and discussions
* rte_memcpy alignment
@ 2022-01-14  8:56 Morten Brørup
  2022-01-14  9:11 ` Bruce Richardson
  0 siblings, 1 reply; 8+ messages in thread
From: Morten Brørup @ 2022-01-14  8:56 UTC (permalink / raw)
  To: Jan Viktorin, Ruifeng Wang, David Christensen, Bruce Richardson,
	Konstantin Ananyev
  Cc: dev

Dear ARM/POWER/x86 maintainers,

The architecture specific rte_memcpy() provides optimized variants to copy aligned data. However, the alignment requirements depend on the hardware architecture, and there is no common definition for the alignment.

DPDK provides __rte_cache_aligned for cache optimization purposes, with architecture specific values. Would you consider providing an __rte_memcpy_aligned for rte_memcpy() optimization purposes?

Or should I just use __rte_cache_aligned, although it is overkill?


Specifically, I am working on a mempool optimization where the objs field in the rte_mempool_cache structure may benefit from being aligned for optimized rte_memcpy().
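
To illustrate, a minimal sketch of the kind of change I have in mind. The
field layout is simplified here, and cache-line alignment is the "overkill"
option mentioned above; __rte_aligned() and RTE_CACHE_LINE_SIZE are the
existing DPDK macros:

struct rte_mempool_cache {
        uint32_t size;        /* size of the cache */
        uint32_t flushthresh; /* threshold before flushing excess objects */
        uint32_t len;         /* current number of cached objects */
        /* force the array itself onto an "optimal" boundary for rte_memcpy() */
        void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]
                __rte_aligned(RTE_CACHE_LINE_SIZE);
} __rte_cache_aligned;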


Med venlig hilsen / Kind regards,
-Morten Brørup



* Re: rte_memcpy alignment
  2022-01-14  8:56 rte_memcpy alignment Morten Brørup
@ 2022-01-14  9:11 ` Bruce Richardson
  2022-01-14  9:53   ` Morten Brørup
  0 siblings, 1 reply; 8+ messages in thread
From: Bruce Richardson @ 2022-01-14  9:11 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Konstantin Ananyev, dev

On Fri, Jan 14, 2022 at 09:56:50AM +0100, Morten Brørup wrote:
> Dear ARM/POWER/x86 maintainers,
> 
> The architecture specific rte_memcpy() provides optimized variants to copy aligned data. However, the alignment requirements depend on the hardware architecture, and there is no common definition for the alignment.
> 
> DPDK provides __rte_cache_aligned for cache optimization purposes, with architecture specific values. Would you consider providing an __rte_memcpy_aligned for rte_memcpy() optimization purposes?
> 
> Or should I just use __rte_cache_aligned, although it is overkill?
> 
> 
> Specifically, I am working on a mempool optimization where the objs field in the rte_mempool_cache structure may benefit by being aligned for optimized rte_memcpy().
>
For me the difficulty with such a memcpy proposal - apart from probably
adding to the amount of memcpy code we have to maintain - is the specific
meaning of "aligned" in the memcpy case. Unlike for a struct definition, the
possible meanings of "aligned" in memcpy include:
* the source address is aligned
* the destination address is aligned
* both source and destination are aligned
* both source and destination are aligned and the copy length is a multiple
  of the alignment length
* the data is aligned to a cacheline boundary
* the data is aligned to the largest load-store size for the system
* the data is aligned to the boundary suitable for the copy size, e.g.
  memcpy of 8 bytes is 8-byte aligned etc.
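
To make the distinction concrete, a small sketch (not a proposed API) of the
strictest of those contracts, with ALIGN as a placeholder for whatever
per-architecture constant would be chosen:

#include <stddef.h>
#include <stdint.h>
#include <rte_debug.h>
#include <rte_memcpy.h>

/* both addresses aligned AND the copy length a multiple of the alignment */
static inline void
copy_fully_aligned(void *dst, const void *src, size_t len)
{
        RTE_ASSERT(((uintptr_t)dst & (ALIGN - 1)) == 0);
        RTE_ASSERT(((uintptr_t)src & (ALIGN - 1)) == 0);
        RTE_ASSERT((len & (ALIGN - 1)) == 0);
        rte_memcpy(dst, src, len);
}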

Can you clarify a bit more on your own thinking here? Personally, I am a
little dubious of the benefit of general memcpy optimization, but I do
believe that for specific usecases there is value in having their own copy
operations which include constraints for that specific usecase. For
example, in the AVX-512 ice/i40e PMD code, we fold the memcpy from the
mempool cache into the descriptor rearm function because we know we can
always do 64-byte loads and stores, and also because we know that for each
load in the copy, we can reuse the data just after storing it (giving good
perf boost). Perhaps something similar could work for you in your mempool
optimization.
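
(Very rough sketch of that pattern, purely illustrative and not the actual
ice/i40e code: one 64-byte load pulls 8 object pointers out of the mempool
cache, the store refills the SW ring, and the loaded value can then be
reused while still in a register to build the descriptor addresses.)

#include <immintrin.h>
#include <rte_mbuf.h>

static inline void
rearm_copy_8_ptrs(void * const *cache_objs, struct rte_mbuf **rxep)
{
        __m512i ptrs = _mm512_loadu_si512((const void *)cache_objs);
        _mm512_storeu_si512((void *)rxep, ptrs);        /* refill SW ring */
        /* ... 'ptrs' would be reused here to compute the buffer addresses
         * written into the HW descriptors ... */
}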

/Bruce


* RE: rte_memcpy alignment
  2022-01-14  9:11 ` Bruce Richardson
@ 2022-01-14  9:53   ` Morten Brørup
  2022-01-14 10:22     ` Bruce Richardson
  2022-01-14 10:54     ` Ananyev, Konstantin
  0 siblings, 2 replies; 8+ messages in thread
From: Morten Brørup @ 2022-01-14  9:53 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Konstantin Ananyev, dev

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 14 January 2022 10.11
> 
> On Fri, Jan 14, 2022 at 09:56:50AM +0100, Morten Brørup wrote:
> > Dear ARM/POWER/x86 maintainers,
> >
> > The architecture specific rte_memcpy() provides optimized variants to
> copy aligned data. However, the alignment requirements depend on the
> hardware architecture, and there is no common definition for the
> alignment.
> >
> > DPDK provides __rte_cache_aligned for cache optimization purposes,
> with architecture specific values. Would you consider providing an
> __rte_memcpy_aligned for rte_memcpy() optimization purposes?
> >
> > Or should I just use __rte_cache_aligned, although it is overkill?
> >
> >
> > Specifically, I am working on a mempool optimization where the objs
> field in the rte_mempool_cache structure may benefit by being aligned
> for optimized rte_memcpy().
> >
> For me the difficulty with such a memcpy proposal - apart from probably
> adding to the amount of memcpy code we have to maintain - is the
> specific meaning
> of what "aligned" in the memcpy case. Unlike for a struct definition,
> the
> possible meaning of aligned in memcpy could be:
> * the source address is aligned
> * the destination address is aligned
> * both source and destination is aligned
> * both source and destination are aligned and the copy length is a
> multiple
>   of the alignment length
> * the data is aligned to a cacheline boundary
> * the data is aligned to the largest load-store size for system
> * the data is aligned to the boundary suitable for the copy size, e.g.
>   memcpy of 8 bytes is 8-byte aligned etc.
> 
> Can you clarify a bit more on your own thinking here? Personally, I am
> a
> little dubious of the benefit of general memcpy optimization, but I do
> believe that for specific usecases there is value is having their own
> copy
> operations which include constraints for that specific usecase. For
> example, in the AVX-512 ice/i40e PMD code, we fold the memcpy from the
> mempool cache into the descriptor rearm function because we know we can
> always do 64-byte loads and stores, and also because we know that for
> each
> load in the copy, we can reuse the data just after storing it (giving
> good
> perf boost). Perhaps something similar could work for you in your
> mempool
> optimization.
> 
> /Bruce

I'm going to copy an array of pointers, specifically the 'objs' array in the rte_mempool_cache structure.

The 'objs' array starts at byte 24, which is only 8-byte aligned. So it always fails the ALIGNMENT_MASK test in the x86-specific rte_memcpy(), and thus can never use the optimized rte_memcpy_aligned() function to copy the array, but falls back to the rte_memcpy_generic() function.

If the 'objs' array was optimally aligned, and the other array that is being copied to/from is also optimally aligned, rte_memcpy() would use the optimized rte_memcpy_aligned() function.

Please also note that the value of ALIGNMENT_MASK depends on which vector instruction set DPDK is being compiled with.
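
For reference, the dispatch in the x86 rte_memcpy.h looks roughly like this,
with ALIGNMENT_MASK being 15, 31 or 63 for SSE, AVX2 and AVX-512 builds
respectively:

static __rte_always_inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
        if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
                return rte_memcpy_aligned(dst, src, n);
        else
                return rte_memcpy_generic(dst, src, n);
}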

The other CPU architectures have similar stuff in their rte_memcpy() implementations, and their alignment requirements are also different.

Please also note that rte_memcpy() becomes even more optimized when the size of the memcpy() operation is known at compile time.

So I am asking for a public #define __rte_memcpy_aligned I can use to meet the alignment requirements for optimal rte_memcpy().
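
Something along these lines (hypothetical, not existing DPDK API; the value
would be architecture specific, with 64 shown as the most conservative x86
case):

/* hypothetical - one definition per architecture, like RTE_CACHE_LINE_SIZE */
#define RTE_MEMCPY_ALIGNMENT    64
#define __rte_memcpy_aligned    __rte_aligned(RTE_MEMCPY_ALIGNMENT)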



* Re: rte_memcpy alignment
  2022-01-14  9:53   ` Morten Brørup
@ 2022-01-14 10:22     ` Bruce Richardson
  2022-01-14 10:54     ` Ananyev, Konstantin
  1 sibling, 0 replies; 8+ messages in thread
From: Bruce Richardson @ 2022-01-14 10:22 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Konstantin Ananyev, dev

On Fri, Jan 14, 2022 at 10:53:54AM +0100, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Friday, 14 January 2022 10.11
> > 
> > On Fri, Jan 14, 2022 at 09:56:50AM +0100, Morten Brørup wrote:
> > > Dear ARM/POWER/x86 maintainers,
> > >
> > > The architecture specific rte_memcpy() provides optimized variants to
> > copy aligned data. However, the alignment requirements depend on the
> > hardware architecture, and there is no common definition for the
> > alignment.
> > >
> > > DPDK provides __rte_cache_aligned for cache optimization purposes,
> > with architecture specific values. Would you consider providing an
> > __rte_memcpy_aligned for rte_memcpy() optimization purposes?
> > >
> > > Or should I just use __rte_cache_aligned, although it is overkill?
> > >
> > >
> > > Specifically, I am working on a mempool optimization where the objs
> > field in the rte_mempool_cache structure may benefit by being aligned
> > for optimized rte_memcpy().
> > >
> > For me the difficulty with such a memcpy proposal - apart from probably
> > adding to the amount of memcpy code we have to maintain - is the
> > specific meaning
> > of what "aligned" in the memcpy case. Unlike for a struct definition,
> > the
> > possible meaning of aligned in memcpy could be:
> > * the source address is aligned
> > * the destination address is aligned
> > * both source and destination is aligned
> > * both source and destination are aligned and the copy length is a
> > multiple
> >   of the alignment length
> > * the data is aligned to a cacheline boundary
> > * the data is aligned to the largest load-store size for system
> > * the data is aligned to the boundary suitable for the copy size, e.g.
> >   memcpy of 8 bytes is 8-byte aligned etc.
> > 
> > Can you clarify a bit more on your own thinking here? Personally, I am
> > a
> > little dubious of the benefit of general memcpy optimization, but I do
> > believe that for specific usecases there is value is having their own
> > copy
> > operations which include constraints for that specific usecase. For
> > example, in the AVX-512 ice/i40e PMD code, we fold the memcpy from the
> > mempool cache into the descriptor rearm function because we know we can
> > always do 64-byte loads and stores, and also because we know that for
> > each
> > load in the copy, we can reuse the data just after storing it (giving
> > good
> > perf boost). Perhaps something similar could work for you in your
> > mempool
> > optimization.
> > 
> > /Bruce
> 
> I'm going to copy array of pointers, specifically the 'objs' array in the rte_mempool_cache structure.
> 
> The 'objs' array starts at byte 24, which is only 8 byte aligned. So it always fails the ALIGNMENT_MASK test in the x86 specific rte_memcpy(), and thus cannot ever use the optimized rte_memcpy_aligned() function to copy the array, but will use the rte_memcpy_generic() function.
> 
> If the 'objs' array was optimally aligned, and the other array that is being copied to/from is also optimally aligned, rte_memcpy() would use the optimized rte_memcpy_aligned() function.
> 
> Please also note that the value of ALIGNMENT_MASK depends on which vector instruction set DPDK is being compiled with.
> 
> The other CPU architectures have similar stuff in their rte_memcpy() implementations, and their alignment requirements are also different.
> 
> Please also note that rte_memcpy() becomes even more optimized when the size of the memcpy() operation is known at compile time.
> 
> So I am asking for a public #define __rte_memcpy_aligned I can use to meet the alignment requirements for optimal rte_memcpy().
>

Thanks for that, I misunderstood your original ask. Things are clearer now,
and it seems reasonable.


* RE: rte_memcpy alignment
  2022-01-14  9:53   ` Morten Brørup
  2022-01-14 10:22     ` Bruce Richardson
@ 2022-01-14 10:54     ` Ananyev, Konstantin
  2022-01-14 11:05       ` Morten Brørup
  1 sibling, 1 reply; 8+ messages in thread
From: Ananyev, Konstantin @ 2022-01-14 10:54 UTC (permalink / raw)
  To: Morten Brørup, Richardson,  Bruce
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, dev



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Friday, January 14, 2022 9:54 AM
> To: Richardson, Bruce <bruce.richardson@intel.com>
> Cc: Jan Viktorin <viktorin@rehivetech.com>; Ruifeng Wang <ruifeng.wang@arm.com>; David Christensen <drc@linux.vnet.ibm.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org
> Subject: RE: rte_memcpy alignment
> 
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Friday, 14 January 2022 10.11
> >
> > On Fri, Jan 14, 2022 at 09:56:50AM +0100, Morten Brørup wrote:
> > > Dear ARM/POWER/x86 maintainers,
> > >
> > > The architecture specific rte_memcpy() provides optimized variants to
> > copy aligned data. However, the alignment requirements depend on the
> > hardware architecture, and there is no common definition for the
> > alignment.
> > >
> > > DPDK provides __rte_cache_aligned for cache optimization purposes,
> > with architecture specific values. Would you consider providing an
> > __rte_memcpy_aligned for rte_memcpy() optimization purposes?
> > >
> > > Or should I just use __rte_cache_aligned, although it is overkill?
> > >
> > >
> > > Specifically, I am working on a mempool optimization where the objs
> > field in the rte_mempool_cache structure may benefit by being aligned
> > for optimized rte_memcpy().
> > >
> > For me the difficulty with such a memcpy proposal - apart from probably
> > adding to the amount of memcpy code we have to maintain - is the
> > specific meaning
> > of what "aligned" in the memcpy case. Unlike for a struct definition,
> > the
> > possible meaning of aligned in memcpy could be:
> > * the source address is aligned
> > * the destination address is aligned
> > * both source and destination is aligned
> > * both source and destination are aligned and the copy length is a
> > multiple
> >   of the alignment length
> > * the data is aligned to a cacheline boundary
> > * the data is aligned to the largest load-store size for system
> > * the data is aligned to the boundary suitable for the copy size, e.g.
> >   memcpy of 8 bytes is 8-byte aligned etc.
> >
> > Can you clarify a bit more on your own thinking here? Personally, I am
> > a
> > little dubious of the benefit of general memcpy optimization, but I do
> > believe that for specific usecases there is value is having their own
> > copy
> > operations which include constraints for that specific usecase. For
> > example, in the AVX-512 ice/i40e PMD code, we fold the memcpy from the
> > mempool cache into the descriptor rearm function because we know we can
> > always do 64-byte loads and stores, and also because we know that for
> > each
> > load in the copy, we can reuse the data just after storing it (giving
> > good
> > perf boost). Perhaps something similar could work for you in your
> > mempool
> > optimization.
> >
> > /Bruce
> 
> I'm going to copy array of pointers, specifically the 'objs' array in the rte_mempool_cache structure.
> 
> The 'objs' array starts at byte 24, which is only 8 byte aligned. So it always fails the ALIGNMENT_MASK test in the x86 specific
> rte_memcpy(), and thus cannot ever use the optimized rte_memcpy_aligned() function to copy the array, but will use the
> rte_memcpy_generic() function.
> 
> If the 'objs' array was optimally aligned, and the other array that is being copied to/from is also optimally aligned, rte_memcpy() would use
> the optimized rte_memcpy_aligned() function.
> 
> Please also note that the value of ALIGNMENT_MASK depends on which vector instruction set DPDK is being compiled with.
> 
> The other CPU architectures have similar stuff in their rte_memcpy() implementations, and their alignment requirements are also different.
> 
> Please also note that rte_memcpy() becomes even more optimized when the size of the memcpy() operation is known at compile time.

If the size is known at compile time, rte_memcpy() is probably overkill - modern compilers usually generate fast enough code for such cases.
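
For instance (illustrative only), with a constant size a plain memcpy()
usually collapses into a handful of load/store instructions:

#include <stdint.h>
#include <string.h>

struct two_lines { uint64_t v[16]; };   /* 128 bytes, size known at compile time */

static inline void
copy_two_lines(struct two_lines *dst, const struct two_lines *src)
{
        memcpy(dst, src, sizeof(*dst));  /* typically inlined by the compiler */
}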

> 
> So I am asking for a public #define __rte_memcpy_aligned I can use to meet the alignment requirements for optimal rte_memcpy().

Even on x86, ALIGNMENT_MASK can have different values (15/31/63) depending on the ISA.
So probably 64 as the 'generic' value is the safest bet.
Though I wonder whether we really need such micro-optimizations here?
Would it make such a huge difference if you called rte_memcpy_aligned() instead of rte_memcpy()?


* RE: rte_memcpy alignment
  2022-01-14 10:54     ` Ananyev, Konstantin
@ 2022-01-14 11:05       ` Morten Brørup
  2022-01-14 11:51         ` Ananyev, Konstantin
  0 siblings, 1 reply; 8+ messages in thread
From: Morten Brørup @ 2022-01-14 11:05 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson,  Bruce
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, dev

> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> Sent: Friday, 14 January 2022 11.54
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Friday, January 14, 2022 9:54 AM
> >
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Friday, 14 January 2022 10.11
> > >
> > > On Fri, Jan 14, 2022 at 09:56:50AM +0100, Morten Brørup wrote:
> > > > Dear ARM/POWER/x86 maintainers,
> > > >
> > > > The architecture specific rte_memcpy() provides optimized
> variants to
> > > copy aligned data. However, the alignment requirements depend on
> the
> > > hardware architecture, and there is no common definition for the
> > > alignment.
> > > >
> > > > DPDK provides __rte_cache_aligned for cache optimization
> purposes,
> > > with architecture specific values. Would you consider providing an
> > > __rte_memcpy_aligned for rte_memcpy() optimization purposes?
> > > >
> > > > Or should I just use __rte_cache_aligned, although it is
> overkill?
> > > >
> > > >
> > > > Specifically, I am working on a mempool optimization where the
> objs
> > > field in the rte_mempool_cache structure may benefit by being
> aligned
> > > for optimized rte_memcpy().
> > > >
> > > For me the difficulty with such a memcpy proposal - apart from
> probably
> > > adding to the amount of memcpy code we have to maintain - is the
> > > specific meaning
> > > of what "aligned" in the memcpy case. Unlike for a struct
> definition,
> > > the
> > > possible meaning of aligned in memcpy could be:
> > > * the source address is aligned
> > > * the destination address is aligned
> > > * both source and destination is aligned
> > > * both source and destination are aligned and the copy length is a
> > > multiple
> > >   of the alignment length
> > > * the data is aligned to a cacheline boundary
> > > * the data is aligned to the largest load-store size for system
> > > * the data is aligned to the boundary suitable for the copy size,
> e.g.
> > >   memcpy of 8 bytes is 8-byte aligned etc.
> > >
> > > Can you clarify a bit more on your own thinking here? Personally, I
> am
> > > a
> > > little dubious of the benefit of general memcpy optimization, but I
> do
> > > believe that for specific usecases there is value is having their
> own
> > > copy
> > > operations which include constraints for that specific usecase. For
> > > example, in the AVX-512 ice/i40e PMD code, we fold the memcpy from
> the
> > > mempool cache into the descriptor rearm function because we know we
> can
> > > always do 64-byte loads and stores, and also because we know that
> for
> > > each
> > > load in the copy, we can reuse the data just after storing it
> (giving
> > > good
> > > perf boost). Perhaps something similar could work for you in your
> > > mempool
> > > optimization.
> > >
> > > /Bruce
> >
> > I'm going to copy array of pointers, specifically the 'objs' array in
> the rte_mempool_cache structure.
> >
> > The 'objs' array starts at byte 24, which is only 8 byte aligned. So
> it always fails the ALIGNMENT_MASK test in the x86 specific
> > rte_memcpy(), and thus cannot ever use the optimized
> rte_memcpy_aligned() function to copy the array, but will use the
> > rte_memcpy_generic() function.
> >
> > If the 'objs' array was optimally aligned, and the other array that
> is being copied to/from is also optimally aligned, rte_memcpy() would
> use
> > the optimized rte_memcpy_aligned() function.
> >
> > Please also note that the value of ALIGNMENT_MASK depends on which
> vector instruction set DPDK is being compiled with.
> >
> > The other CPU architectures have similar stuff in their rte_memcpy()
> implementations, and their alignment requirements are also different.
> >
> > Please also note that rte_memcpy() becomes even more optimized when
> the size of the memcpy() operation is known at compile time.
> 
> If the size is known at compile time, rte_memcpy() probably an overkill
> - modern compilers usually generate fast enough code for such cases.
> 
> >
> > So I am asking for a public #define __rte_memcpy_aligned I can use to
> meet the alignment requirements for optimal rte_memcpy().
> 
> Even on x86 ALIGNMENT_MASK could have different values (15/31/63)
> depending on ISA.
> So probably 64 as 'generic' one is the safest bet.

I will use cache line alignment for now.

> Though I wonder do we really need such micro-optimizations here?

I'm not sure, but since it's available, I will use it. :-)

And the mempool get/put functions are very frequently used, so I think we should squeeze out every bit of performance we can.

> Would it be such huge difference if you call rte_memcpy_aligned()
> instead of rte_memcpy()?

rte_memcpy_aligned() is x86 only.




* RE: rte_memcpy alignment
  2022-01-14 11:05       ` Morten Brørup
@ 2022-01-14 11:51         ` Ananyev, Konstantin
  2022-01-17 12:03           ` Morten Brørup
  0 siblings, 1 reply; 8+ messages in thread
From: Ananyev, Konstantin @ 2022-01-14 11:51 UTC (permalink / raw)
  To: Morten Brørup, Richardson,  Bruce
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, dev



> 
> > From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> > Sent: Friday, 14 January 2022 11.54
> >
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: Friday, January 14, 2022 9:54 AM
> > >
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Friday, 14 January 2022 10.11
> > > >
> > > > On Fri, Jan 14, 2022 at 09:56:50AM +0100, Morten Brørup wrote:
> > > > > Dear ARM/POWER/x86 maintainers,
> > > > >
> > > > > The architecture specific rte_memcpy() provides optimized
> > variants to
> > > > copy aligned data. However, the alignment requirements depend on
> > the
> > > > hardware architecture, and there is no common definition for the
> > > > alignment.
> > > > >
> > > > > DPDK provides __rte_cache_aligned for cache optimization
> > purposes,
> > > > with architecture specific values. Would you consider providing an
> > > > __rte_memcpy_aligned for rte_memcpy() optimization purposes?
> > > > >
> > > > > Or should I just use __rte_cache_aligned, although it is
> > overkill?
> > > > >
> > > > >
> > > > > Specifically, I am working on a mempool optimization where the
> > objs
> > > > field in the rte_mempool_cache structure may benefit by being
> > aligned
> > > > for optimized rte_memcpy().
> > > > >
> > > > For me the difficulty with such a memcpy proposal - apart from
> > probably
> > > > adding to the amount of memcpy code we have to maintain - is the
> > > > specific meaning
> > > > of what "aligned" in the memcpy case. Unlike for a struct
> > definition,
> > > > the
> > > > possible meaning of aligned in memcpy could be:
> > > > * the source address is aligned
> > > > * the destination address is aligned
> > > > * both source and destination is aligned
> > > > * both source and destination are aligned and the copy length is a
> > > > multiple
> > > >   of the alignment length
> > > > * the data is aligned to a cacheline boundary
> > > > * the data is aligned to the largest load-store size for system
> > > > * the data is aligned to the boundary suitable for the copy size,
> > e.g.
> > > >   memcpy of 8 bytes is 8-byte aligned etc.
> > > >
> > > > Can you clarify a bit more on your own thinking here? Personally, I
> > am
> > > > a
> > > > little dubious of the benefit of general memcpy optimization, but I
> > do
> > > > believe that for specific usecases there is value is having their
> > own
> > > > copy
> > > > operations which include constraints for that specific usecase. For
> > > > example, in the AVX-512 ice/i40e PMD code, we fold the memcpy from
> > the
> > > > mempool cache into the descriptor rearm function because we know we
> > can
> > > > always do 64-byte loads and stores, and also because we know that
> > for
> > > > each
> > > > load in the copy, we can reuse the data just after storing it
> > (giving
> > > > good
> > > > perf boost). Perhaps something similar could work for you in your
> > > > mempool
> > > > optimization.
> > > >
> > > > /Bruce
> > >
> > > I'm going to copy array of pointers, specifically the 'objs' array in
> > the rte_mempool_cache structure.
> > >
> > > The 'objs' array starts at byte 24, which is only 8 byte aligned. So
> > it always fails the ALIGNMENT_MASK test in the x86 specific
> > > rte_memcpy(), and thus cannot ever use the optimized
> > rte_memcpy_aligned() function to copy the array, but will use the
> > > rte_memcpy_generic() function.
> > >
> > > If the 'objs' array was optimally aligned, and the other array that
> > is being copied to/from is also optimally aligned, rte_memcpy() would
> > use
> > > the optimized rte_memcpy_aligned() function.
> > >
> > > Please also note that the value of ALIGNMENT_MASK depends on which
> > vector instruction set DPDK is being compiled with.
> > >
> > > The other CPU architectures have similar stuff in their rte_memcpy()
> > implementations, and their alignment requirements are also different.
> > >
> > > Please also note that rte_memcpy() becomes even more optimized when
> > the size of the memcpy() operation is known at compile time.
> >
> > If the size is known at compile time, rte_memcpy() probably an overkill
> > - modern compilers usually generate fast enough code for such cases.
> >
> > >
> > > So I am asking for a public #define __rte_memcpy_aligned I can use to
> > meet the alignment requirements for optimal rte_memcpy().
> >
> > Even on x86 ALIGNMENT_MASK could have different values (15/31/63)
> > depending on ISA.
> > So probably 64 as 'generic' one is the safest bet.
> 
> I will use cache line alignment for now.
> 
> > Though I wonder do we really need such micro-optimizations here?
> 
> I'm not sure, but since it's available, I will use it. :-)
> 
> And the mempool get/put functions are very frequently used, so I think we should squeeze out every bit of performance we can.

Well, it wouldn't come for free, right?
You would probably need to do some extra checking and add handling for non-aligned cases.
Anyway, I will probably just wait for the patch before going into further discussion. :)

> 
> > Would it be such huge difference if you call rte_memcpy_aligned()
> > instead of rte_memcpy()?
> 
> rte_memcpy_aligned() is x86 only.
> 



* RE: rte_memcpy alignment
  2022-01-14 11:51         ` Ananyev, Konstantin
@ 2022-01-17 12:03           ` Morten Brørup
  0 siblings, 0 replies; 8+ messages in thread
From: Morten Brørup @ 2022-01-17 12:03 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson,  Bruce
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, dev

> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> Sent: Friday, 14 January 2022 12.52
> 
> >
> > > From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> > > Sent: Friday, 14 January 2022 11.54
> > >
> > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > Sent: Friday, January 14, 2022 9:54 AM
> > > >
> > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > Sent: Friday, 14 January 2022 10.11
> > > > >
> > > > > On Fri, Jan 14, 2022 at 09:56:50AM +0100, Morten Brørup wrote:
> > > > > > Dear ARM/POWER/x86 maintainers,
> > > > > >
> > > > > > The architecture specific rte_memcpy() provides optimized
> > > variants to
> > > > > copy aligned data. However, the alignment requirements depend
> on
> > > the
> > > > > hardware architecture, and there is no common definition for
> the
> > > > > alignment.
> > > > > >
> > > > > > DPDK provides __rte_cache_aligned for cache optimization
> > > purposes,
> > > > > with architecture specific values. Would you consider providing
> an
> > > > > __rte_memcpy_aligned for rte_memcpy() optimization purposes?
> > > > > >
> > > > > > Or should I just use __rte_cache_aligned, although it is
> > > overkill?
> > > > > >
> > > > > >
> > > > > > Specifically, I am working on a mempool optimization where
> the
> > > objs
> > > > > field in the rte_mempool_cache structure may benefit by being
> > > aligned
> > > > > for optimized rte_memcpy().
> > > > > >
> > > > > For me the difficulty with such a memcpy proposal - apart from
> > > probably
> > > > > adding to the amount of memcpy code we have to maintain - is
> the
> > > > > specific meaning
> > > > > of what "aligned" in the memcpy case. Unlike for a struct
> > > definition,
> > > > > the
> > > > > possible meaning of aligned in memcpy could be:
> > > > > * the source address is aligned
> > > > > * the destination address is aligned
> > > > > * both source and destination is aligned
> > > > > * both source and destination are aligned and the copy length
> is a
> > > > > multiple
> > > > >   of the alignment length
> > > > > * the data is aligned to a cacheline boundary
> > > > > * the data is aligned to the largest load-store size for system
> > > > > * the data is aligned to the boundary suitable for the copy
> size,
> > > e.g.
> > > > >   memcpy of 8 bytes is 8-byte aligned etc.
> > > > >
> > > > > Can you clarify a bit more on your own thinking here?
> Personally, I
> > > am
> > > > > a
> > > > > little dubious of the benefit of general memcpy optimization,
> but I
> > > do
> > > > > believe that for specific usecases there is value is having
> their
> > > own
> > > > > copy
> > > > > operations which include constraints for that specific usecase.
> For
> > > > > example, in the AVX-512 ice/i40e PMD code, we fold the memcpy
> from
> > > the
> > > > > mempool cache into the descriptor rearm function because we
> know we
> > > can
> > > > > always do 64-byte loads and stores, and also because we know
> that
> > > for
> > > > > each
> > > > > load in the copy, we can reuse the data just after storing it
> > > (giving
> > > > > good
> > > > > perf boost). Perhaps something similar could work for you in
> your
> > > > > mempool
> > > > > optimization.
> > > > >
> > > > > /Bruce
> > > >
> > > > I'm going to copy array of pointers, specifically the 'objs'
> array in
> > > the rte_mempool_cache structure.
> > > >
> > > > The 'objs' array starts at byte 24, which is only 8 byte aligned.
> So
> > > it always fails the ALIGNMENT_MASK test in the x86 specific
> > > > rte_memcpy(), and thus cannot ever use the optimized
> > > rte_memcpy_aligned() function to copy the array, but will use the
> > > > rte_memcpy_generic() function.
> > > >
> > > > If the 'objs' array was optimally aligned, and the other array
> that
> > > is being copied to/from is also optimally aligned, rte_memcpy()
> would
> > > use
> > > > the optimized rte_memcpy_aligned() function.
> > > >
> > > > Please also note that the value of ALIGNMENT_MASK depends on
> which
> > > vector instruction set DPDK is being compiled with.
> > > >
> > > > The other CPU architectures have similar stuff in their
> rte_memcpy()
> > > implementations, and their alignment requirements are also
> different.
> > > >
> > > > Please also note that rte_memcpy() becomes even more optimized
> when
> > > the size of the memcpy() operation is known at compile time.
> > >
> > > If the size is known at compile time, rte_memcpy() probably an
> overkill
> > > - modern compilers usually generate fast enough code for such
> cases.
> > >
> > > >
> > > > So I am asking for a public #define __rte_memcpy_aligned I can
> use to
> > > meet the alignment requirements for optimal rte_memcpy().
> > >
> > > Even on x86 ALIGNMENT_MASK could have different values (15/31/63)
> > > depending on ISA.
> > > So probably 64 as 'generic' one is the safest bet.
> >
> > I will use cache line alignment for now.

Dear ARM/POWER/x86 maintainers,

Please forget my request.

I am quite confident that __rte_cache_aligned suffices for rte_memcpy() purposes too, so there is no need to introduce one more definition.

> >
> > > Though I wonder do we really need such micro-optimizations here?
> >
> > I'm not sure, but since it's available, I will use it. :-)
> >
> > And the mempool get/put functions are very frequently used, so I
> think we should squeeze out every bit of performance we can.
> 
> Well it wouldn't come for free, right?
> You would probably need to do some extra checking and add handling for
> non-aligned cases.
> Anyway, will probably just wait for the patch before going into further
> discussions :)

Konstantin was right!

The mempool_perf_autotest revealed that rte_memcpy() was inefficient, so I used a different method in the patch:

http://inbox.dpdk.org/dev/20220117115231.8060-1-mb@smartsharesystems.com/T/#u


