DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] rte_ether_addr_copy() strange comment
@ 2020-06-25 15:45 Morten Brørup
  2020-06-26 12:08 ` Ferruh Yigit
  0 siblings, 1 reply; 8+ messages in thread
From: Morten Brørup @ 2020-06-25 15:45 UTC (permalink / raw)
  To: dev; +Cc: Olivier Matz

The function rte_ether_addr_copy() checks for __INTEL_COMPILER and has a comment about "a strange gcc warning". It says:

static inline void rte_ether_addr_copy(const struct rte_ether_addr *ea_from,
				   struct rte_ether_addr *ea_to)
{
#ifdef __INTEL_COMPILER
	uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
	uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);

	to_words[0] = from_words[0];
	to_words[1] = from_words[1];
	to_words[2] = from_words[2];
#else
	/*
	 * Use the common way, because of a strange gcc warning.
	 */
	*ea_to = *ea_from;
#endif
}

I can see that from_words discards the const qualifier. Is that the "strange" gcc warning the comment is referring to?

This goes back to before the first public release of DPDK in 2013, ref. https://git.dpdk.org/dpdk/log/lib/librte_ether/rte_ether.h


It should be fixed as follows:

-	uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
-	uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);
+	const uint16_t *from_words = (const uint16_t *)ea_from;
+	uint16_t       *to_words   = (uint16_t *)ea_to;

And the consequential question: Is copying the three shorts faster than copying the struct? In other words: Should we get rid of the #ifdef and use the first method only?

PS: I will provide a patch which improves rte_is_broadcast_ether_addr() too. The magic formula here is: return (w[0] & w[1] & w[2]) == 0xFFFF;


Med venlig hilsen / kind regards
- Morten Brørup



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] rte_ether_addr_copy() strange comment
  2020-06-25 15:45 [dpdk-dev] rte_ether_addr_copy() strange comment Morten Brørup
@ 2020-06-26 12:08 ` Ferruh Yigit
  2020-06-26 12:34   ` Morten Brørup
  2020-06-26 12:41   ` Van Haaren, Harry
  0 siblings, 2 replies; 8+ messages in thread
From: Ferruh Yigit @ 2020-06-26 12:08 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: Olivier Matz, Harry Van Haaren, Konstantin Ananyev

On 6/25/2020 4:45 PM, Morten Brørup wrote:
> The function rte_ether_addr_copy() checks for __INTEL_COMPILER and has a comment about "a strange gcc warning". It says:
> 
> static inline void rte_ether_addr_copy(const struct rte_ether_addr *ea_from,
> 				   struct rte_ether_addr *ea_to)
> {
> #ifdef __INTEL_COMPILER
> 	uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
> 	uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);
> 
> 	to_words[0] = from_words[0];
> 	to_words[1] = from_words[1];
> 	to_words[2] = from_words[2];
> #else
> 	/*
> 	 * Use the common way, because of a strange gcc warning.
> 	 */
> 	*ea_to = *ea_from;
> #endif
> }
> 
> I can see that from_words discards the const qualifier. Is that the "strange" gcc warning the comment is referring to?
> 
> This goes back to before the first public release of DPDK in 2013, ref. https://git.dpdk.org/dpdk/log/lib/librte_ether/rte_ether.h
> 
> 
> It should be fixed as follows:
> 
> -	uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
> -	uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);
> +	const uint16_t *from_words = (const uint16_t *)ea_from;
> +	uint16_t       *to_words   = (uint16_t *)ea_to;
> 
> And the consequential question: Is copying the three shorts faster than copying the struct? In other words: Should we get rid of the #ifdef and use the first method only?


I tried to investigate this in godbolt: https://godbolt.org/z/YSmvDn

First I don't see the "strange" gcc warning with various gcc versions there.

Related to the struct copy vs word copy, struct copy seems with less instruction
[1],[2],
my vote to remove ifdef and keep struct copy.


[1] copy as individual function
[1a] gcc 10.1, struct copy:
copy:
        movdqa  (%rsi), %xmm0
        movaps  %xmm0, (%rdi)
        ret

[1b] gcc 10.1, word copy:
copy:
        movzwl  (%rsi), %eax
        movw    %ax, (%rdi)
        movzwl  2(%rsi), %eax
        movw    %ax, 2(%rdi)
        movzwl  4(%rsi), %eax
        movw    %ax, 4(%rdi)
        ret

[1c] icc 19.0.1, struct copy
copy:
        movups    (%rsi), %xmm0                                 #19.13
        movups    %xmm0, (%rdi)                                 #19.13
        ret


[2] gcc 10.1, copy as inline function that knows the data, both seems similar
// .addr = {1, 1, 1, 1, 1, 1},
[2a] struct copy:
...
        movl    $257, %eax
        movw    %ax, 4(%rsp)
        leaq    16(%rsp), %rdi
        movl    $16843009, (%rsp)
        movdqa  (%rsp), %xmm0
        movaps  %xmm0, 16(%rsp)
...

[2b] word copy:
        movl    $257, %eax
        movq    %rsp, %rdi
        movw    %ax, 4(%rsp)
        movl    $16843009, (%rsp)

> 
> PS: I will provide a patch which improves rte_is_broadcast_ether_addr() too. The magic formula here is: return (w[0] & w[1] & w[2]) == 0xFFFF;
> 
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] rte_ether_addr_copy() strange comment
  2020-06-26 12:08 ` Ferruh Yigit
@ 2020-06-26 12:34   ` Morten Brørup
  2020-06-26 14:37     ` Jerin Jacob
  2020-06-26 12:41   ` Van Haaren, Harry
  1 sibling, 1 reply; 8+ messages in thread
From: Morten Brørup @ 2020-06-26 12:34 UTC (permalink / raw)
  To: Ferruh Yigit, dev; +Cc: Olivier Matz, Harry Van Haaren, Konstantin Ananyev

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ferruh Yigit
> Sent: Friday, June 26, 2020 2:08 PM
> 
> On 6/25/2020 4:45 PM, Morten Brørup wrote:
> > The function rte_ether_addr_copy() checks for __INTEL_COMPILER and
> has a comment about "a strange gcc warning". It says:
> >
> > static inline void rte_ether_addr_copy(const struct rte_ether_addr
> *ea_from,
> > 				   struct rte_ether_addr *ea_to)
> > {
> > #ifdef __INTEL_COMPILER
> > 	uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
> > 	uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);
> >
> > 	to_words[0] = from_words[0];
> > 	to_words[1] = from_words[1];
> > 	to_words[2] = from_words[2];
> > #else
> > 	/*
> > 	 * Use the common way, because of a strange gcc warning.
> > 	 */
> > 	*ea_to = *ea_from;
> > #endif
> > }
> >
> > I can see that from_words discards the const qualifier. Is that the
> "strange" gcc warning the comment is referring to?
> >
> > This goes back to before the first public release of DPDK in 2013,
> ref. https://git.dpdk.org/dpdk/log/lib/librte_ether/rte_ether.h
> >
> >
> > It should be fixed as follows:
> >
> > -	uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
> > -	uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);
> > +	const uint16_t *from_words = (const uint16_t *)ea_from;
> > +	uint16_t       *to_words   = (uint16_t *)ea_to;
> >
> > And the consequential question: Is copying the three shorts faster
> than copying the struct? In other words: Should we get rid of the
> #ifdef and use the first method only?
> 
> 
> I tried to investigate this in godbolt: https://godbolt.org/z/YSmvDn
> 
> First I don't see the "strange" gcc warning with various gcc versions
> there.
> 
> Related to the struct copy vs word copy, struct copy seems with less
> instruction
> [1],[2],
> my vote to remove ifdef and keep struct copy.
> 
> 
> [1] copy as individual function
> [1a] gcc 10.1, struct copy:
> copy:
>         movdqa  (%rsi), %xmm0
>         movaps  %xmm0, (%rdi)
>         ret
> 
> [1b] gcc 10.1, word copy:
> copy:
>         movzwl  (%rsi), %eax
>         movw    %ax, (%rdi)
>         movzwl  2(%rsi), %eax
>         movw    %ax, 2(%rdi)
>         movzwl  4(%rsi), %eax
>         movw    %ax, 4(%rdi)
>         ret
> 
> [1c] icc 19.0.1, struct copy
> copy:
>         movups    (%rsi), %xmm0                                 #19.13
>         movups    %xmm0, (%rdi)                                 #19.13
>         ret
> 
> 
> [2] gcc 10.1, copy as inline function that knows the data, both seems
> similar
> // .addr = {1, 1, 1, 1, 1, 1},
> [2a] struct copy:
> ...
>         movl    $257, %eax
>         movw    %ax, 4(%rsp)
>         leaq    16(%rsp), %rdi
>         movl    $16843009, (%rsp)
>         movdqa  (%rsp), %xmm0
>         movaps  %xmm0, 16(%rsp)
> ...
> 
> [2b] word copy:
>         movl    $257, %eax
>         movq    %rsp, %rdi
>         movw    %ax, 4(%rsp)
>         movl    $16843009, (%rsp)
> 

Thank you for the detailed response, Ferruh.

I didn't know about godbolt, so thank you for that reference too.

The address struct is 2 byte aligned, not 16 byte aligned. Modifying your test in godbolt to use 2 byte alignment gives a similar result, i.e. fewer instructions on both icc and gcc.

[1c-modified] icc 19.0.1, struct copy

copy:
        movl      (%rsi), %eax                                  #19.13
        movl      %eax, (%rdi)                                  #19.13
        movzwl    4(%rsi), %edx                                 #19.13
        movw      %dx, 4(%rdi)                                  #19.13
        ret                                                     #28.1

[1d-modified] icc 19.0.1, word copy
copy:
        movzwl    (%rsi), %eax                                  #24.12
        movw      %ax, (%rdi)                                   #24.5
        movzwl    2(%rsi), %edx                                 #25.12
        movw      %dx, 2(%rdi)                                  #25.5
        movzwl    4(%rsi), %ecx                                 #26.12
        movw      %cx, 4(%rdi)                                  #26.5
        ret                                                     #28.1

Testing for ARM64 on godbolt gives a similar result: more instructions using word copy than struct copy.

In conclusion, I will proceed with the struct copy.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] rte_ether_addr_copy() strange comment
  2020-06-26 12:08 ` Ferruh Yigit
  2020-06-26 12:34   ` Morten Brørup
@ 2020-06-26 12:41   ` Van Haaren, Harry
  2020-06-26 15:54     ` Ferruh Yigit
  1 sibling, 1 reply; 8+ messages in thread
From: Van Haaren, Harry @ 2020-06-26 12:41 UTC (permalink / raw)
  To: Yigit, Ferruh, Morten Brørup, dev; +Cc: Olivier Matz, Ananyev, Konstantin

> -----Original Message-----
> From: Yigit, Ferruh <ferruh.yigit@intel.com>
> Sent: Friday, June 26, 2020 1:08 PM
> To: Morten Brørup <mb@smartsharesystems.com>; dev@dpdk.org
> Cc: Olivier Matz <olivier.matz@6wind.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] rte_ether_addr_copy() strange comment
> 
> On 6/25/2020 4:45 PM, Morten Brørup wrote:
> > The function rte_ether_addr_copy() checks for __INTEL_COMPILER and has a
> comment about "a strange gcc warning". It says:
> >
> > static inline void rte_ether_addr_copy(const struct rte_ether_addr *ea_from,
> > 				   struct rte_ether_addr *ea_to)
> > {
> > #ifdef __INTEL_COMPILER
> > 	uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
> > 	uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);
> >
> > 	to_words[0] = from_words[0];
> > 	to_words[1] = from_words[1];
> > 	to_words[2] = from_words[2];
> > #else
> > 	/*
> > 	 * Use the common way, because of a strange gcc warning.
> > 	 */
> > 	*ea_to = *ea_from;
> > #endif
> > }
> >
> > I can see that from_words discards the const qualifier. Is that the "strange" gcc
> warning the comment is referring to?
> >
> > This goes back to before the first public release of DPDK in 2013, ref.
> https://git.dpdk.org/dpdk/log/lib/librte_ether/rte_ether.h
> >
> >
> > It should be fixed as follows:
> >
> > -	uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
> > -	uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);
> > +	const uint16_t *from_words = (const uint16_t *)ea_from;
> > +	uint16_t       *to_words   = (uint16_t *)ea_to;
> >
> > And the consequential question: Is copying the three shorts faster than
> copying the struct? In other words: Should we get rid of the #ifdef and use the
> first method only?
> 
> 
> I tried to investigate this in godbolt: https://godbolt.org/z/YSmvDn

There was a small hiccup in the struct mac definition, it is aligned to 2, not 16 as the above Godbolt link...
With the aligned attribute changed to 2 (as per DPDK header https://git.dpdk.org/dpdk/tree/lib/librte_net/rte_ether.h#n59 )
we get the required (but less performant) smaller stores:

WORD_COPY = 0, Aligned = 16
NOTE: Incorrect alignment provided, and invalid ASM as it stores over the 10 bytes after eth addr.
This code is from a GodBolt test only, and this bug is NOT present in upstream DPDK.
        movdqa  (%rsi), %xmm0
        movaps  %xmm0, (%rdi)
        ret

WORD_COPY = 0, Aligned = 2 (correct, as per dpdk header)
        movl    (%rsi), %eax
        movl    %eax, (%rdi)
        movzwl  4(%rsi), %eax
        movw    %ax, 4(%rdi)
        ret

Word Copy = 1 (aligned = 2)
        movzwl  (%rsi), %eax
        movw    %ax, (%rdi)
        movzwl  2(%rsi), %eax
        movw    %ax, 2(%rdi)
        movzwl  4(%rsi), %eax
        movw    %ax, 4(%rdi)
        ret

<snip previous output>

Ferruh said:
> Related to the struct copy vs word copy, struct copy seems with less instruction [1],[2],
> my vote to remove ifdef and keep struct copy.

+1 for struct copy here too, it gives the compiler space to optimize within its alignment bounds.
It does the best that it can (given a 6 byte area to store into), with 1x 4byte store, 1x 2byte store.

Regards, -Harry

PS: For extra bonus points, here's a SIMD version that only uses one store
https://godbolt.org/z/VAR2La. Unless you intend on copying billions of
L1 resident eth addrs, this may or may not be a useful optimization.
Note that it requires the 10 bytes after the ether addr to be valid to read.
It loads 16B across both SRC and DST, blends 48 bits of SRC into DST and
writes the result back to DST.
        movdqu  (%rsi), %xmm0
        movdqu  (%rdi), %xmm1
        pblendw $7, %xmm1, %xmm0
        movups  %xmm0, (%rdi)
        ret

Actually, its possible to do this using a uint64_t (8 byte scalar) load/store too,
with some masking and bitwise OR... left as an exercise to the reader? :)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] rte_ether_addr_copy() strange comment
  2020-06-26 12:34   ` Morten Brørup
@ 2020-06-26 14:37     ` Jerin Jacob
  0 siblings, 0 replies; 8+ messages in thread
From: Jerin Jacob @ 2020-06-26 14:37 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Ferruh Yigit, dpdk-dev, Olivier Matz, Harry Van Haaren,
	Konstantin Ananyev

On Fri, Jun 26, 2020 at 6:05 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ferruh Yigit
> > Sent: Friday, June 26, 2020 2:08 PM
> >
> > On 6/25/2020 4:45 PM, Morten Brørup wrote:
> > > The function rte_ether_addr_copy() checks for __INTEL_COMPILER and
> > has a comment about "a strange gcc warning". It says:
> > >
> > > static inline void rte_ether_addr_copy(const struct rte_ether_addr
> > *ea_from,
> > >                                struct rte_ether_addr *ea_to)
> > > {
> > > #ifdef __INTEL_COMPILER
> > >     uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
> > >     uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);
> > >
> > >     to_words[0] = from_words[0];
> > >     to_words[1] = from_words[1];
> > >     to_words[2] = from_words[2];
> > > #else
> > >     /*
> > >      * Use the common way, because of a strange gcc warning.
> > >      */
> > >     *ea_to = *ea_from;
> > > #endif
> > > }
> > >
> > > I can see that from_words discards the const qualifier. Is that the
> > "strange" gcc warning the comment is referring to?
> > >
> > > This goes back to before the first public release of DPDK in 2013,
> > ref. https://git.dpdk.org/dpdk/log/lib/librte_ether/rte_ether.h
> > >
> > >
> > > It should be fixed as follows:
> > >
> > > -   uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
> > > -   uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);
> > > +   const uint16_t *from_words = (const uint16_t *)ea_from;
> > > +   uint16_t       *to_words   = (uint16_t *)ea_to;
> > >
> > > And the consequential question: Is copying the three shorts faster
> > than copying the struct? In other words: Should we get rid of the
> > #ifdef and use the first method only?
> >
> >
> > I tried to investigate this in godbolt: https://godbolt.org/z/YSmvDn
> >
> > First I don't see the "strange" gcc warning with various gcc versions
> > there.
> >
> > Related to the struct copy vs word copy, struct copy seems with less
> > instruction
> > [1],[2],
> > my vote to remove ifdef and keep struct copy.
> >
> >
> > [1] copy as individual function
> > [1a] gcc 10.1, struct copy:
> > copy:
> >         movdqa  (%rsi), %xmm0
> >         movaps  %xmm0, (%rdi)
> >         ret
> >
> > [1b] gcc 10.1, word copy:
> > copy:
> >         movzwl  (%rsi), %eax
> >         movw    %ax, (%rdi)
> >         movzwl  2(%rsi), %eax
> >         movw    %ax, 2(%rdi)
> >         movzwl  4(%rsi), %eax
> >         movw    %ax, 4(%rdi)
> >         ret
> >
> > [1c] icc 19.0.1, struct copy
> > copy:
> >         movups    (%rsi), %xmm0                                 #19.13
> >         movups    %xmm0, (%rdi)                                 #19.13
> >         ret
> >
> >
> > [2] gcc 10.1, copy as inline function that knows the data, both seems
> > similar
> > // .addr = {1, 1, 1, 1, 1, 1},
> > [2a] struct copy:
> > ...
> >         movl    $257, %eax
> >         movw    %ax, 4(%rsp)
> >         leaq    16(%rsp), %rdi
> >         movl    $16843009, (%rsp)
> >         movdqa  (%rsp), %xmm0
> >         movaps  %xmm0, 16(%rsp)
> > ...
> >
> > [2b] word copy:
> >         movl    $257, %eax
> >         movq    %rsp, %rdi
> >         movw    %ax, 4(%rsp)
> >         movl    $16843009, (%rsp)
> >
>
> Thank you for the detailed response, Ferruh.
>
> I didn't know about godbolt, so thank you for that reference too.
>
> The address struct is 2 byte aligned, not 16 byte aligned. Modifying your test in godbolt to use 2 byte alignment gives a similar result, i.e. fewer instructions on both icc and gcc.
>
> [1c-modified] icc 19.0.1, struct copy
>
> copy:
>         movl      (%rsi), %eax                                  #19.13
>         movl      %eax, (%rdi)                                  #19.13
>         movzwl    4(%rsi), %edx                                 #19.13
>         movw      %dx, 4(%rdi)                                  #19.13
>         ret                                                     #28.1
>
> [1d-modified] icc 19.0.1, word copy
> copy:
>         movzwl    (%rsi), %eax                                  #24.12
>         movw      %ax, (%rdi)                                   #24.5
>         movzwl    2(%rsi), %edx                                 #25.12
>         movw      %dx, 2(%rdi)                                  #25.5
>         movzwl    4(%rsi), %ecx                                 #26.12
>         movw      %cx, 4(%rdi)                                  #26.5
>         ret                                                     #28.1
>
> Testing for ARM64 on godbolt gives a similar result: more instructions using word copy than struct copy.
>
> In conclusion, I will proceed with the struct copy.

Since you are up to changing the code, Could you add __restrict
keyword for the further hint to the compiler for struct copy case?

http://mails.dpdk.org/archives/dev/2020-June/169876.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] rte_ether_addr_copy() strange comment
  2020-06-26 12:41   ` Van Haaren, Harry
@ 2020-06-26 15:54     ` Ferruh Yigit
  2020-06-26 17:28       ` Van Haaren, Harry
  0 siblings, 1 reply; 8+ messages in thread
From: Ferruh Yigit @ 2020-06-26 15:54 UTC (permalink / raw)
  To: Van Haaren, Harry, Morten Brørup, dev
  Cc: Olivier Matz, Ananyev, Konstantin

On 6/26/2020 1:41 PM, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Yigit, Ferruh <ferruh.yigit@intel.com>
>> Sent: Friday, June 26, 2020 1:08 PM
>> To: Morten Brørup <mb@smartsharesystems.com>; dev@dpdk.org
>> Cc: Olivier Matz <olivier.matz@6wind.com>; Van Haaren, Harry
>> <harry.van.haaren@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>
>> Subject: Re: [dpdk-dev] rte_ether_addr_copy() strange comment
>>
>> On 6/25/2020 4:45 PM, Morten Brørup wrote:
>>> The function rte_ether_addr_copy() checks for __INTEL_COMPILER and has a
>> comment about "a strange gcc warning". It says:
>>>
>>> static inline void rte_ether_addr_copy(const struct rte_ether_addr *ea_from,
>>>    struct rte_ether_addr *ea_to)
>>> {
>>> #ifdef __INTEL_COMPILER
>>> uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
>>> uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);
>>>
>>> to_words[0] = from_words[0];
>>> to_words[1] = from_words[1];
>>> to_words[2] = from_words[2];
>>> #else
>>> /*
>>>  * Use the common way, because of a strange gcc warning.
>>>  */
>>> *ea_to = *ea_from;
>>> #endif
>>> }
>>>
>>> I can see that from_words discards the const qualifier. Is that the "strange" gcc
>> warning the comment is referring to?
>>>
>>> This goes back to before the first public release of DPDK in 2013, ref.
>> https://git.dpdk.org/dpdk/log/lib/librte_ether/rte_ether.h
>>>
>>>
>>> It should be fixed as follows:
>>>
>>> -uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes);
>>> -uint16_t *to_words   = (uint16_t *)(ea_to->addr_bytes);
>>> +const uint16_t *from_words = (const uint16_t *)ea_from;
>>> +uint16_t       *to_words   = (uint16_t *)ea_to;
>>>
>>> And the consequential question: Is copying the three shorts faster than
>> copying the struct? In other words: Should we get rid of the #ifdef and use the
>> first method only?
>>
>>
>> I tried to investigate this in godbolt: https://godbolt.org/z/YSmvDn
> 
> There was a small hiccup in the struct mac definition, it is aligned to 2, not 16 as the above Godbolt link...
> With the aligned attribute changed to 2 (as per DPDK header https://git.dpdk.org/dpdk/tree/lib/librte_net/rte_ether.h#n59 )
> we get the required (but less performant) smaller stores:
> 
> WORD_COPY = 0, Aligned = 16
> NOTE: Incorrect alignment provided, and invalid ASM as it stores over the 10 bytes after eth addr.
> This code is from a GodBolt test only, and this bug is NOT present in upstream DPDK.

Thanks for the clarification. (not sure how I end up testing 16 byte alignment)

<...>

> PS: For extra bonus points, here's a SIMD version that only uses one store
> https://godbolt.org/z/VAR2La. Unless you intend on copying billions of
> L1 resident eth addrs, this may or may not be a useful optimization.
> Note that it requires the 10 bytes after the ether addr to be valid to read.
> It loads 16B across both SRC and DST, blends 48 bits of SRC into DST and
> writes the result back to DST.
>         movdqu  (%rsi), %xmm0
>         movdqu  (%rdi), %xmm1
>         pblendw $7, %xmm1, %xmm0
>         movups  %xmm0, (%rdi)
>         ret
> 
> Actually, its possible to do this using a uint64_t (8 byte scalar) load/store too,
> with some masking and bitwise OR... left as an exercise to the reader? :)
> 
Does below work? (not for real life usage, just to experiment single store
solutions :) [https://godbolt.org/z/TmqwQh]

        movzwl  6(%rdi), %eax
        salq    $48, %rax
        orq     (%rsi), %rax
        movq    %rax, (%rdi)
        ret

----

void copy(struct mac *dst, const struct mac *src) {
    uint64_t *s = (uint64_t *) &src->addr;
    uint64_t *d = (uint64_t *) &dst->addr;
    uint16_t dd = ((uint16_t *)d)[3];
    *d = (*s & ~(0xffffUL<48)) | ((uint64_t)dd << 48);
}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] rte_ether_addr_copy() strange comment
  2020-06-26 15:54     ` Ferruh Yigit
@ 2020-06-26 17:28       ` Van Haaren, Harry
  2020-06-26 18:04         ` Stephen Hemminger
  0 siblings, 1 reply; 8+ messages in thread
From: Van Haaren, Harry @ 2020-06-26 17:28 UTC (permalink / raw)
  To: Yigit, Ferruh, Morten Brørup, dev; +Cc: Olivier Matz, Ananyev, Konstantin

> -----Original Message-----
> From: Yigit, Ferruh <ferruh.yigit@intel.com>
> Sent: Friday, June 26, 2020 4:54 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Morten Brørup
> <mb@smartsharesystems.com>; dev@dpdk.org
> Cc: Olivier Matz <olivier.matz@6wind.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] rte_ether_addr_copy() strange comment
> 
> On 6/26/2020 1:41 PM, Van Haaren, Harry wrote:
> >> -----Original Message-----

<snip serious conversation>

> > PS: For extra bonus points, here's a SIMD version that only uses one store
> > https://godbolt.org/z/VAR2La. Unless you intend on copying billions of
> > L1 resident eth addrs, this may or may not be a useful optimization.
> > Note that it requires the 10 bytes after the ether addr to be valid to read.
> > It loads 16B across both SRC and DST, blends 48 bits of SRC into DST and
> > writes the result back to DST.
> >         movdqu  (%rsi), %xmm0
> >         movdqu  (%rdi), %xmm1
> >         pblendw $7, %xmm1, %xmm0
> >         movups  %xmm0, (%rdi)
> >         ret
> >
> > Actually, its possible to do this using a uint64_t (8 byte scalar) load/store too,
> > with some masking and bitwise OR... left as an exercise to the reader? :)
> >
> Does below work? (not for real life usage, just to experiment single store
> solutions :) [https://godbolt.org/z/TmqwQh]
> 
>         movzwl  6(%rdi), %eax
>         salq    $48, %rax
>         orq     (%rsi), %rax
>         movq    %rax, (%rdi)
>         ret
> 
> ----
> 
> void copy(struct mac *dst, const struct mac *src) {
>     uint64_t *s = (uint64_t *) &src->addr;
>     uint64_t *d = (uint64_t *) &dst->addr;
>     uint16_t dd = ((uint16_t *)d)[3];
>     *d = (*s & ~(0xffffUL<48)) | ((uint64_t)dd << 48);
> }

My code-golf reviewing skills are probably not 100% at end-of-day on a Friday.. so I wrote a unit test ;)
Seems to check out yet - readers beware - this solution still overwrites 2 bytes past the dst mac data itself.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] rte_ether_addr_copy() strange comment
  2020-06-26 17:28       ` Van Haaren, Harry
@ 2020-06-26 18:04         ` Stephen Hemminger
  0 siblings, 0 replies; 8+ messages in thread
From: Stephen Hemminger @ 2020-06-26 18:04 UTC (permalink / raw)
  To: Van Haaren, Harry
  Cc: Yigit, Ferruh, Morten Brørup, dev, Olivier Matz, Ananyev,
	Konstantin

On Fri, 26 Jun 2020 17:28:49 +0000
"Van Haaren, Harry" <harry.van.haaren@intel.com> wrote:

> > -----Original Message-----
> > From: Yigit, Ferruh <ferruh.yigit@intel.com>
> > Sent: Friday, June 26, 2020 4:54 PM
> > To: Van Haaren, Harry <harry.van.haaren@intel.com>; Morten Brørup
> > <mb@smartsharesystems.com>; dev@dpdk.org
> > Cc: Olivier Matz <olivier.matz@6wind.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] rte_ether_addr_copy() strange comment
> > 
> > On 6/26/2020 1:41 PM, Van Haaren, Harry wrote:  
> > >> -----Original Message-----  
> 
> <snip serious conversation>
> 
> > > PS: For extra bonus points, here's a SIMD version that only uses one store
> > > https://godbolt.org/z/VAR2La. Unless you intend on copying billions of
> > > L1 resident eth addrs, this may or may not be a useful optimization.
> > > Note that it requires the 10 bytes after the ether addr to be valid to read.
> > > It loads 16B across both SRC and DST, blends 48 bits of SRC into DST and
> > > writes the result back to DST.
> > >         movdqu  (%rsi), %xmm0
> > >         movdqu  (%rdi), %xmm1
> > >         pblendw $7, %xmm1, %xmm0
> > >         movups  %xmm0, (%rdi)
> > >         ret
> > >
> > > Actually, its possible to do this using a uint64_t (8 byte scalar) load/store too,
> > > with some masking and bitwise OR... left as an exercise to the reader? :)
> > >  
> > Does below work? (not for real life usage, just to experiment single store
> > solutions :) [https://godbolt.org/z/TmqwQh]
> > 
> >         movzwl  6(%rdi), %eax
> >         salq    $48, %rax
> >         orq     (%rsi), %rax
> >         movq    %rax, (%rdi)
> >         ret
> > 
> > ----
> > 
> > void copy(struct mac *dst, const struct mac *src) {
> >     uint64_t *s = (uint64_t *) &src->addr;
> >     uint64_t *d = (uint64_t *) &dst->addr;
> >     uint16_t dd = ((uint16_t *)d)[3];
> >     *d = (*s & ~(0xffffUL<48)) | ((uint64_t)dd << 48);
> > }  
> 
> My code-golf reviewing skills are probably not 100% at end-of-day on a Friday.. so I wrote a unit test ;)
> Seems to check out yet - readers beware - this solution still overwrites 2 bytes past the dst mac data itself.
> 

The Linux kernel equivalent is:

static inline void ether_addr_copy(u8 *dst, const u8 *src)
{
#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
	*(u32 *)dst = *(const u32 *)src;
	*(u16 *)(dst + 4) = *(const u16 *)(src + 4);
#else
	u16 *a = (u16 *)dst;
	const u16 *b = (const u16 *)src;

	a[0] = b[0];
	a[1] = b[1];
	a[2] = b[2];
#endif
}

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-06-26 18:04 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-25 15:45 [dpdk-dev] rte_ether_addr_copy() strange comment Morten Brørup
2020-06-26 12:08 ` Ferruh Yigit
2020-06-26 12:34   ` Morten Brørup
2020-06-26 14:37     ` Jerin Jacob
2020-06-26 12:41   ` Van Haaren, Harry
2020-06-26 15:54     ` Ferruh Yigit
2020-06-26 17:28       ` Van Haaren, Harry
2020-06-26 18:04         ` Stephen Hemminger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).