DPDK patches and discussions
 help / color / mirror / Atom feed
* [PATCH] net/af_packet: cache align Rx/Tx structs
@ 2024-04-23  9:08 Mattias Rönnblom
  2024-04-23 11:15 ` Ferruh Yigit
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-23  9:08 UTC (permalink / raw)
  To: John W . Linville; +Cc: dev, hofors, Mattias Rönnblom

Cache align Rx and Tx queue struct to avoid false sharing.

RX struct happens to be 64 bytes on x86_64 already, so cache alignment
makes no change there, but it does on 32-bit ISAs.

TX struct is 56 bytes on x86_64.

Both structs keep counters, and in the RX case they are updated even
for empty polls.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 397a32db58..28aeb7d08e 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -6,6 +6,7 @@
  * All rights reserved.
  */
 
+#include <rte_common.h>
 #include <rte_string_fns.h>
 #include <rte_mbuf.h>
 #include <ethdev_driver.h>
@@ -53,7 +54,7 @@ struct pkt_rx_queue {
 
 	volatile unsigned long rx_pkts;
 	volatile unsigned long rx_bytes;
-};
+} __rte_cache_aligned;
 
 struct pkt_tx_queue {
 	int sockfd;
@@ -67,7 +68,7 @@ struct pkt_tx_queue {
 	volatile unsigned long tx_pkts;
 	volatile unsigned long err_pkts;
 	volatile unsigned long tx_bytes;
-};
+} __rte_cache_aligned;
 
 struct pmd_internals {
 	unsigned nb_queues;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-23  9:08 [PATCH] net/af_packet: cache align Rx/Tx structs Mattias Rönnblom
@ 2024-04-23 11:15 ` Ferruh Yigit
  2024-04-23 20:56   ` Mattias Rönnblom
  2024-04-25 14:08   ` Ferruh Yigit
  2024-04-26  7:38 ` Mattias Rönnblom
  2024-04-26 21:27 ` [PATCH] " Patrick Robb
  2 siblings, 2 replies; 32+ messages in thread
From: Ferruh Yigit @ 2024-04-23 11:15 UTC (permalink / raw)
  To: Mattias Rönnblom, John W . Linville; +Cc: dev, hofors, Tyler Retzlaff

On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
> Cache align Rx and Tx queue struct to avoid false sharing.
> 
> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
> makes no change there, but it does on 32-bit ISAs.
> 
> TX struct is 56 bytes on x86_64.
> 

Hi Mattias,

No objection to the patch. Is the improvement theoretical or do you
measure any improvement practically, if so how much is the improvement?

> Both structs keep counters, and in the RX case they are updated even
> for empty polls.
> 

Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
the loop?

> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
> index 397a32db58..28aeb7d08e 100644
> --- a/drivers/net/af_packet/rte_eth_af_packet.c
> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> @@ -6,6 +6,7 @@
>   * All rights reserved.
>   */
>  
> +#include <rte_common.h>
>  #include <rte_string_fns.h>
>  #include <rte_mbuf.h>
>  #include <ethdev_driver.h>
> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>  
>  	volatile unsigned long rx_pkts;
>  	volatile unsigned long rx_bytes;
> -};
> +} __rte_cache_aligned;
>  

Latest location for '__rte_cache_aligned' tag is at the beginning of the
struct [1], so something like:
`struct __rte_cache_aligned pkt_rx_queue {`

[1]
https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-23 11:15 ` Ferruh Yigit
@ 2024-04-23 20:56   ` Mattias Rönnblom
  2024-04-24  0:27     ` Honnappa Nagarahalli
  2024-04-24 10:21     ` Ferruh Yigit
  2024-04-25 14:08   ` Ferruh Yigit
  1 sibling, 2 replies; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-23 20:56 UTC (permalink / raw)
  To: Ferruh Yigit, Mattias Rönnblom, John W . Linville
  Cc: dev, Tyler Retzlaff

On 2024-04-23 13:15, Ferruh Yigit wrote:
> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>> Cache align Rx and Tx queue struct to avoid false sharing.
>>
>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>> makes no change there, but it does on 32-bit ISAs.
>>
>> TX struct is 56 bytes on x86_64.
>>
> 
> Hi Mattias,
> 
> No objection to the patch. Is the improvement theoretical or do you
> measure any improvement practically, if so how much is the improvement?
> 

I didn't run any benchmarks.

Two cores storing to a (falsely) shared cache line on a per-packet basis 
is going to be very expensive, at least for "light touch" applications.

>> Both structs keep counters, and in the RX case they are updated even
>> for empty polls.
>>
> 
> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
> the loop?
> 

No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes 
are declared volatile, so you are forcing a load-modify-store cycle for 
every increment.

I would drop "volatile", or replace it with an atomic (although *not* 
use an atomic add for incrementing, but rather atomic load + <n> 
non-atomic adds + atomic store).

>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>> index 397a32db58..28aeb7d08e 100644
>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>> @@ -6,6 +6,7 @@
>>    * All rights reserved.
>>    */
>>   
>> +#include <rte_common.h>
>>   #include <rte_string_fns.h>
>>   #include <rte_mbuf.h>
>>   #include <ethdev_driver.h>
>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>   
>>   	volatile unsigned long rx_pkts;
>>   	volatile unsigned long rx_bytes;
>> -};
>> +} __rte_cache_aligned;
>>   
> 
> Latest location for '__rte_cache_aligned' tag is at the beginning of the
> struct [1], so something like:
> `struct __rte_cache_aligned pkt_rx_queue {`
> 
> [1]
> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-23 20:56   ` Mattias Rönnblom
@ 2024-04-24  0:27     ` Honnappa Nagarahalli
  2024-04-24  6:28       ` Mattias Rönnblom
  2024-04-24 10:21     ` Ferruh Yigit
  1 sibling, 1 reply; 32+ messages in thread
From: Honnappa Nagarahalli @ 2024-04-24  0:27 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Ferruh Yigit, Mattias Rönnblom, John W . Linville, dev,
	Tyler Retzlaff, nd, Honnappa Nagarahalli



> On Apr 23, 2024, at 3:56 PM, Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
> On 2024-04-23 13:15, Ferruh Yigit wrote:
>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>> 
>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>>> makes no change there, but it does on 32-bit ISAs.
>>> 
>>> TX struct is 56 bytes on x86_64.
>>> 
>> Hi Mattias,
>> No objection to the patch. Is the improvement theoretical or do you
>> measure any improvement practically, if so how much is the improvement?
> 
> I didn't run any benchmarks.
> 
> Two cores storing to a (falsely) shared cache line on a per-packet basis is going to be very expensive, at least for "light touch" applications.
> 
>>> Both structs keep counters, and in the RX case they are updated even
>>> for empty polls.
>>> 
>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>> the loop?
> 
> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes are declared volatile, so you are forcing a load-modify-store cycle for every increment.
> 
> I would drop "volatile", or replace it with an atomic (although *not* use an atomic add for incrementing, but rather atomic load + <n> non-atomic adds + atomic store).
(Slightly unrelated discussion)
Does the atomic load + increment + atomic store help in a non-contended case like this? Some platforms have optimizations for atomic-increments as well which would be missed.

> 
>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>> ---
>>>  drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>>> index 397a32db58..28aeb7d08e 100644
>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>> @@ -6,6 +6,7 @@
>>>   * All rights reserved.
>>>   */
>>>  +#include <rte_common.h>
>>>  #include <rte_string_fns.h>
>>>  #include <rte_mbuf.h>
>>>  #include <ethdev_driver.h>
>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>     volatile unsigned long rx_pkts;
>>>   volatile unsigned long rx_bytes;
>>> -};
>>> +} __rte_cache_aligned;
>>>  
>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>> struct [1], so something like:
>> `struct __rte_cache_aligned pkt_rx_queue {`
>> [1]
>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-24  0:27     ` Honnappa Nagarahalli
@ 2024-04-24  6:28       ` Mattias Rönnblom
  0 siblings, 0 replies; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-24  6:28 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Ferruh Yigit, Mattias Rönnblom, John W . Linville, dev,
	Tyler Retzlaff, nd

On 2024-04-24 02:27, Honnappa Nagarahalli wrote:
> 
> 
>> On Apr 23, 2024, at 3:56 PM, Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>
>> On 2024-04-23 13:15, Ferruh Yigit wrote:
>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>>
>>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>>>> makes no change there, but it does on 32-bit ISAs.
>>>>
>>>> TX struct is 56 bytes on x86_64.
>>>>
>>> Hi Mattias,
>>> No objection to the patch. Is the improvement theoretical or do you
>>> measure any improvement practically, if so how much is the improvement?
>>
>> I didn't run any benchmarks.
>>
>> Two cores storing to a (falsely) shared cache line on a per-packet basis is going to be very expensive, at least for "light touch" applications.
>>
>>>> Both structs keep counters, and in the RX case they are updated even
>>>> for empty polls.
>>>>
>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>>> the loop?
>>
>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes are declared volatile, so you are forcing a load-modify-store cycle for every increment.
>>
>> I would drop "volatile", or replace it with an atomic (although *not* use an atomic add for incrementing, but rather atomic load + <n> non-atomic adds + atomic store).
> (Slightly unrelated discussion)
> Does the atomic load + increment + atomic store help in a non-contended case like this? Some platforms have optimizations for atomic-increments as well which would be missed.
> 

Is it "far atomics" you have in mind? A C11 complaint compiler won't 
generate STADD (even for relaxed type ops), since it doesn't fit well 
into the C11 memory model. In particular, the issue is that STADD isn't 
ordered by the instructions generated by C11 fences, from what I understand.

GCC did generate STADD for a while, until this issue was discovered.

On x86_64, ADD <addr> is generally much faster than LOCK ADD <addr>. No 
wonder, since the latter is a full barrier.

If I'd worked for ARM I would probably had proposed some API extension 
to DPDK atomics API to allow the use of STADD (through inline 
assembler). One way to integrate it might be to add a new memory model, 
rte_memory_order_unordered or rte_memory_order_jumps_all_fences, where 
loads and stores are totally unordered.

However, in this particular case, it's a single-writer scenario, so 
rte_memory_order_kangaroo wouldn't really help, since it would be 
equivalent to rte_memory_order_relaxed on non-far atomics-capable ISAs, 
which would be too expensive. In the past I've argued for a 
single-writer version of atomic add/sub etc in the DPDK atomics API 
(whatever that is), for convenience and to make the issue/pattern known.

>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> ---
>>>>   drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> index 397a32db58..28aeb7d08e 100644
>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> @@ -6,6 +6,7 @@
>>>>    * All rights reserved.
>>>>    */
>>>>   +#include <rte_common.h>
>>>>   #include <rte_string_fns.h>
>>>>   #include <rte_mbuf.h>
>>>>   #include <ethdev_driver.h>
>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>      volatile unsigned long rx_pkts;
>>>>    volatile unsigned long rx_bytes;
>>>> -};
>>>> +} __rte_cache_aligned;
>>>>   
>>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>>> struct [1], so something like:
>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>> [1]
>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-23 20:56   ` Mattias Rönnblom
  2024-04-24  0:27     ` Honnappa Nagarahalli
@ 2024-04-24 10:21     ` Ferruh Yigit
  2024-04-24 10:28       ` Bruce Richardson
  2024-04-24 11:57       ` Mattias Rönnblom
  1 sibling, 2 replies; 32+ messages in thread
From: Ferruh Yigit @ 2024-04-24 10:21 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, John W . Linville
  Cc: dev, Tyler Retzlaff, Honnappa Nagarahalli

On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
> On 2024-04-23 13:15, Ferruh Yigit wrote:
>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>
>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>>> makes no change there, but it does on 32-bit ISAs.
>>>
>>> TX struct is 56 bytes on x86_64.
>>>
>>
>> Hi Mattias,
>>
>> No objection to the patch. Is the improvement theoretical or do you
>> measure any improvement practically, if so how much is the improvement?
>>
> 
> I didn't run any benchmarks.
> 
> Two cores storing to a (falsely) shared cache line on a per-packet basis
> is going to be very expensive, at least for "light touch" applications.
> 

ack
I expect for af_packet bottleneck is the kernel side, so I don't expect
any visible improvement practically, but OK to fix this theoretical issue.

>>> Both structs keep counters, and in the RX case they are updated even
>>> for empty polls.
>>>
>>
>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>> the loop?
>>
> 
> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
> are declared volatile, so you are forcing a load-modify-store cycle for
> every increment.
> 

My intention was to prevent updating stats in empty polls, I thought
stats will be hot in the cache but won't work with volatile.


> I would drop "volatile", or replace it with an atomic (although *not*
> use an atomic add for incrementing, but rather atomic load + <n>
> non-atomic adds + atomic store).
> 

As only single core polls from an Rx queue, I assume atomics is required
only for stats reset case. And for that case won't we need atomic add?

And do you know if there is a performance difference between atomic add
and keeping volatile qualifier?


>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>> ---
>>>   drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>> index 397a32db58..28aeb7d08e 100644
>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>> @@ -6,6 +6,7 @@
>>>    * All rights reserved.
>>>    */
>>>   +#include <rte_common.h>
>>>   #include <rte_string_fns.h>
>>>   #include <rte_mbuf.h>
>>>   #include <ethdev_driver.h>
>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>         volatile unsigned long rx_pkts;
>>>       volatile unsigned long rx_bytes;
>>> -};
>>> +} __rte_cache_aligned;
>>>   
>>
>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>> struct [1], so something like:
>> `struct __rte_cache_aligned pkt_rx_queue {`
>>
>> [1]
>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-24 10:21     ` Ferruh Yigit
@ 2024-04-24 10:28       ` Bruce Richardson
  2024-04-24 18:02         ` Ferruh Yigit
  2024-04-24 11:57       ` Mattias Rönnblom
  1 sibling, 1 reply; 32+ messages in thread
From: Bruce Richardson @ 2024-04-24 10:28 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: Mattias Rönnblom, Mattias Rönnblom, John W . Linville,
	dev, Tyler Retzlaff, Honnappa Nagarahalli

On Wed, Apr 24, 2024 at 11:21:52AM +0100, Ferruh Yigit wrote:
> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
> > On 2024-04-23 13:15, Ferruh Yigit wrote:
> >> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
> >>> Cache align Rx and Tx queue struct to avoid false sharing.
> >>>
> >>> RX struct happens to be 64 bytes on x86_64 already, so cache
> >>> alignment makes no change there, but it does on 32-bit ISAs.
> >>>
> >>> TX struct is 56 bytes on x86_64.
> >>>
> >>
> >> Hi Mattias,
> >>
> >> No objection to the patch. Is the improvement theoretical or do you
> >> measure any improvement practically, if so how much is the
> >> improvement?
> >>
> > 
> > I didn't run any benchmarks.
> > 
> > Two cores storing to a (falsely) shared cache line on a per-packet
> > basis is going to be very expensive, at least for "light touch"
> > applications.
> > 
> 
> ack I expect for af_packet bottleneck is the kernel side, so I don't
> expect any visible improvement practically, but OK to fix this
> theoretical issue.
> 
> >>> Both structs keep counters, and in the RX case they are updated even
> >>> for empty polls.
> >>>
> >>
> >> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
> >> the loop?
> >>
> > 
> > No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
> > are declared volatile, so you are forcing a load-modify-store cycle for
> > every increment.
> > 
> 
> My intention was to prevent updating stats in empty polls, I thought
> stats will be hot in the cache but won't work with volatile.
> 
Yes, it will. Volatile only prevents caching in registers, it does not
affect the storing of data within the cache hierarchy. Reads/writes of
stats on empty polls should indeed hit the L1 as expected. However, that is
still less efficient than just doing a register increment which could
theoretically be the result without the volatile.

/Bruce

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-24 10:21     ` Ferruh Yigit
  2024-04-24 10:28       ` Bruce Richardson
@ 2024-04-24 11:57       ` Mattias Rönnblom
  2024-04-24 17:50         ` Ferruh Yigit
  1 sibling, 1 reply; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-24 11:57 UTC (permalink / raw)
  To: Ferruh Yigit, Mattias Rönnblom, John W . Linville
  Cc: dev, Tyler Retzlaff, Honnappa Nagarahalli

On 2024-04-24 12:21, Ferruh Yigit wrote:
> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
>> On 2024-04-23 13:15, Ferruh Yigit wrote:
>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>>
>>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>>>> makes no change there, but it does on 32-bit ISAs.
>>>>
>>>> TX struct is 56 bytes on x86_64.
>>>>
>>>
>>> Hi Mattias,
>>>
>>> No objection to the patch. Is the improvement theoretical or do you
>>> measure any improvement practically, if so how much is the improvement?
>>>
>>
>> I didn't run any benchmarks.
>>
>> Two cores storing to a (falsely) shared cache line on a per-packet basis
>> is going to be very expensive, at least for "light touch" applications.
>>
> 
> ack
> I expect for af_packet bottleneck is the kernel side, so I don't expect
> any visible improvement practically, but OK to fix this theoretical issue.
> 

If you use af_packet as a slow path interface, you may well poll it 
often. If you are unlucky and aren't on an ISA where the RX struct *by 
accident* is cache-line aligned, you will suffer greatly from false 
sharing, because the counters are written to regardless if there are 
packets polled or not.

If you don't care about synchronization overhead, why even support 
multi-queue in the af_packet driver!?

>>>> Both structs keep counters, and in the RX case they are updated even
>>>> for empty polls.
>>>>
>>>
>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>>> the loop?
>>>
>>
>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
>> are declared volatile, so you are forcing a load-modify-store cycle for
>> every increment.
>>
> 
> My intention was to prevent updating stats in empty polls, I thought
> stats will be hot in the cache but won't work with volatile.
> 
> 
>> I would drop "volatile", or replace it with an atomic (although *not*
>> use an atomic add for incrementing, but rather atomic load + <n>
>> non-atomic adds + atomic store).
>>
> 
> As only single core polls from an Rx queue, I assume atomics is required
> only for stats reset case. And for that case won't we need atomic add?
> 

Correct me if I'm wrong here, but my impression is that statistics reset 
tend to be best-effort in DPDK. Or, expressed in a different way, are 
not thread-safe.

If you want to support MT safe counter reset, the counter increment 
operation must be an atomic add, since now you have two writers. (Unless 
you do something fancy and keep only a new offset upon reset, and never 
actually zero the counter.)

> And do you know if there is a performance difference between atomic add
> and keeping volatile qualifier?
> 
> 

I don't know how slow af_packet is, but if you care about performance, 
you don't want to use atomic add for statistics.

volatile in this case it's not a big issue, performance-wise, I would 
expect.

>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> ---
>>>>    drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>    1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> index 397a32db58..28aeb7d08e 100644
>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> @@ -6,6 +6,7 @@
>>>>     * All rights reserved.
>>>>     */
>>>>    +#include <rte_common.h>
>>>>    #include <rte_string_fns.h>
>>>>    #include <rte_mbuf.h>
>>>>    #include <ethdev_driver.h>
>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>          volatile unsigned long rx_pkts;
>>>>        volatile unsigned long rx_bytes;
>>>> -};
>>>> +} __rte_cache_aligned;
>>>>    
>>>
>>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>>> struct [1], so something like:
>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>>
>>> [1]
>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-24 11:57       ` Mattias Rönnblom
@ 2024-04-24 17:50         ` Ferruh Yigit
  2024-04-24 19:13           ` Stephen Hemminger
  0 siblings, 1 reply; 32+ messages in thread
From: Ferruh Yigit @ 2024-04-24 17:50 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, John W . Linville
  Cc: dev, Tyler Retzlaff, Honnappa Nagarahalli

On 4/24/2024 12:57 PM, Mattias Rönnblom wrote:
> On 2024-04-24 12:21, Ferruh Yigit wrote:
>> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
>>> On 2024-04-23 13:15, Ferruh Yigit wrote:
>>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>>>
>>>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>>>>> makes no change there, but it does on 32-bit ISAs.
>>>>>
>>>>> TX struct is 56 bytes on x86_64.
>>>>>
>>>>
>>>> Hi Mattias,
>>>>
>>>> No objection to the patch. Is the improvement theoretical or do you
>>>> measure any improvement practically, if so how much is the improvement?
>>>>
>>>
>>> I didn't run any benchmarks.
>>>
>>> Two cores storing to a (falsely) shared cache line on a per-packet basis
>>> is going to be very expensive, at least for "light touch" applications.
>>>
>>
>> ack
>> I expect for af_packet bottleneck is the kernel side, so I don't expect
>> any visible improvement practically, but OK to fix this theoretical
>> issue.
>>
> 
> If you use af_packet as a slow path interface, you may well poll it
> often. If you are unlucky and aren't on an ISA where the RX struct *by
> accident* is cache-line aligned, you will suffer greatly from false
> sharing, because the counters are written to regardless if there are
> packets polled or not.
> 
> If you don't care about synchronization overhead, why even support
> multi-queue in the af_packet driver!?
> 

We care, only I expect no measurable performance improvement and was
asking if you have data for it.

>>>>> Both structs keep counters, and in the RX case they are updated even
>>>>> for empty polls.
>>>>>
>>>>
>>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>>>> the loop?
>>>>
>>>
>>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
>>> are declared volatile, so you are forcing a load-modify-store cycle for
>>> every increment.
>>>
>>
>> My intention was to prevent updating stats in empty polls, I thought
>> stats will be hot in the cache but won't work with volatile.
>>
>>
>>> I would drop "volatile", or replace it with an atomic (although *not*
>>> use an atomic add for incrementing, but rather atomic load + <n>
>>> non-atomic adds + atomic store).
>>>
>>
>> As only single core polls from an Rx queue, I assume atomics is required
>> only for stats reset case. And for that case won't we need atomic add?
>>
> 
> Correct me if I'm wrong here, but my impression is that statistics reset
> tend to be best-effort in DPDK. Or, expressed in a different way, are
> not thread-safe.
> 

At least I don't see this is something documented. For cases HW tracks
stats we don't have this problem. Or some drivers use offset mechanism
you mentioned below.
But I can see some soft drivers will have this problem with stats clear.

> If you want to support MT safe counter reset, the counter increment
> operation must be an atomic add, since now you have two writers. (Unless
> you do something fancy and keep only a new offset upon reset, and never
> actually zero the counter.)
> 

Asking to learn, if we ignore the stats reset case and assume stats only
incremented, single writer case, do we still need either volatile
qualifier or atomic operations at all?

>> And do you know if there is a performance difference between atomic add
>> and keeping volatile qualifier?
>>
>>
> 
> I don't know how slow af_packet is, but if you care about performance,
> you don't want to use atomic add for statistics.
> 

There are a few soft drivers already using atomics adds for updating stats.
If we document expectations from 'rte_eth_stats_reset()', we can update
those usages.

> volatile in this case it's not a big issue, performance-wise, I would
> expect.
> 
>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>> ---
>>>>>    drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>>    1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> index 397a32db58..28aeb7d08e 100644
>>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> @@ -6,6 +6,7 @@
>>>>>     * All rights reserved.
>>>>>     */
>>>>>    +#include <rte_common.h>
>>>>>    #include <rte_string_fns.h>
>>>>>    #include <rte_mbuf.h>
>>>>>    #include <ethdev_driver.h>
>>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>>          volatile unsigned long rx_pkts;
>>>>>        volatile unsigned long rx_bytes;
>>>>> -};
>>>>> +} __rte_cache_aligned;
>>>>>    
>>>>
>>>> Latest location for '__rte_cache_aligned' tag is at the beginning of
>>>> the
>>>> struct [1], so something like:
>>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>>>
>>>> [1]
>>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-24 10:28       ` Bruce Richardson
@ 2024-04-24 18:02         ` Ferruh Yigit
  0 siblings, 0 replies; 32+ messages in thread
From: Ferruh Yigit @ 2024-04-24 18:02 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Mattias Rönnblom, Mattias Rönnblom, John W . Linville,
	dev, Tyler Retzlaff, Honnappa Nagarahalli

On 4/24/2024 11:28 AM, Bruce Richardson wrote:
> On Wed, Apr 24, 2024 at 11:21:52AM +0100, Ferruh Yigit wrote:
>> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
>>> On 2024-04-23 13:15, Ferruh Yigit wrote:
>>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>>>
>>>>> RX struct happens to be 64 bytes on x86_64 already, so cache
>>>>> alignment makes no change there, but it does on 32-bit ISAs.
>>>>>
>>>>> TX struct is 56 bytes on x86_64.
>>>>>
>>>>
>>>> Hi Mattias,
>>>>
>>>> No objection to the patch. Is the improvement theoretical or do you
>>>> measure any improvement practically, if so how much is the
>>>> improvement?
>>>>
>>>
>>> I didn't run any benchmarks.
>>>
>>> Two cores storing to a (falsely) shared cache line on a per-packet
>>> basis is going to be very expensive, at least for "light touch"
>>> applications.
>>>
>>
>> ack I expect for af_packet bottleneck is the kernel side, so I don't
>> expect any visible improvement practically, but OK to fix this
>> theoretical issue.
>>
>>>>> Both structs keep counters, and in the RX case they are updated even
>>>>> for empty polls.
>>>>>
>>>>
>>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>>>> the loop?
>>>>
>>>
>>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
>>> are declared volatile, so you are forcing a load-modify-store cycle for
>>> every increment.
>>>
>>
>> My intention was to prevent updating stats in empty polls, I thought
>> stats will be hot in the cache but won't work with volatile.
>>
> Yes, it will. Volatile only prevents caching in registers, it does not
> affect the storing of data within the cache hierarchy. Reads/writes of
> stats on empty polls should indeed hit the L1 as expected. However, that is
> still less efficient than just doing a register increment which could
> theoretically be the result without the volatile.
> 
> 

Thanks Bruce for clarification.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-24 17:50         ` Ferruh Yigit
@ 2024-04-24 19:13           ` Stephen Hemminger
  2024-04-24 22:27             ` Mattias Rönnblom
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Hemminger @ 2024-04-24 19:13 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: Mattias Rönnblom, Mattias Rönnblom, John W . Linville,
	dev, Tyler Retzlaff, Honnappa Nagarahalli

On Wed, 24 Apr 2024 18:50:50 +0100
Ferruh Yigit <ferruh.yigit@amd.com> wrote:

> > I don't know how slow af_packet is, but if you care about performance,
> > you don't want to use atomic add for statistics.
> >   
> 
> There are a few soft drivers already using atomics adds for updating stats.
> If we document expectations from 'rte_eth_stats_reset()', we can update
> those usages.

Using atomic add is lots of extra overhead. The statistics are not guaranteed
to be perfect.  If nothing else, the bytes and packets can be skewed.

The soft drivers af_xdp, af_packet, and tun performance is dominated by the
overhead of the kernel system call and copies. Yes, alignment is good
but won't be noticeable.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-24 19:13           ` Stephen Hemminger
@ 2024-04-24 22:27             ` Mattias Rönnblom
  2024-04-24 23:55               ` Stephen Hemminger
  0 siblings, 1 reply; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-24 22:27 UTC (permalink / raw)
  To: Stephen Hemminger, Ferruh Yigit
  Cc: Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff,
	Honnappa Nagarahalli

On 2024-04-24 21:13, Stephen Hemminger wrote:
> On Wed, 24 Apr 2024 18:50:50 +0100
> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> 
>>> I don't know how slow af_packet is, but if you care about performance,
>>> you don't want to use atomic add for statistics.
>>>    
>>
>> There are a few soft drivers already using atomics adds for updating stats.
>> If we document expectations from 'rte_eth_stats_reset()', we can update
>> those usages.
> 
> Using atomic add is lots of extra overhead. The statistics are not guaranteed
> to be perfect.  If nothing else, the bytes and packets can be skewed.
> 

The sad thing here is that in case the counters are reset within the 
load-modify-store cycle of the lcore counter update, the reset may end 
up being a nop. So, it's not like you missed a packet or two, or suffer 
some transient inconsistency, but you completed and permanently ignored 
the reset request.

> The soft drivers af_xdp, af_packet, and tun performance is dominated by the
> overhead of the kernel system call and copies. Yes, alignment is good
> but won't be noticeable.

There aren't any syscalls in the RX path in the af_packet PMD.

I added the same statistics updates as the af_packet PMD uses into an 
benchmark app which consumes ~1000 cc in-between stats updates.

If the equivalent of the RX queue struct was cache aligned, the 
statistics overhead was so small it was difficult to measure. Less than 
3-4 cc per update. This was with volatile, but without atomics.

If the RX queue struct wasn't cache aligned, and sized so a cache line 
generally was used by two (neighboring) cores, the stats incurred a cost 
of ~55 cc per update.

Shaving off 55 cc should translate to a couple of hundred percent 
increased performance for an empty af_packet poll. If your lcore has 
some other primary source of work than the af_packet RX queue, and the 
RX queue is polled often, then this may well be a noticeable gain.

The benchmark was run on 16 Gracemont cores, which in my experience 
seems to have a little shorter core-to-core latency than many other 
systems, provided the remote core/cache line owner is located in the 
same cluster.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-24 22:27             ` Mattias Rönnblom
@ 2024-04-24 23:55               ` Stephen Hemminger
  2024-04-25  9:26                 ` Mattias Rönnblom
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Hemminger @ 2024-04-24 23:55 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Ferruh Yigit, Mattias Rönnblom, John W . Linville, dev,
	Tyler Retzlaff, Honnappa Nagarahalli

On Thu, 25 Apr 2024 00:27:36 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> On 2024-04-24 21:13, Stephen Hemminger wrote:
> > On Wed, 24 Apr 2024 18:50:50 +0100
> > Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> >   
> >>> I don't know how slow af_packet is, but if you care about performance,
> >>> you don't want to use atomic add for statistics.
> >>>      
> >>
> >> There are a few soft drivers already using atomics adds for updating stats.
> >> If we document expectations from 'rte_eth_stats_reset()', we can update
> >> those usages.  
> > 
> > Using atomic add is lots of extra overhead. The statistics are not guaranteed
> > to be perfect.  If nothing else, the bytes and packets can be skewed.
> >   
> 
> The sad thing here is that in case the counters are reset within the 
> load-modify-store cycle of the lcore counter update, the reset may end 
> up being a nop. So, it's not like you missed a packet or two, or suffer 
> some transient inconsistency, but you completed and permanently ignored 
> the reset request.

That is one of the many reasons the Linux kernel intentionally did not
implement a reset statistics operation.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-24 23:55               ` Stephen Hemminger
@ 2024-04-25  9:26                 ` Mattias Rönnblom
  2024-04-25  9:49                   ` Morten Brørup
  2024-04-25 14:04                   ` Ferruh Yigit
  0 siblings, 2 replies; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-25  9:26 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Ferruh Yigit, Mattias Rönnblom, John W . Linville, dev,
	Tyler Retzlaff, Honnappa Nagarahalli

On 2024-04-25 01:55, Stephen Hemminger wrote:
> On Thu, 25 Apr 2024 00:27:36 +0200
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>> On 2024-04-24 21:13, Stephen Hemminger wrote:
>>> On Wed, 24 Apr 2024 18:50:50 +0100
>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>    
>>>>> I don't know how slow af_packet is, but if you care about performance,
>>>>> you don't want to use atomic add for statistics.
>>>>>       
>>>>
>>>> There are a few soft drivers already using atomics adds for updating stats.
>>>> If we document expectations from 'rte_eth_stats_reset()', we can update
>>>> those usages.
>>>
>>> Using atomic add is lots of extra overhead. The statistics are not guaranteed
>>> to be perfect.  If nothing else, the bytes and packets can be skewed.
>>>    
>>
>> The sad thing here is that in case the counters are reset within the
>> load-modify-store cycle of the lcore counter update, the reset may end
>> up being a nop. So, it's not like you missed a packet or two, or suffer
>> some transient inconsistency, but you completed and permanently ignored
>> the reset request.
> 
> That is one of the many reasons the Linux kernel intentionally did not
> implement a reset statistics operation.

DPDK should deprecate statistics reset, it seems to me.

The current API is broken anyway, if you care about correctness. A reset 
function must return the current state of the counters, at the point 
them being reset. Otherwise, a higher layer may miss counter updates.

The af_packet PMD should return -ENOTSUP on reset (which is allowed by 
the API).

Maintaining an offset-since-last-reset for counters is a control plane 
thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed 
counters are to be expected from the PMDs, we should have some helper 
API to facilitate its efficient & correct-enough implementation.)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-25  9:26                 ` Mattias Rönnblom
@ 2024-04-25  9:49                   ` Morten Brørup
  2024-04-25 14:04                   ` Ferruh Yigit
  1 sibling, 0 replies; 32+ messages in thread
From: Morten Brørup @ 2024-04-25  9:49 UTC (permalink / raw)
  To: Mattias Rönnblom, Stephen Hemminger
  Cc: Ferruh Yigit, Mattias Rönnblom, John W . Linville, dev,
	Tyler Retzlaff, Honnappa Nagarahalli

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Thursday, 25 April 2024 11.26
> 
> On 2024-04-25 01:55, Stephen Hemminger wrote:
> > On Thu, 25 Apr 2024 00:27:36 +0200
> > Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> >
> >> On 2024-04-24 21:13, Stephen Hemminger wrote:
> >>> On Wed, 24 Apr 2024 18:50:50 +0100
> >>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> >>>
> >>>>> I don't know how slow af_packet is, but if you care about performance,
> >>>>> you don't want to use atomic add for statistics.
> >>>>>
> >>>>
> >>>> There are a few soft drivers already using atomics adds for updating
> stats.
> >>>> If we document expectations from 'rte_eth_stats_reset()', we can update
> >>>> those usages.
> >>>
> >>> Using atomic add is lots of extra overhead. The statistics are not
> guaranteed
> >>> to be perfect.  If nothing else, the bytes and packets can be skewed.
> >>>
> >>
> >> The sad thing here is that in case the counters are reset within the
> >> load-modify-store cycle of the lcore counter update, the reset may end
> >> up being a nop. So, it's not like you missed a packet or two, or suffer
> >> some transient inconsistency, but you completed and permanently ignored
> >> the reset request.
> >
> > That is one of the many reasons the Linux kernel intentionally did not
> > implement a reset statistics operation.
> 
> DPDK should deprecate statistics reset, it seems to me.

+1

> 
> The current API is broken anyway, if you care about correctness. A reset
> function must return the current state of the counters, at the point
> them being reset. Otherwise, a higher layer may miss counter updates.
> 
> The af_packet PMD should return -ENOTSUP on reset (which is allowed by
> the API).
> 
> Maintaining an offset-since-last-reset for counters is a control plane
> thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed
> counters are to be expected from the PMDs, we should have some helper
> API to facilitate its efficient & correct-enough implementation.)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-25  9:26                 ` Mattias Rönnblom
  2024-04-25  9:49                   ` Morten Brørup
@ 2024-04-25 14:04                   ` Ferruh Yigit
  2024-04-25 15:06                     ` Mattias Rönnblom
  2024-04-25 15:07                     ` Stephen Hemminger
  1 sibling, 2 replies; 32+ messages in thread
From: Ferruh Yigit @ 2024-04-25 14:04 UTC (permalink / raw)
  To: Mattias Rönnblom, Stephen Hemminger
  Cc: Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff,
	Honnappa Nagarahalli, Morten Brørup

On 4/25/2024 10:26 AM, Mattias Rönnblom wrote:
> On 2024-04-25 01:55, Stephen Hemminger wrote:
>> On Thu, 25 Apr 2024 00:27:36 +0200
>> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>
>>> On 2024-04-24 21:13, Stephen Hemminger wrote:
>>>> On Wed, 24 Apr 2024 18:50:50 +0100
>>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>>   
>>>>>> I don't know how slow af_packet is, but if you care about
>>>>>> performance,
>>>>>> you don't want to use atomic add for statistics.
>>>>>>       
>>>>>
>>>>> There are a few soft drivers already using atomics adds for
>>>>> updating stats.
>>>>> If we document expectations from 'rte_eth_stats_reset()', we can
>>>>> update
>>>>> those usages.
>>>>
>>>> Using atomic add is lots of extra overhead. The statistics are not
>>>> guaranteed
>>>> to be perfect.  If nothing else, the bytes and packets can be skewed.
>>>>    
>>>
>>> The sad thing here is that in case the counters are reset within the
>>> load-modify-store cycle of the lcore counter update, the reset may end
>>> up being a nop. So, it's not like you missed a packet or two, or suffer
>>> some transient inconsistency, but you completed and permanently ignored
>>> the reset request.
>>
>> That is one of the many reasons the Linux kernel intentionally did not
>> implement a reset statistics operation.
> 
> DPDK should deprecate statistics reset, it seems to me.
> 
> The current API is broken anyway, if you care about correctness. A reset
> function must return the current state of the counters, at the point
> them being reset. Otherwise, a higher layer may miss counter updates.
> 
> The af_packet PMD should return -ENOTSUP on reset (which is allowed by
> the API).
> 
> Maintaining an offset-since-last-reset for counters is a control plane
> thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed
> counters are to be expected from the PMDs, we should have some helper
> API to facilitate its efficient & correct-enough implementation.)
>

statistics reset works for HW devices, instead of removing statics reset
I am for documenting API that it may be not reliable, I can send a patch
for it.

With above change, we can be more relax on stats update specially for
soft drivers, and can convert atomic_add stats updates to "atomic load +
add + atomic store".

Does this plan make sense?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-23 11:15 ` Ferruh Yigit
  2024-04-23 20:56   ` Mattias Rönnblom
@ 2024-04-25 14:08   ` Ferruh Yigit
  2024-04-25 15:08     ` Mattias Rönnblom
  1 sibling, 1 reply; 32+ messages in thread
From: Ferruh Yigit @ 2024-04-25 14:08 UTC (permalink / raw)
  To: Mattias Rönnblom, John W . Linville; +Cc: dev, hofors, Tyler Retzlaff

On 4/23/2024 12:15 PM, Ferruh Yigit wrote:
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>  drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>> index 397a32db58..28aeb7d08e 100644
>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>> @@ -6,6 +6,7 @@
>>   * All rights reserved.
>>   */
>>  
>> +#include <rte_common.h>
>>  #include <rte_string_fns.h>
>>  #include <rte_mbuf.h>
>>  #include <ethdev_driver.h>
>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>  
>>  	volatile unsigned long rx_pkts;
>>  	volatile unsigned long rx_bytes;
>> -};
>> +} __rte_cache_aligned;
>>  
> Latest location for '__rte_cache_aligned' tag is at the beginning of the
> struct [1], so something like:
> `struct __rte_cache_aligned pkt_rx_queue {`
> 
> [1]
> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>

Hi Mattias,

We dived into side discussions but with above change I can proceed with
the patch.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-25 14:04                   ` Ferruh Yigit
@ 2024-04-25 15:06                     ` Mattias Rönnblom
  2024-04-25 16:21                       ` Ferruh Yigit
  2024-04-25 15:07                     ` Stephen Hemminger
  1 sibling, 1 reply; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-25 15:06 UTC (permalink / raw)
  To: Ferruh Yigit, Stephen Hemminger
  Cc: Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff,
	Honnappa Nagarahalli, Morten Brørup

On 2024-04-25 16:04, Ferruh Yigit wrote:
> On 4/25/2024 10:26 AM, Mattias Rönnblom wrote:
>> On 2024-04-25 01:55, Stephen Hemminger wrote:
>>> On Thu, 25 Apr 2024 00:27:36 +0200
>>> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>>
>>>> On 2024-04-24 21:13, Stephen Hemminger wrote:
>>>>> On Wed, 24 Apr 2024 18:50:50 +0100
>>>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>>>    
>>>>>>> I don't know how slow af_packet is, but if you care about
>>>>>>> performance,
>>>>>>> you don't want to use atomic add for statistics.
>>>>>>>        
>>>>>>
>>>>>> There are a few soft drivers already using atomics adds for
>>>>>> updating stats.
>>>>>> If we document expectations from 'rte_eth_stats_reset()', we can
>>>>>> update
>>>>>> those usages.
>>>>>
>>>>> Using atomic add is lots of extra overhead. The statistics are not
>>>>> guaranteed
>>>>> to be perfect.  If nothing else, the bytes and packets can be skewed.
>>>>>     
>>>>
>>>> The sad thing here is that in case the counters are reset within the
>>>> load-modify-store cycle of the lcore counter update, the reset may end
>>>> up being a nop. So, it's not like you missed a packet or two, or suffer
>>>> some transient inconsistency, but you completed and permanently ignored
>>>> the reset request.
>>>
>>> That is one of the many reasons the Linux kernel intentionally did not
>>> implement a reset statistics operation.
>>
>> DPDK should deprecate statistics reset, it seems to me.
>>
>> The current API is broken anyway, if you care about correctness. A reset
>> function must return the current state of the counters, at the point
>> them being reset. Otherwise, a higher layer may miss counter updates.
>>
>> The af_packet PMD should return -ENOTSUP on reset (which is allowed by
>> the API).
>>
>> Maintaining an offset-since-last-reset for counters is a control plane
>> thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed
>> counters are to be expected from the PMDs, we should have some helper
>> API to facilitate its efficient & correct-enough implementation.)
>>
> 
> statistics reset works for HW devices, instead of removing statics reset
> I am for documenting API that it may be not reliable, I can send a patch
> for it.
> 

With API you mean <rte_ethdev.h>?

If rte_ethdev_stats_reset() sometimes reset the counters, and sometimes 
doesn't, it should also have a name that reflect those semantics.
rte_ethdev_stats_reset_or_perhaps_not()
rte_ethdev_stats_usually_reset()

Rather than expecting the application to deal with unspecified 
non-determinism is seems better to specify under which conditions the 
reset is reliable (i.e., it's not MT safe). A non-MT-safe reset will 
limit it usefulness though. Also, it will make having an MT safe reset 
in a PMD pretty useless, except if the app is programmed not toward the 
API, but toward some particular PMD.

> With above change, we can be more relax on stats update specially for
> soft drivers, and can convert atomic_add stats updates to "atomic load +
> add + atomic store".
> 
> Does this plan make sense?

Not really.

Short-term the -ENOTSUP seems like the best option. Second best option 
is to implement a proper MT safe reset.

What is unfortunate is that the API is silent on MT safety. I've 
*assumed* that many users will have assumed it MT safe, but there's 
nothing in the documentation to support that. Rather the opposite, the 
generic section of DPDK MT safety only mentions PMD RX and TX functions.

This issue is not limited to this PMD or even to ethdev. 
rte_event_dev_xstats_reset() and some of the event devs have the same 
problem.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-25 14:04                   ` Ferruh Yigit
  2024-04-25 15:06                     ` Mattias Rönnblom
@ 2024-04-25 15:07                     ` Stephen Hemminger
  1 sibling, 0 replies; 32+ messages in thread
From: Stephen Hemminger @ 2024-04-25 15:07 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: Mattias Rönnblom, Mattias Rönnblom, John W . Linville,
	dev, Tyler Retzlaff, Honnappa Nagarahalli, Morten Brørup

On Thu, 25 Apr 2024 15:04:27 +0100
Ferruh Yigit <ferruh.yigit@amd.com> wrote:

> > Maintaining an offset-since-last-reset for counters is a control plane
> > thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed
> > counters are to be expected from the PMDs, we should have some helper
> > API to facilitate its efficient & correct-enough implementation.)
> >  
> 
> statistics reset works for HW devices, instead of removing statics reset
> I am for documenting API that it may be not reliable, I can send a patch
> for it.
> 
> With above change, we can be more relax on stats update specially for
> soft drivers, and can convert atomic_add stats updates to "atomic load +
> add + atomic store".
> 
> Does this plan make sense?


+1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-25 14:08   ` Ferruh Yigit
@ 2024-04-25 15:08     ` Mattias Rönnblom
  2024-04-25 15:35       ` Ferruh Yigit
  0 siblings, 1 reply; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-25 15:08 UTC (permalink / raw)
  To: Ferruh Yigit, Mattias Rönnblom, John W . Linville
  Cc: dev, Tyler Retzlaff

On 2024-04-25 16:08, Ferruh Yigit wrote:
> On 4/23/2024 12:15 PM, Ferruh Yigit wrote:
>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>> ---
>>>   drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>>> index 397a32db58..28aeb7d08e 100644
>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>> @@ -6,6 +6,7 @@
>>>    * All rights reserved.
>>>    */
>>>   
>>> +#include <rte_common.h>
>>>   #include <rte_string_fns.h>
>>>   #include <rte_mbuf.h>
>>>   #include <ethdev_driver.h>
>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>   
>>>   	volatile unsigned long rx_pkts;
>>>   	volatile unsigned long rx_bytes;
>>> -};
>>> +} __rte_cache_aligned;
>>>   
>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>> struct [1], so something like:
>> `struct __rte_cache_aligned pkt_rx_queue {`
>>
>> [1]
>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>>
> 
> Hi Mattias,
> 
> We dived into side discussions but with above change I can proceed with
> the patch.

OK.

Should this go into some stable branch as well?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-25 15:08     ` Mattias Rönnblom
@ 2024-04-25 15:35       ` Ferruh Yigit
  2024-04-26  7:25         ` Mattias Rönnblom
  0 siblings, 1 reply; 32+ messages in thread
From: Ferruh Yigit @ 2024-04-25 15:35 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, John W . Linville
  Cc: dev, Tyler Retzlaff

On 4/25/2024 4:08 PM, Mattias Rönnblom wrote:
> On 2024-04-25 16:08, Ferruh Yigit wrote:
>> On 4/23/2024 12:15 PM, Ferruh Yigit wrote:
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> ---
>>>>   drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> index 397a32db58..28aeb7d08e 100644
>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> @@ -6,6 +6,7 @@
>>>>    * All rights reserved.
>>>>    */
>>>>   +#include <rte_common.h>
>>>>   #include <rte_string_fns.h>
>>>>   #include <rte_mbuf.h>
>>>>   #include <ethdev_driver.h>
>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>         volatile unsigned long rx_pkts;
>>>>       volatile unsigned long rx_bytes;
>>>> -};
>>>> +} __rte_cache_aligned;
>>>>   
>>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>>> struct [1], so something like:
>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>>
>>> [1]
>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>>>
>>
>> Hi Mattias,
>>
>> We dived into side discussions but with above change I can proceed with
>> the patch.
> 
> OK.
> 
> Should this go into some stable branch as well?
>

I don't see any reason for not merging to stable trees. For this please
add fixes and stable tags.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-25 15:06                     ` Mattias Rönnblom
@ 2024-04-25 16:21                       ` Ferruh Yigit
  0 siblings, 0 replies; 32+ messages in thread
From: Ferruh Yigit @ 2024-04-25 16:21 UTC (permalink / raw)
  To: Mattias Rönnblom, Stephen Hemminger
  Cc: Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff,
	Honnappa Nagarahalli, Morten Brørup

On 4/25/2024 4:06 PM, Mattias Rönnblom wrote:
> On 2024-04-25 16:04, Ferruh Yigit wrote:
>> On 4/25/2024 10:26 AM, Mattias Rönnblom wrote:
>>> On 2024-04-25 01:55, Stephen Hemminger wrote:
>>>> On Thu, 25 Apr 2024 00:27:36 +0200
>>>> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>>>
>>>>> On 2024-04-24 21:13, Stephen Hemminger wrote:
>>>>>> On Wed, 24 Apr 2024 18:50:50 +0100
>>>>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>>>>   
>>>>>>>> I don't know how slow af_packet is, but if you care about
>>>>>>>> performance,
>>>>>>>> you don't want to use atomic add for statistics.
>>>>>>>>        
>>>>>>>
>>>>>>> There are a few soft drivers already using atomics adds for
>>>>>>> updating stats.
>>>>>>> If we document expectations from 'rte_eth_stats_reset()', we can
>>>>>>> update
>>>>>>> those usages.
>>>>>>
>>>>>> Using atomic add is lots of extra overhead. The statistics are not
>>>>>> guaranteed
>>>>>> to be perfect.  If nothing else, the bytes and packets can be skewed.
>>>>>>     
>>>>>
>>>>> The sad thing here is that in case the counters are reset within the
>>>>> load-modify-store cycle of the lcore counter update, the reset may end
>>>>> up being a nop. So, it's not like you missed a packet or two, or
>>>>> suffer
>>>>> some transient inconsistency, but you completed and permanently
>>>>> ignored
>>>>> the reset request.
>>>>
>>>> That is one of the many reasons the Linux kernel intentionally did not
>>>> implement a reset statistics operation.
>>>
>>> DPDK should deprecate statistics reset, it seems to me.
>>>
>>> The current API is broken anyway, if you care about correctness. A reset
>>> function must return the current state of the counters, at the point
>>> them being reset. Otherwise, a higher layer may miss counter updates.
>>>
>>> The af_packet PMD should return -ENOTSUP on reset (which is allowed by
>>> the API).
>>>
>>> Maintaining an offset-since-last-reset for counters is a control plane
>>> thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed
>>> counters are to be expected from the PMDs, we should have some helper
>>> API to facilitate its efficient & correct-enough implementation.)
>>>
>>
>> statistics reset works for HW devices, instead of removing statics reset
>> I am for documenting API that it may be not reliable, I can send a patch
>> for it.
>>
> 
> With API you mean <rte_ethdev.h>?
> 
> If rte_ethdev_stats_reset() sometimes reset the counters, and sometimes
> doesn't, it should also have a name that reflect those semantics.
> rte_ethdev_stats_reset_or_perhaps_not()
> rte_ethdev_stats_usually_reset()
> 

:) point taken

> Rather than expecting the application to deal with unspecified
> non-determinism is seems better to specify under which conditions the
> reset is reliable (i.e., it's not MT safe). A non-MT-safe reset will
> limit it usefulness though. Also, it will make having an MT safe reset
> in a PMD pretty useless, except if the app is programmed not toward the
> API, but toward some particular PMD.
> 
>> With above change, we can be more relax on stats update specially for
>> soft drivers, and can convert atomic_add stats updates to "atomic load +
>> add + atomic store".
>>
>> Does this plan make sense?
> 
> Not really.
> 
> Short-term the -ENOTSUP seems like the best option. Second best option
> is to implement a proper MT safe reset.
> 
> What is unfortunate is that the API is silent on MT safety. I've
> *assumed* that many users will have assumed it MT safe, but there's
> nothing in the documentation to support that. Rather the opposite, the
> generic section of DPDK MT safety only mentions PMD RX and TX functions.
> 
> This issue is not limited to this PMD or even to ethdev.
> rte_event_dev_xstats_reset() and some of the event devs have the same
> problem.
>

True, and we can document multi thread safety of APIs at least, this is
small effort.

For reliable stats reset, implementing offset logic is not that hard,
and enables having more optimized stats update in datapath.
We can go with this option instead of unreliable stats reset.

I can add this at least for af_packet as sample.
As far as I can see, with this update we can get rid of volatile
qualifier and atomic set for stats variables.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-25 15:35       ` Ferruh Yigit
@ 2024-04-26  7:25         ` Mattias Rönnblom
  0 siblings, 0 replies; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-26  7:25 UTC (permalink / raw)
  To: Ferruh Yigit, Mattias Rönnblom, John W . Linville
  Cc: dev, Tyler Retzlaff

On 2024-04-25 17:35, Ferruh Yigit wrote:
> On 4/25/2024 4:08 PM, Mattias Rönnblom wrote:
>> On 2024-04-25 16:08, Ferruh Yigit wrote:
>>> On 4/23/2024 12:15 PM, Ferruh Yigit wrote:
>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>> ---
>>>>>    drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>>    1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> index 397a32db58..28aeb7d08e 100644
>>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> @@ -6,6 +6,7 @@
>>>>>     * All rights reserved.
>>>>>     */
>>>>>    +#include <rte_common.h>
>>>>>    #include <rte_string_fns.h>
>>>>>    #include <rte_mbuf.h>
>>>>>    #include <ethdev_driver.h>
>>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>>          volatile unsigned long rx_pkts;
>>>>>        volatile unsigned long rx_bytes;
>>>>> -};
>>>>> +} __rte_cache_aligned;
>>>>>    
>>>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>>>> struct [1], so something like:
>>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>>>
>>>> [1]
>>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>>>>
>>>
>>> Hi Mattias,
>>>
>>> We dived into side discussions but with above change I can proceed with
>>> the patch.
>>
>> OK.
>>
>> Should this go into some stable branch as well?
>>
> 
> I don't see any reason for not merging to stable trees. For this please
> add fixes and stable tags.

OK, I'll submit a v2. Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-23  9:08 [PATCH] net/af_packet: cache align Rx/Tx structs Mattias Rönnblom
  2024-04-23 11:15 ` Ferruh Yigit
@ 2024-04-26  7:38 ` Mattias Rönnblom
  2024-04-26  8:27   ` Ferruh Yigit
  2024-04-26  9:05   ` [PATCH v3] " Mattias Rönnblom
  2024-04-26 21:27 ` [PATCH] " Patrick Robb
  2 siblings, 2 replies; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-26  7:38 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger, Thomas Monjalon,
	John W . Linville, Ferruh Yigit, Mattias Rönnblom, stable

Cache align Rx and Tx queue struct to avoid false sharing.

RX struct happens to be 64 bytes on x86_64 already, so cache alignment
has no effect there, but it does on 32-bit ISAs.

TX struct is 56 bytes on x86_64.

Both structs keep counters, and in the RX case they are updated even
for empty polls.

Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
Cc: stable@dpdk.org

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 397a32db58..28aeb7d08e 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -6,6 +6,7 @@
  * All rights reserved.
  */
 
+#include <rte_common.h>
 #include <rte_string_fns.h>
 #include <rte_mbuf.h>
 #include <ethdev_driver.h>
@@ -53,7 +54,7 @@ struct pkt_rx_queue {
 
 	volatile unsigned long rx_pkts;
 	volatile unsigned long rx_bytes;
-};
+} __rte_cache_aligned;
 
 struct pkt_tx_queue {
 	int sockfd;
@@ -67,7 +68,7 @@ struct pkt_tx_queue {
 	volatile unsigned long tx_pkts;
 	volatile unsigned long err_pkts;
 	volatile unsigned long tx_bytes;
-};
+} __rte_cache_aligned;
 
 struct pmd_internals {
 	unsigned nb_queues;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-26  7:38 ` Mattias Rönnblom
@ 2024-04-26  8:27   ` Ferruh Yigit
  2024-04-26 10:20     ` Mattias Rönnblom
  2024-04-26  9:05   ` [PATCH v3] " Mattias Rönnblom
  1 sibling, 1 reply; 32+ messages in thread
From: Ferruh Yigit @ 2024-04-26  8:27 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Morten Brørup, Stephen Hemminger, Thomas Monjalon,
	John W . Linville, stable

On 4/26/2024 8:38 AM, Mattias Rönnblom wrote:
> Cache align Rx and Tx queue struct to avoid false sharing.
> 
> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
> has no effect there, but it does on 32-bit ISAs.
> 
> TX struct is 56 bytes on x86_64.
> 
> Both structs keep counters, and in the RX case they are updated even
> for empty polls.
> 
> Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
> index 397a32db58..28aeb7d08e 100644
> --- a/drivers/net/af_packet/rte_eth_af_packet.c
> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> @@ -6,6 +6,7 @@
>   * All rights reserved.
>   */
>  
> +#include <rte_common.h>
>  #include <rte_string_fns.h>
>  #include <rte_mbuf.h>
>  #include <ethdev_driver.h>
> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>  
>  	volatile unsigned long rx_pkts;
>  	volatile unsigned long rx_bytes;
> -};
> +} __rte_cache_aligned;
>  
>  struct pkt_tx_queue {
>  	int sockfd;
> @@ -67,7 +68,7 @@ struct pkt_tx_queue {
>  	volatile unsigned long tx_pkts;
>  	volatile unsigned long err_pkts;
>  	volatile unsigned long tx_bytes;
> -};
> +} __rte_cache_aligned;
>  
>  struct pmd_internals {
>  	unsigned nb_queues;

Hi Mattias,

I guess my comment missed, location of '__rte_cache_aligned' tag changed
for MSVC support, it should be like:
```
struct __rte_cache_aligned pkt_rx_queue {
...
};
```

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3] net/af_packet: cache align Rx/Tx structs
  2024-04-26  7:38 ` Mattias Rönnblom
  2024-04-26  8:27   ` Ferruh Yigit
@ 2024-04-26  9:05   ` Mattias Rönnblom
  2024-04-26  9:22     ` Morten Brørup
                       ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-26  9:05 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger, Thomas Monjalon,
	John W . Linville, Ferruh Yigit, Mattias Rönnblom, stable

Cache align Rx and Tx queue struct to avoid false sharing.

The RX struct happens to be 64 bytes on x86_64 already, so cache
alignment has no effect there, but it does on 32-bit ISAs.

The TX struct is 56 bytes on x86_64.

Both structs keep counters, and in the RX case they are updated even
for empty polls.

v3: Move __rte_cache_aligned directive to a MSVC-compatible location.

Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
Cc: stable@dpdk.org

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 397a32db58..6b7b16f348 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -6,6 +6,7 @@
  * All rights reserved.
  */
 
+#include <rte_common.h>
 #include <rte_string_fns.h>
 #include <rte_mbuf.h>
 #include <ethdev_driver.h>
@@ -39,7 +40,7 @@
 #define DFLT_FRAME_SIZE		(1 << 11)
 #define DFLT_FRAME_COUNT	(1 << 9)
 
-struct pkt_rx_queue {
+struct __rte_cache_aligned pkt_rx_queue {
 	int sockfd;
 
 	struct iovec *rd;
@@ -55,7 +56,7 @@ struct pkt_rx_queue {
 	volatile unsigned long rx_bytes;
 };
 
-struct pkt_tx_queue {
+struct __rte_cache_aligned pkt_tx_queue {
 	int sockfd;
 	unsigned int frame_data_size;
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH v3] net/af_packet: cache align Rx/Tx structs
  2024-04-26  9:05   ` [PATCH v3] " Mattias Rönnblom
@ 2024-04-26  9:22     ` Morten Brørup
  2024-04-26 15:10     ` Stephen Hemminger
  2024-04-26 15:41     ` Tyler Retzlaff
  2 siblings, 0 replies; 32+ messages in thread
From: Morten Brørup @ 2024-04-26  9:22 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Stephen Hemminger, Thomas Monjalon, John W . Linville,
	Ferruh Yigit, stable

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Friday, 26 April 2024 11.05
> 
> Cache align Rx and Tx queue struct to avoid false sharing.

Should an RTE_CACHE_GUARD be added at the end of these two structs?

If not,
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-26  8:27   ` Ferruh Yigit
@ 2024-04-26 10:20     ` Mattias Rönnblom
  0 siblings, 0 replies; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-26 10:20 UTC (permalink / raw)
  To: Ferruh Yigit, Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger, Thomas Monjalon,
	John W . Linville, stable

On 2024-04-26 10:27, Ferruh Yigit wrote:
> On 4/26/2024 8:38 AM, Mattias Rönnblom wrote:
>> Cache align Rx and Tx queue struct to avoid false sharing.
>>
>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>> has no effect there, but it does on 32-bit ISAs.
>>
>> TX struct is 56 bytes on x86_64.
>>
>> Both structs keep counters, and in the RX case they are updated even
>> for empty polls.
>>
>> Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>> index 397a32db58..28aeb7d08e 100644
>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>> @@ -6,6 +6,7 @@
>>    * All rights reserved.
>>    */
>>   
>> +#include <rte_common.h>
>>   #include <rte_string_fns.h>
>>   #include <rte_mbuf.h>
>>   #include <ethdev_driver.h>
>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>   
>>   	volatile unsigned long rx_pkts;
>>   	volatile unsigned long rx_bytes;
>> -};
>> +} __rte_cache_aligned;
>>   
>>   struct pkt_tx_queue {
>>   	int sockfd;
>> @@ -67,7 +68,7 @@ struct pkt_tx_queue {
>>   	volatile unsigned long tx_pkts;
>>   	volatile unsigned long err_pkts;
>>   	volatile unsigned long tx_bytes;
>> -};
>> +} __rte_cache_aligned;
>>   
>>   struct pmd_internals {
>>   	unsigned nb_queues;
> 
> Hi Mattias,
> 
> I guess my comment missed, location of '__rte_cache_aligned' tag changed
> for MSVC support, it should be like:
> ```
> struct __rte_cache_aligned pkt_rx_queue {
> ...
> };
> ```

Right. Sorry, forgot about that.

I've made a v3 (which is actually marked v3).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] net/af_packet: cache align Rx/Tx structs
  2024-04-26  9:05   ` [PATCH v3] " Mattias Rönnblom
  2024-04-26  9:22     ` Morten Brørup
@ 2024-04-26 15:10     ` Stephen Hemminger
  2024-04-26 15:41     ` Tyler Retzlaff
  2 siblings, 0 replies; 32+ messages in thread
From: Stephen Hemminger @ 2024-04-26 15:10 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Thomas Monjalon,
	John W . Linville, Ferruh Yigit, stable

On Fri, 26 Apr 2024 11:05:02 +0200
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> Cache align Rx and Tx queue struct to avoid false sharing.
> 
> The RX struct happens to be 64 bytes on x86_64 already, so cache
> alignment has no effect there, but it does on 32-bit ISAs.
> 
> The TX struct is 56 bytes on x86_64.
> 
> Both structs keep counters, and in the RX case they are updated even
> for empty polls.
> 
> v3: Move __rte_cache_aligned directive to a MSVC-compatible location.
> 
> Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Looks good.

Acked-by: Stephen Hemminger <stephen@networkplumber.org>

If you want to go further with this:
   - could rx and tx rings be next to each other so that any of the lookahead
     in cpu helps, not hurts?
   - what about secondary process support
   - would sendmmsg() and recvmmsg() help for bursting.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] net/af_packet: cache align Rx/Tx structs
  2024-04-26  9:05   ` [PATCH v3] " Mattias Rönnblom
  2024-04-26  9:22     ` Morten Brørup
  2024-04-26 15:10     ` Stephen Hemminger
@ 2024-04-26 15:41     ` Tyler Retzlaff
  2024-04-29  8:46       ` Ferruh Yigit
  2 siblings, 1 reply; 32+ messages in thread
From: Tyler Retzlaff @ 2024-04-26 15:41 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Stephen Hemminger,
	Thomas Monjalon, John W . Linville, Ferruh Yigit, stable

On Fri, Apr 26, 2024 at 11:05:02AM +0200, Mattias Rönnblom wrote:
> Cache align Rx and Tx queue struct to avoid false sharing.
> 
> The RX struct happens to be 64 bytes on x86_64 already, so cache
> alignment has no effect there, but it does on 32-bit ISAs.
> 
> The TX struct is 56 bytes on x86_64.
> 
> Both structs keep counters, and in the RX case they are updated even
> for empty polls.
> 
> v3: Move __rte_cache_aligned directive to a MSVC-compatible location.
> 
> Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-23  9:08 [PATCH] net/af_packet: cache align Rx/Tx structs Mattias Rönnblom
  2024-04-23 11:15 ` Ferruh Yigit
  2024-04-26  7:38 ` Mattias Rönnblom
@ 2024-04-26 21:27 ` Patrick Robb
  2 siblings, 0 replies; 32+ messages in thread
From: Patrick Robb @ 2024-04-26 21:27 UTC (permalink / raw)
  To: Mattias Rönnblom; +Cc: John W . Linville, dev, hofors

[-- Attachment #1: Type: text/plain, Size: 296 bytes --]

Recheck-request: iol-compile-amd64-testing

The DPDK Community Lab updated to the latest Alpine image yesterday, which
resulted in all Alpine builds failing. The failure is unrelated to your
patch, and this recheck should remove the fail on Patchwork, as we have
disabled Alpine testing for now.

[-- Attachment #2: Type: text/html, Size: 361 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] net/af_packet: cache align Rx/Tx structs
  2024-04-26 15:41     ` Tyler Retzlaff
@ 2024-04-29  8:46       ` Ferruh Yigit
  0 siblings, 0 replies; 32+ messages in thread
From: Ferruh Yigit @ 2024-04-29  8:46 UTC (permalink / raw)
  To: Tyler Retzlaff, Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Stephen Hemminger,
	Thomas Monjalon, John W . Linville, stable

On 4/26/2024 4:41 PM, Tyler Retzlaff wrote:
> On Fri, Apr 26, 2024 at 11:05:02AM +0200, Mattias Rönnblom wrote:
>> Cache align Rx and Tx queue struct to avoid false sharing.
>>
>> The RX struct happens to be 64 bytes on x86_64 already, so cache
>> alignment has no effect there, but it does on 32-bit ISAs.
>>
>> The TX struct is 56 bytes on x86_64.
>>
>> Both structs keep counters, and in the RX case they are updated even
>> for empty polls.
>>
>> v3: Move __rte_cache_aligned directive to a MSVC-compatible location.
>>
>> Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Stephen Hemminger <stephen@networkplumber.org>
> Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
> 

Applied to dpdk-next-net/main, thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2024-04-29  8:46 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-23  9:08 [PATCH] net/af_packet: cache align Rx/Tx structs Mattias Rönnblom
2024-04-23 11:15 ` Ferruh Yigit
2024-04-23 20:56   ` Mattias Rönnblom
2024-04-24  0:27     ` Honnappa Nagarahalli
2024-04-24  6:28       ` Mattias Rönnblom
2024-04-24 10:21     ` Ferruh Yigit
2024-04-24 10:28       ` Bruce Richardson
2024-04-24 18:02         ` Ferruh Yigit
2024-04-24 11:57       ` Mattias Rönnblom
2024-04-24 17:50         ` Ferruh Yigit
2024-04-24 19:13           ` Stephen Hemminger
2024-04-24 22:27             ` Mattias Rönnblom
2024-04-24 23:55               ` Stephen Hemminger
2024-04-25  9:26                 ` Mattias Rönnblom
2024-04-25  9:49                   ` Morten Brørup
2024-04-25 14:04                   ` Ferruh Yigit
2024-04-25 15:06                     ` Mattias Rönnblom
2024-04-25 16:21                       ` Ferruh Yigit
2024-04-25 15:07                     ` Stephen Hemminger
2024-04-25 14:08   ` Ferruh Yigit
2024-04-25 15:08     ` Mattias Rönnblom
2024-04-25 15:35       ` Ferruh Yigit
2024-04-26  7:25         ` Mattias Rönnblom
2024-04-26  7:38 ` Mattias Rönnblom
2024-04-26  8:27   ` Ferruh Yigit
2024-04-26 10:20     ` Mattias Rönnblom
2024-04-26  9:05   ` [PATCH v3] " Mattias Rönnblom
2024-04-26  9:22     ` Morten Brørup
2024-04-26 15:10     ` Stephen Hemminger
2024-04-26 15:41     ` Tyler Retzlaff
2024-04-29  8:46       ` Ferruh Yigit
2024-04-26 21:27 ` [PATCH] " Patrick Robb

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).