* [PATCH] net/af_packet: cache align Rx/Tx structs @ 2024-04-23 9:08 Mattias Rönnblom 2024-04-23 11:15 ` Ferruh Yigit ` (2 more replies) 0 siblings, 3 replies; 32+ messages in thread From: Mattias Rönnblom @ 2024-04-23 9:08 UTC (permalink / raw) To: John W . Linville; +Cc: dev, hofors, Mattias Rönnblom Cache align Rx and Tx queue struct to avoid false sharing. RX struct happens to be 64 bytes on x86_64 already, so cache alignment makes no change there, but it does on 32-bit ISAs. TX struct is 56 bytes on x86_64. Both structs keep counters, and in the RX case they are updated even for empty polls. Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 397a32db58..28aeb7d08e 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -6,6 +6,7 @@ * All rights reserved. */ +#include <rte_common.h> #include <rte_string_fns.h> #include <rte_mbuf.h> #include <ethdev_driver.h> @@ -53,7 +54,7 @@ struct pkt_rx_queue { volatile unsigned long rx_pkts; volatile unsigned long rx_bytes; -}; +} __rte_cache_aligned; struct pkt_tx_queue { int sockfd; @@ -67,7 +68,7 @@ struct pkt_tx_queue { volatile unsigned long tx_pkts; volatile unsigned long err_pkts; volatile unsigned long tx_bytes; -}; +} __rte_cache_aligned; struct pmd_internals { unsigned nb_queues; -- 2.34.1 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-23 9:08 [PATCH] net/af_packet: cache align Rx/Tx structs Mattias Rönnblom @ 2024-04-23 11:15 ` Ferruh Yigit 2024-04-23 20:56 ` Mattias Rönnblom 2024-04-25 14:08 ` Ferruh Yigit 2024-04-26 7:38 ` Mattias Rönnblom 2024-04-26 21:27 ` [PATCH] " Patrick Robb 2 siblings, 2 replies; 32+ messages in thread From: Ferruh Yigit @ 2024-04-23 11:15 UTC (permalink / raw) To: Mattias Rönnblom, John W . Linville; +Cc: dev, hofors, Tyler Retzlaff On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: > Cache align Rx and Tx queue struct to avoid false sharing. > > RX struct happens to be 64 bytes on x86_64 already, so cache alignment > makes no change there, but it does on 32-bit ISAs. > > TX struct is 56 bytes on x86_64. > Hi Mattias, No objection to the patch. Is the improvement theoretical or do you measure any improvement practically, if so how much is the improvement? > Both structs keep counters, and in the RX case they are updated even > for empty polls. > Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within the loop? > Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> > --- > drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c > index 397a32db58..28aeb7d08e 100644 > --- a/drivers/net/af_packet/rte_eth_af_packet.c > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > @@ -6,6 +6,7 @@ > * All rights reserved. 
> */ > > +#include <rte_common.h> > #include <rte_string_fns.h> > #include <rte_mbuf.h> > #include <ethdev_driver.h> > @@ -53,7 +54,7 @@ struct pkt_rx_queue { > > volatile unsigned long rx_pkts; > volatile unsigned long rx_bytes; > -}; > +} __rte_cache_aligned; > Latest location for '__rte_cache_aligned' tag is at the beginning of the struct [1], so something like: `struct __rte_cache_aligned pkt_rx_queue {` [1] https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-23 11:15 ` Ferruh Yigit @ 2024-04-23 20:56 ` Mattias Rönnblom 2024-04-24 0:27 ` Honnappa Nagarahalli 2024-04-24 10:21 ` Ferruh Yigit 2024-04-25 14:08 ` Ferruh Yigit 1 sibling, 2 replies; 32+ messages in thread From: Mattias Rönnblom @ 2024-04-23 20:56 UTC (permalink / raw) To: Ferruh Yigit, Mattias Rönnblom, John W . Linville Cc: dev, Tyler Retzlaff On 2024-04-23 13:15, Ferruh Yigit wrote: > On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: >> Cache align Rx and Tx queue struct to avoid false sharing. >> >> RX struct happens to be 64 bytes on x86_64 already, so cache alignment >> makes no change there, but it does on 32-bit ISAs. >> >> TX struct is 56 bytes on x86_64. >> > > Hi Mattias, > > No objection to the patch. Is the improvement theoretical or do you > measure any improvement practically, if so how much is the improvement? > I didn't run any benchmarks. Two cores storing to a (falsely) shared cache line on a per-packet basis is going to be very expensive, at least for "light touch" applications. >> Both structs keep counters, and in the RX case they are updated even >> for empty polls. >> > > Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within > the loop? > No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes are declared volatile, so you are forcing a load-modify-store cycle for every increment. I would drop "volatile", or replace it with an atomic (although *not* use an atomic add for incrementing, but rather atomic load + <n> non-atomic adds + atomic store). 
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> >> --- >> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c >> index 397a32db58..28aeb7d08e 100644 >> --- a/drivers/net/af_packet/rte_eth_af_packet.c >> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >> @@ -6,6 +6,7 @@ >> * All rights reserved. >> */ >> >> +#include <rte_common.h> >> #include <rte_string_fns.h> >> #include <rte_mbuf.h> >> #include <ethdev_driver.h> >> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >> >> volatile unsigned long rx_pkts; >> volatile unsigned long rx_bytes; >> -}; >> +} __rte_cache_aligned; >> > > Latest location for '__rte_cache_aligned' tag is at the beginning of the > struct [1], so something like: > `struct __rte_cache_aligned pkt_rx_queue {` > > [1] > https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-23 20:56 ` Mattias Rönnblom @ 2024-04-24 0:27 ` Honnappa Nagarahalli 2024-04-24 6:28 ` Mattias Rönnblom 2024-04-24 10:21 ` Ferruh Yigit 1 sibling, 1 reply; 32+ messages in thread From: Honnappa Nagarahalli @ 2024-04-24 0:27 UTC (permalink / raw) To: Mattias Rönnblom Cc: Ferruh Yigit, Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, nd, Honnappa Nagarahalli > On Apr 23, 2024, at 3:56 PM, Mattias Rönnblom <hofors@lysator.liu.se> wrote: > > On 2024-04-23 13:15, Ferruh Yigit wrote: >> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: >>> Cache align Rx and Tx queue struct to avoid false sharing. >>> >>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment >>> makes no change there, but it does on 32-bit ISAs. >>> >>> TX struct is 56 bytes on x86_64. >>> >> Hi Mattias, >> No objection to the patch. Is the improvement theoretical or do you >> measure any improvement practically, if so how much is the improvement? > > I didn't run any benchmarks. > > Two cores storing to a (falsely) shared cache line on a per-packet basis is going to be very expensive, at least for "light touch" applications. > >>> Both structs keep counters, and in the RX case they are updated even >>> for empty polls. >>> >> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within >> the loop? > > No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes are declared volatile, so you are forcing a load-modify-store cycle for every increment. > > I would drop "volatile", or replace it with an atomic (although *not* use an atomic add for incrementing, but rather atomic load + <n> non-atomic adds + atomic store). (Slightly unrelated discussion) Does the atomic load + increment + atomic store help in a non-contended case like this? Some platforms have optimizations for atomic-increments as well which would be missed. 
> >>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> >>> --- >>> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c >>> index 397a32db58..28aeb7d08e 100644 >>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>> @@ -6,6 +6,7 @@ >>> * All rights reserved. >>> */ >>> +#include <rte_common.h> >>> #include <rte_string_fns.h> >>> #include <rte_mbuf.h> >>> #include <ethdev_driver.h> >>> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >>> volatile unsigned long rx_pkts; >>> volatile unsigned long rx_bytes; >>> -}; >>> +} __rte_cache_aligned; >>> >> Latest location for '__rte_cache_aligned' tag is at the beginning of the >> struct [1], so something like: >> `struct __rte_cache_aligned pkt_rx_queue {` >> [1] >> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-24 0:27 ` Honnappa Nagarahalli @ 2024-04-24 6:28 ` Mattias Rönnblom 0 siblings, 0 replies; 32+ messages in thread From: Mattias Rönnblom @ 2024-04-24 6:28 UTC (permalink / raw) To: Honnappa Nagarahalli Cc: Ferruh Yigit, Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, nd On 2024-04-24 02:27, Honnappa Nagarahalli wrote: > > >> On Apr 23, 2024, at 3:56 PM, Mattias Rönnblom <hofors@lysator.liu.se> wrote: >> >> On 2024-04-23 13:15, Ferruh Yigit wrote: >>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: >>>> Cache align Rx and Tx queue struct to avoid false sharing. >>>> >>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment >>>> makes no change there, but it does on 32-bit ISAs. >>>> >>>> TX struct is 56 bytes on x86_64. >>>> >>> Hi Mattias, >>> No objection to the patch. Is the improvement theoretical or do you >>> measure any improvement practically, if so how much is the improvement? >> >> I didn't run any benchmarks. >> >> Two cores storing to a (falsely) shared cache line on a per-packet basis is going to be very expensive, at least for "light touch" applications. >> >>>> Both structs keep counters, and in the RX case they are updated even >>>> for empty polls. >>>> >>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within >>> the loop? >> >> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes are declared volatile, so you are forcing a load-modify-store cycle for every increment. >> >> I would drop "volatile", or replace it with an atomic (although *not* use an atomic add for incrementing, but rather atomic load + <n> non-atomic adds + atomic store). > (Slightly unrelated discussion) > Does the atomic load + increment + atomic store help in a non-contended case like this? Some platforms have optimizations for atomic-increments as well which would be missed. > Is it "far atomics" you have in mind? 
A C11-compliant compiler won't generate STADD (even for relaxed type ops), since it doesn't fit well into the C11 memory model. In particular, the issue is that STADD isn't ordered by the instructions generated by C11 fences, from what I understand. GCC did generate STADD for a while, until this issue was discovered. On x86_64, ADD <addr> is generally much faster than LOCK ADD <addr>. No wonder, since the latter is a full barrier. If I'd worked for ARM I would probably have proposed some API extension to the DPDK atomics API to allow the use of STADD (through inline assembler). One way to integrate it might be to add a new memory model, rte_memory_order_unordered or rte_memory_order_jumps_all_fences, where loads and stores are totally unordered. However, in this particular case, it's a single-writer scenario, so rte_memory_order_kangaroo wouldn't really help, since it would be equivalent to rte_memory_order_relaxed on non-far atomics-capable ISAs, which would be too expensive. In the past I've argued for a single-writer version of atomic add/sub etc in the DPDK atomics API (whatever that is), for convenience and to make the issue/pattern known. >> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> >>>> --- >>>> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c >>>> index 397a32db58..28aeb7d08e 100644 >>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>>> @@ -6,6 +6,7 @@ >>>> * All rights reserved.
>>>> */ >>>> +#include <rte_common.h> >>>> #include <rte_string_fns.h> >>>> #include <rte_mbuf.h> >>>> #include <ethdev_driver.h> >>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >>>> volatile unsigned long rx_pkts; >>>> volatile unsigned long rx_bytes; >>>> -}; >>>> +} __rte_cache_aligned; >>>> >>> Latest location for '__rte_cache_aligned' tag is at the beginning of the >>> struct [1], so something like: >>> `struct __rte_cache_aligned pkt_rx_queue {` >>> [1] >>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-23 20:56 ` Mattias Rönnblom 2024-04-24 0:27 ` Honnappa Nagarahalli @ 2024-04-24 10:21 ` Ferruh Yigit 2024-04-24 10:28 ` Bruce Richardson 2024-04-24 11:57 ` Mattias Rönnblom 1 sibling, 2 replies; 32+ messages in thread From: Ferruh Yigit @ 2024-04-24 10:21 UTC (permalink / raw) To: Mattias Rönnblom, Mattias Rönnblom, John W . Linville Cc: dev, Tyler Retzlaff, Honnappa Nagarahalli On 4/23/2024 9:56 PM, Mattias Rönnblom wrote: > On 2024-04-23 13:15, Ferruh Yigit wrote: >> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: >>> Cache align Rx and Tx queue struct to avoid false sharing. >>> >>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment >>> makes no change there, but it does on 32-bit ISAs. >>> >>> TX struct is 56 bytes on x86_64. >>> >> >> Hi Mattias, >> >> No objection to the patch. Is the improvement theoretical or do you >> measure any improvement practically, if so how much is the improvement? >> > > I didn't run any benchmarks. > > Two cores storing to a (falsely) shared cache line on a per-packet basis > is going to be very expensive, at least for "light touch" applications. > ack I expect for af_packet bottleneck is the kernel side, so I don't expect any visible improvement practically, but OK to fix this theoretical issue. >>> Both structs keep counters, and in the RX case they are updated even >>> for empty polls. >>> >> >> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within >> the loop? >> > > No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes > are declared volatile, so you are forcing a load-modify-store cycle for > every increment. > My intention was to prevent updating stats in empty polls, I thought stats will be hot in the cache but won't work with volatile. > I would drop "volatile", or replace it with an atomic (although *not* > use an atomic add for incrementing, but rather atomic load + <n> > non-atomic adds + atomic store). 
> As only single core polls from an Rx queue, I assume atomics is required only for stats reset case. And for that case won't we need atomic add? And do you know if there is a performance difference between atomic add and keeping volatile qualifier? >>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> >>> --- >>> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c >>> b/drivers/net/af_packet/rte_eth_af_packet.c >>> index 397a32db58..28aeb7d08e 100644 >>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>> @@ -6,6 +6,7 @@ >>> * All rights reserved. >>> */ >>> +#include <rte_common.h> >>> #include <rte_string_fns.h> >>> #include <rte_mbuf.h> >>> #include <ethdev_driver.h> >>> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >>> volatile unsigned long rx_pkts; >>> volatile unsigned long rx_bytes; >>> -}; >>> +} __rte_cache_aligned; >>> >> >> Latest location for '__rte_cache_aligned' tag is at the beginning of the >> struct [1], so something like: >> `struct __rte_cache_aligned pkt_rx_queue {` >> >> [1] >> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-24 10:21 ` Ferruh Yigit @ 2024-04-24 10:28 ` Bruce Richardson 2024-04-24 18:02 ` Ferruh Yigit 2024-04-24 11:57 ` Mattias Rönnblom 1 sibling, 1 reply; 32+ messages in thread From: Bruce Richardson @ 2024-04-24 10:28 UTC (permalink / raw) To: Ferruh Yigit Cc: Mattias Rönnblom, Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, Honnappa Nagarahalli On Wed, Apr 24, 2024 at 11:21:52AM +0100, Ferruh Yigit wrote: > On 4/23/2024 9:56 PM, Mattias Rönnblom wrote: > > On 2024-04-23 13:15, Ferruh Yigit wrote: > >> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: > >>> Cache align Rx and Tx queue struct to avoid false sharing. > >>> > >>> RX struct happens to be 64 bytes on x86_64 already, so cache > >>> alignment makes no change there, but it does on 32-bit ISAs. > >>> > >>> TX struct is 56 bytes on x86_64. > >>> > >> > >> Hi Mattias, > >> > >> No objection to the patch. Is the improvement theoretical or do you > >> measure any improvement practically, if so how much is the > >> improvement? > >> > > > > I didn't run any benchmarks. > > > > Two cores storing to a (falsely) shared cache line on a per-packet > > basis is going to be very expensive, at least for "light touch" > > applications. > > > > ack I expect for af_packet bottleneck is the kernel side, so I don't > expect any visible improvement practically, but OK to fix this > theoretical issue. > > >>> Both structs keep counters, and in the RX case they are updated even > >>> for empty polls. > >>> > >> > >> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within > >> the loop? > >> > > > > No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes > > are declared volatile, so you are forcing a load-modify-store cycle for > > every increment. > > > > My intention was to prevent updating stats in empty polls, I thought > stats will be hot in the cache but won't work with volatile. > Yes, it will. 
Volatile only prevents caching in registers, it does not affect the storing of data within the cache hierarchy. Reads/writes of stats on empty polls should indeed hit the L1 as expected. However, that is still less efficient than just doing a register increment which could theoretically be the result without the volatile. /Bruce
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-24 10:28 ` Bruce Richardson @ 2024-04-24 18:02 ` Ferruh Yigit 0 siblings, 0 replies; 32+ messages in thread From: Ferruh Yigit @ 2024-04-24 18:02 UTC (permalink / raw) To: Bruce Richardson Cc: Mattias Rönnblom, Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, Honnappa Nagarahalli On 4/24/2024 11:28 AM, Bruce Richardson wrote: > On Wed, Apr 24, 2024 at 11:21:52AM +0100, Ferruh Yigit wrote: >> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote: >>> On 2024-04-23 13:15, Ferruh Yigit wrote: >>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: >>>>> Cache align Rx and Tx queue struct to avoid false sharing. >>>>> >>>>> RX struct happens to be 64 bytes on x86_64 already, so cache >>>>> alignment makes no change there, but it does on 32-bit ISAs. >>>>> >>>>> TX struct is 56 bytes on x86_64. >>>>> >>>> >>>> Hi Mattias, >>>> >>>> No objection to the patch. Is the improvement theoretical or do you >>>> measure any improvement practically, if so how much is the >>>> improvement? >>>> >>> >>> I didn't run any benchmarks. >>> >>> Two cores storing to a (falsely) shared cache line on a per-packet >>> basis is going to be very expensive, at least for "light touch" >>> applications. >>> >> >> ack I expect for af_packet bottleneck is the kernel side, so I don't >> expect any visible improvement practically, but OK to fix this >> theoretical issue. >> >>>>> Both structs keep counters, and in the RX case they are updated even >>>>> for empty polls. >>>>> >>>> >>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within >>>> the loop? >>>> >>> >>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes >>> are declared volatile, so you are forcing a load-modify-store cycle for >>> every increment. >>> >> >> My intention was to prevent updating stats in empty polls, I thought >> stats will be hot in the cache but won't work with volatile. >> > Yes, it will. 
Volatile only prevents caching in registers, it does not > affect the storing of data within the cache hierarchy. Reads/writes of > stats on empty polls should indeed hit the L1 as expected. However, that is > still less efficient than just doing a register increment which could > theoretically be the result without the volatile. > > Thanks Bruce for clarification.
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-24 10:21 ` Ferruh Yigit 2024-04-24 10:28 ` Bruce Richardson @ 2024-04-24 11:57 ` Mattias Rönnblom 2024-04-24 17:50 ` Ferruh Yigit 1 sibling, 1 reply; 32+ messages in thread From: Mattias Rönnblom @ 2024-04-24 11:57 UTC (permalink / raw) To: Ferruh Yigit, Mattias Rönnblom, John W . Linville Cc: dev, Tyler Retzlaff, Honnappa Nagarahalli On 2024-04-24 12:21, Ferruh Yigit wrote: > On 4/23/2024 9:56 PM, Mattias Rönnblom wrote: >> On 2024-04-23 13:15, Ferruh Yigit wrote: >>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: >>>> Cache align Rx and Tx queue struct to avoid false sharing. >>>> >>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment >>>> makes no change there, but it does on 32-bit ISAs. >>>> >>>> TX struct is 56 bytes on x86_64. >>>> >>> >>> Hi Mattias, >>> >>> No objection to the patch. Is the improvement theoretical or do you >>> measure any improvement practically, if so how much is the improvement? >>> >> >> I didn't run any benchmarks. >> >> Two cores storing to a (falsely) shared cache line on a per-packet basis >> is going to be very expensive, at least for "light touch" applications. >> > > ack > I expect for af_packet bottleneck is the kernel side, so I don't expect > any visible improvement practically, but OK to fix this theoretical issue. > If you use af_packet as a slow path interface, you may well poll it often. If you are unlucky and aren't on an ISA where the RX struct *by accident* is cache-line aligned, you will suffer greatly from false sharing, because the counters are written to regardless if there are packets polled or not. If you don't care about synchronization overhead, why even support multi-queue in the af_packet driver!? >>>> Both structs keep counters, and in the RX case they are updated even >>>> for empty polls. >>>> >>> >>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within >>> the loop? >>> >> >> No, why? 
Wouldn't that be worse? Especially since rx_pkts and rx_bytes >> are declared volatile, so you are forcing a load-modify-store cycle for >> every increment. >> > > My intention was to prevent updating stats in empty polls, I thought > stats will be hot in the cache but won't work with volatile. > > >> I would drop "volatile", or replace it with an atomic (although *not* >> use an atomic add for incrementing, but rather atomic load + <n> >> non-atomic adds + atomic store). >> > > As only single core polls from an Rx queue, I assume atomics is required > only for stats reset case. And for that case won't we need atomic add? > Correct me if I'm wrong here, but my impression is that statistics reset tend to be best-effort in DPDK. Or, expressed in a different way, are not thread-safe. If you want to support MT safe counter reset, the counter increment operation must be an atomic add, since now you have two writers. (Unless you do something fancy and keep only a new offset upon reset, and never actually zero the counter.) > And do you know if there is a performance difference between atomic add > and keeping volatile qualifier? > > I don't know how slow af_packet is, but if you care about performance, you don't want to use atomic add for statistics. volatile in this case it's not a big issue, performance-wise, I would expect. >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> >>>> --- >>>> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c >>>> b/drivers/net/af_packet/rte_eth_af_packet.c >>>> index 397a32db58..28aeb7d08e 100644 >>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>>> @@ -6,6 +6,7 @@ >>>> * All rights reserved. 
>>>> */ >>>> +#include <rte_common.h> >>>> #include <rte_string_fns.h> >>>> #include <rte_mbuf.h> >>>> #include <ethdev_driver.h> >>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >>>> volatile unsigned long rx_pkts; >>>> volatile unsigned long rx_bytes; >>>> -}; >>>> +} __rte_cache_aligned; >>>> >>> >>> Latest location for '__rte_cache_aligned' tag is at the beginning of the >>> struct [1], so something like: >>> `struct __rte_cache_aligned pkt_rx_queue {` >>> >>> [1] >>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-24 11:57 ` Mattias Rönnblom @ 2024-04-24 17:50 ` Ferruh Yigit 2024-04-24 19:13 ` Stephen Hemminger 0 siblings, 1 reply; 32+ messages in thread From: Ferruh Yigit @ 2024-04-24 17:50 UTC (permalink / raw) To: Mattias Rönnblom, Mattias Rönnblom, John W . Linville Cc: dev, Tyler Retzlaff, Honnappa Nagarahalli On 4/24/2024 12:57 PM, Mattias Rönnblom wrote: > On 2024-04-24 12:21, Ferruh Yigit wrote: >> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote: >>> On 2024-04-23 13:15, Ferruh Yigit wrote: >>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: >>>>> Cache align Rx and Tx queue struct to avoid false sharing. >>>>> >>>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment >>>>> makes no change there, but it does on 32-bit ISAs. >>>>> >>>>> TX struct is 56 bytes on x86_64. >>>>> >>>> >>>> Hi Mattias, >>>> >>>> No objection to the patch. Is the improvement theoretical or do you >>>> measure any improvement practically, if so how much is the improvement? >>>> >>> >>> I didn't run any benchmarks. >>> >>> Two cores storing to a (falsely) shared cache line on a per-packet basis >>> is going to be very expensive, at least for "light touch" applications. >>> >> >> ack >> I expect for af_packet bottleneck is the kernel side, so I don't expect >> any visible improvement practically, but OK to fix this theoretical >> issue. >> > > If you use af_packet as a slow path interface, you may well poll it > often. If you are unlucky and aren't on an ISA where the RX struct *by > accident* is cache-line aligned, you will suffer greatly from false > sharing, because the counters are written to regardless if there are > packets polled or not. > > If you don't care about synchronization overhead, why even support > multi-queue in the af_packet driver!? > We care, only I expect no measurable performance improvement and was asking if you have data for it. 
>>>>> Both structs keep counters, and in the RX case they are updated even >>>>> for empty polls. >>>>> >>>> >>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within >>>> the loop? >>>> >>> >>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes >>> are declared volatile, so you are forcing a load-modify-store cycle for >>> every increment. >>> >> >> My intention was to prevent updating stats in empty polls, I thought >> stats will be hot in the cache but won't work with volatile. >> >> >>> I would drop "volatile", or replace it with an atomic (although *not* >>> use an atomic add for incrementing, but rather atomic load + <n> >>> non-atomic adds + atomic store). >>> >> >> As only single core polls from an Rx queue, I assume atomics is required >> only for stats reset case. And for that case won't we need atomic add? >> > > Correct me if I'm wrong here, but my impression is that statistics reset > tend to be best-effort in DPDK. Or, expressed in a different way, are > not thread-safe. > At least I don't see this is something documented. For cases HW tracks stats we don't have this problem. Or some drivers use offset mechanism you mentioned below. But I can see some soft drivers will have this problem with stats clear. > If you want to support MT safe counter reset, the counter increment > operation must be an atomic add, since now you have two writers. (Unless > you do something fancy and keep only a new offset upon reset, and never > actually zero the counter.) > Asking to learn, if we ignore the stats reset case and assume stats only incremented, single writer case, do we still need either volatile qualifier or atomic operations at all? >> And do you know if there is a performance difference between atomic add >> and keeping volatile qualifier? >> >> > > I don't know how slow af_packet is, but if you care about performance, > you don't want to use atomic add for statistics. 
> There are a few soft drivers already using atomics adds for updating stats. If we document expectations from 'rte_eth_stats_reset()', we can update those usages. > volatile in this case it's not a big issue, performance-wise, I would > expect. > >>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> >>>>> --- >>>>> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c >>>>> b/drivers/net/af_packet/rte_eth_af_packet.c >>>>> index 397a32db58..28aeb7d08e 100644 >>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>>>> @@ -6,6 +6,7 @@ >>>>> * All rights reserved. >>>>> */ >>>>> +#include <rte_common.h> >>>>> #include <rte_string_fns.h> >>>>> #include <rte_mbuf.h> >>>>> #include <ethdev_driver.h> >>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >>>>> volatile unsigned long rx_pkts; >>>>> volatile unsigned long rx_bytes; >>>>> -}; >>>>> +} __rte_cache_aligned; >>>>> >>>> >>>> Latest location for '__rte_cache_aligned' tag is at the beginning of >>>> the >>>> struct [1], so something like: >>>> `struct __rte_cache_aligned pkt_rx_queue {` >>>> >>>> [1] >>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both >> ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-24 17:50 ` Ferruh Yigit @ 2024-04-24 19:13 ` Stephen Hemminger 2024-04-24 22:27 ` Mattias Rönnblom 0 siblings, 1 reply; 32+ messages in thread From: Stephen Hemminger @ 2024-04-24 19:13 UTC (permalink / raw) To: Ferruh Yigit Cc: Mattias Rönnblom, Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, Honnappa Nagarahalli On Wed, 24 Apr 2024 18:50:50 +0100 Ferruh Yigit <ferruh.yigit@amd.com> wrote: > > I don't know how slow af_packet is, but if you care about performance, > > you don't want to use atomic add for statistics. > > > > There are a few soft drivers already using atomics adds for updating stats. > If we document expectations from 'rte_eth_stats_reset()', we can update > those usages. Using atomic add is lots of extra overhead. The statistics are not guaranteed to be perfect. If nothing else, the bytes and packets can be skewed. The soft drivers af_xdp, af_packet, and tun performance is dominated by the overhead of the kernel system call and copies. Yes, alignment is good but won't be noticeable.
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-24 19:13 ` Stephen Hemminger @ 2024-04-24 22:27 ` Mattias Rönnblom 2024-04-24 23:55 ` Stephen Hemminger 0 siblings, 1 reply; 32+ messages in thread From: Mattias Rönnblom @ 2024-04-24 22:27 UTC (permalink / raw) To: Stephen Hemminger, Ferruh Yigit Cc: Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, Honnappa Nagarahalli On 2024-04-24 21:13, Stephen Hemminger wrote: > On Wed, 24 Apr 2024 18:50:50 +0100 > Ferruh Yigit <ferruh.yigit@amd.com> wrote: > >>> I don't know how slow af_packet is, but if you care about performance, >>> you don't want to use atomic add for statistics. >>> >> >> There are a few soft drivers already using atomics adds for updating stats. >> If we document expectations from 'rte_eth_stats_reset()', we can update >> those usages. > > Using atomic add is lots of extra overhead. The statistics are not guaranteed > to be perfect. If nothing else, the bytes and packets can be skewed. > The sad thing here is that in case the counters are reset within the load-modify-store cycle of the lcore counter update, the reset may end up being a nop. So, it's not like you missed a packet or two, or suffer some transient inconsistency, but you completed and permanently ignored the reset request. > The soft drivers af_xdp, af_packet, and tun performance is dominated by the > overhead of the kernel system call and copies. Yes, alignment is good > but won't be noticeable. There aren't any syscalls in the RX path in the af_packet PMD. I added the same statistics updates as the af_packet PMD uses into a benchmark app which consumes ~1000 cc in-between stats updates. If the equivalent of the RX queue struct was cache aligned, the statistics overhead was so small it was difficult to measure. Less than 3-4 cc per update. This was with volatile, but without atomics. 
If the RX queue struct wasn't cache aligned, and sized so a cache line generally was used by two (neighboring) cores, the stats incurred a cost of ~55 cc per update. Shaving off 55 cc should translate to a couple of hundred percent increased performance for an empty af_packet poll. If your lcore has some other primary source of work than the af_packet RX queue, and the RX queue is polled often, then this may well be a noticeable gain. The benchmark was run on 16 Gracemont cores, which in my experience seems to have a little shorter core-to-core latency than many other systems, provided the remote core/cache line owner is located in the same cluster. ^ permalink raw reply [flat|nested] 32+ messages in thread
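The mechanism behind the numbers above can be sketched outside DPDK with plain C11 alignment. This is an illustrative toy, not the PMD's actual definitions; the 64-byte line size and the struct layout are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; DPDK uses RTE_CACHE_LINE_SIZE. */
#define CACHE_LINE_SIZE 64

/* Simplified stand-in for a per-queue struct. Aligning the first
 * member raises the whole struct's alignment, and thus its padded
 * size, to a full cache line, which is what __rte_cache_aligned
 * achieves in the patch under discussion. */
struct rx_queue {
	_Alignas(CACHE_LINE_SIZE) int sockfd;
	volatile unsigned long rx_pkts;
	volatile unsigned long rx_bytes;
};
```

With this layout, two queues allocated back-to-back (e.g. in an array) start on different cache lines, so one lcore's stats updates cannot invalidate the line holding a neighboring queue's counters.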
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-24 22:27 ` Mattias Rönnblom @ 2024-04-24 23:55 ` Stephen Hemminger 2024-04-25 9:26 ` Mattias Rönnblom 0 siblings, 1 reply; 32+ messages in thread From: Stephen Hemminger @ 2024-04-24 23:55 UTC (permalink / raw) To: Mattias Rönnblom Cc: Ferruh Yigit, Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, Honnappa Nagarahalli On Thu, 25 Apr 2024 00:27:36 +0200 Mattias Rönnblom <hofors@lysator.liu.se> wrote: > On 2024-04-24 21:13, Stephen Hemminger wrote: > > On Wed, 24 Apr 2024 18:50:50 +0100 > > Ferruh Yigit <ferruh.yigit@amd.com> wrote: > > > >>> I don't know how slow af_packet is, but if you care about performance, > >>> you don't want to use atomic add for statistics. > >>> > >> > >> There are a few soft drivers already using atomics adds for updating stats. > >> If we document expectations from 'rte_eth_stats_reset()', we can update > >> those usages. > > > > Using atomic add is lots of extra overhead. The statistics are not guaranteed > > to be perfect. If nothing else, the bytes and packets can be skewed. > > > > The sad thing here is that in case the counters are reset within the > load-modify-store cycle of the lcore counter update, the reset may end > up being a nop. So, it's not like you missed a packet or two, or suffer > some transient inconsistency, but you completed and permanently ignored > the reset request. That is one of the many reasons the Linux kernel intentionally did not implement a reset statistics operation. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-24 23:55 ` Stephen Hemminger @ 2024-04-25 9:26 ` Mattias Rönnblom 2024-04-25 9:49 ` Morten Brørup 2024-04-25 14:04 ` Ferruh Yigit 0 siblings, 2 replies; 32+ messages in thread From: Mattias Rönnblom @ 2024-04-25 9:26 UTC (permalink / raw) To: Stephen Hemminger Cc: Ferruh Yigit, Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, Honnappa Nagarahalli On 2024-04-25 01:55, Stephen Hemminger wrote: > On Thu, 25 Apr 2024 00:27:36 +0200 > Mattias Rönnblom <hofors@lysator.liu.se> wrote: > >> On 2024-04-24 21:13, Stephen Hemminger wrote: >>> On Wed, 24 Apr 2024 18:50:50 +0100 >>> Ferruh Yigit <ferruh.yigit@amd.com> wrote: >>> >>>>> I don't know how slow af_packet is, but if you care about performance, >>>>> you don't want to use atomic add for statistics. >>>>> >>>> >>>> There are a few soft drivers already using atomics adds for updating stats. >>>> If we document expectations from 'rte_eth_stats_reset()', we can update >>>> those usages. >>> >>> Using atomic add is lots of extra overhead. The statistics are not guaranteed >>> to be perfect. If nothing else, the bytes and packets can be skewed. >>> >> >> The sad thing here is that in case the counters are reset within the >> load-modify-store cycle of the lcore counter update, the reset may end >> up being a nop. So, it's not like you missed a packet or two, or suffer >> some transient inconsistency, but you completed and permanently ignored >> the reset request. > > That is one of the many reasons the Linux kernel intentionally did not > implement a reset statistics operation. DPDK should deprecate statistics reset, it seems to me. The current API is broken anyway, if you care about correctness. A reset function must return the current state of the counters, at the point of them being reset. Otherwise, a higher layer may miss counter updates. The af_packet PMD should return -ENOTSUP on reset (which is allowed by the API). 
Maintaining an offset-since-last-reset for counters is a control plane thing, and shouldn't be in PMDs. (If an MT-safe reset for SW-managed counters is to be expected from the PMDs, we should have some helper API to facilitate its efficient & correct-enough implementation.) ^ permalink raw reply [flat|nested] 32+ messages in thread
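One possible shape of the helper hinted at in the parenthesis above, sketched with C11 atomics. This is hypothetical, not an existing DPDK API; it assumes a single datapath writer per counter, keeps the reset offset entirely in the control plane, and returns the value at the point of reset so no updates can be lost:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter: 'total' is only ever written by the one
 * datapath lcore owning the queue; 'offset' is only touched by the
 * control plane. Reset never writes 'total', so it can never be
 * undone by a concurrent load-modify-store in the datapath. */
struct sw_counter {
	_Atomic uint64_t total;  /* datapath-owned */
	uint64_t offset;         /* control-plane-owned */
};

/* Datapath: single writer, so a relaxed non-RMW update suffices. */
static inline void
sw_counter_add(struct sw_counter *c, uint64_t n)
{
	atomic_store_explicit(&c->total,
	    atomic_load_explicit(&c->total, memory_order_relaxed) + n,
	    memory_order_relaxed);
}

/* Control plane: the count accumulated since the last reset. */
static inline uint64_t
sw_counter_read(struct sw_counter *c)
{
	return atomic_load_explicit(&c->total, memory_order_relaxed) -
	    c->offset;
}

/* Control plane: reset by recording the current total rather than
 * zeroing it, and return the value at the point of reset. */
static inline uint64_t
sw_counter_reset(struct sw_counter *c)
{
	uint64_t cur = atomic_load_explicit(&c->total, memory_order_relaxed);
	uint64_t since_reset = cur - c->offset;

	c->offset = cur;
	return since_reset;
}
```

Because the datapath never sees 'offset', the fast path stays a plain load + add + store, while reset remains reliable from the control plane's point of view.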
* RE: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-25 9:26 ` Mattias Rönnblom @ 2024-04-25 9:49 ` Morten Brørup 2024-04-25 14:04 ` Ferruh Yigit 1 sibling, 0 replies; 32+ messages in thread From: Morten Brørup @ 2024-04-25 9:49 UTC (permalink / raw) To: Mattias Rönnblom, Stephen Hemminger Cc: Ferruh Yigit, Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, Honnappa Nagarahalli > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se] > Sent: Thursday, 25 April 2024 11.26 > > On 2024-04-25 01:55, Stephen Hemminger wrote: > > On Thu, 25 Apr 2024 00:27:36 +0200 > > Mattias Rönnblom <hofors@lysator.liu.se> wrote: > > > >> On 2024-04-24 21:13, Stephen Hemminger wrote: > >>> On Wed, 24 Apr 2024 18:50:50 +0100 > >>> Ferruh Yigit <ferruh.yigit@amd.com> wrote: > >>> > >>>>> I don't know how slow af_packet is, but if you care about performance, > >>>>> you don't want to use atomic add for statistics. > >>>>> > >>>> > >>>> There are a few soft drivers already using atomics adds for updating > stats. > >>>> If we document expectations from 'rte_eth_stats_reset()', we can update > >>>> those usages. > >>> > >>> Using atomic add is lots of extra overhead. The statistics are not > guaranteed > >>> to be perfect. If nothing else, the bytes and packets can be skewed. > >>> > >> > >> The sad thing here is that in case the counters are reset within the > >> load-modify-store cycle of the lcore counter update, the reset may end > >> up being a nop. So, it's not like you missed a packet or two, or suffer > >> some transient inconsistency, but you completed and permanently ignored > >> the reset request. > > > > That is one of the many reasons the Linux kernel intentionally did not > > implement a reset statistics operation. > > DPDK should deprecate statistics reset, it seems to me. +1 > > The current API is broken anyway, if you care about correctness. A reset > function must return the current state of the counters, at the point > them being reset. 
Otherwise, a higher layer may miss counter updates. > > The af_packet PMD should return -ENOTSUP on reset (which is allowed by > the API). > > Maintaining an offset-since-last-reset for counters is a control plane > thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed > counters are to be expected from the PMDs, we should have some helper > API to facilitate its efficient & correct-enough implementation.) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-25 9:26 ` Mattias Rönnblom 2024-04-25 9:49 ` Morten Brørup @ 2024-04-25 14:04 ` Ferruh Yigit 2024-04-25 15:06 ` Mattias Rönnblom 2024-04-25 15:07 ` Stephen Hemminger 1 sibling, 2 replies; 32+ messages in thread From: Ferruh Yigit @ 2024-04-25 14:04 UTC (permalink / raw) To: Mattias Rönnblom, Stephen Hemminger Cc: Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, Honnappa Nagarahalli, Morten Brørup On 4/25/2024 10:26 AM, Mattias Rönnblom wrote: > On 2024-04-25 01:55, Stephen Hemminger wrote: >> On Thu, 25 Apr 2024 00:27:36 +0200 >> Mattias Rönnblom <hofors@lysator.liu.se> wrote: >> >>> On 2024-04-24 21:13, Stephen Hemminger wrote: >>>> On Wed, 24 Apr 2024 18:50:50 +0100 >>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote: >>>> >>>>>> I don't know how slow af_packet is, but if you care about >>>>>> performance, >>>>>> you don't want to use atomic add for statistics. >>>>>> >>>>> >>>>> There are a few soft drivers already using atomics adds for >>>>> updating stats. >>>>> If we document expectations from 'rte_eth_stats_reset()', we can >>>>> update >>>>> those usages. >>>> >>>> Using atomic add is lots of extra overhead. The statistics are not >>>> guaranteed >>>> to be perfect. If nothing else, the bytes and packets can be skewed. >>>> >>> >>> The sad thing here is that in case the counters are reset within the >>> load-modify-store cycle of the lcore counter update, the reset may end >>> up being a nop. So, it's not like you missed a packet or two, or suffer >>> some transient inconsistency, but you completed and permanently ignored >>> the reset request. >> >> That is one of the many reasons the Linux kernel intentionally did not >> implement a reset statistics operation. > > DPDK should deprecate statistics reset, it seems to me. > > The current API is broken anyway, if you care about correctness. A reset > function must return the current state of the counters, at the point > them being reset. 
Otherwise, a higher layer may miss counter updates. > > The af_packet PMD should return -ENOTSUP on reset (which is allowed by > the API). > > Maintaining an offset-since-last-reset for counters is a control plane > thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed > counters are to be expected from the PMDs, we should have some helper > API to facilitate its efficient & correct-enough implementation.) > Statistics reset works for HW devices; instead of removing statistics reset, I am for documenting in the API that it may not be reliable. I can send a patch for it. With the above change, we can be more relaxed on stats updates, especially for soft drivers, and can convert atomic_add stats updates to "atomic load + add + atomic store". Does this plan make sense? ^ permalink raw reply [flat|nested] 32+ messages in thread
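The two update styles being compared can be sketched with C11 atomics. The names are illustrative, not the PMD's; the single-writer assumption is what makes the second variant safe:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Locked RMW: required only if multiple writers update the counter.
 * On x86 this compiles to a lock-prefixed add. */
static inline void
stat_add_rmw(_Atomic uint64_t *stat, uint64_t n)
{
	atomic_fetch_add_explicit(stat, n, memory_order_relaxed);
}

/* "atomic load + add + atomic store": no locked instruction, just
 * tear-free loads and stores. Correct with a single writer; a
 * concurrent writer (such as a zeroing reset) may have its write
 * lost, which is the trade-off raised earlier in this thread. */
static inline void
stat_add_single_writer(_Atomic uint64_t *stat, uint64_t n)
{
	atomic_store_explicit(stat,
	    atomic_load_explicit(stat, memory_order_relaxed) + n,
	    memory_order_relaxed);
}
```

Readers on other cores still see only whole values with either variant; what differs is the cost on the writer side and what happens when a second writer appears.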
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-25 14:04 ` Ferruh Yigit @ 2024-04-25 15:06 ` Mattias Rönnblom 2024-04-25 16:21 ` Ferruh Yigit 2024-04-25 15:07 ` Stephen Hemminger 1 sibling, 1 reply; 32+ messages in thread From: Mattias Rönnblom @ 2024-04-25 15:06 UTC (permalink / raw) To: Ferruh Yigit, Stephen Hemminger Cc: Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, Honnappa Nagarahalli, Morten Brørup On 2024-04-25 16:04, Ferruh Yigit wrote: > On 4/25/2024 10:26 AM, Mattias Rönnblom wrote: >> On 2024-04-25 01:55, Stephen Hemminger wrote: >>> On Thu, 25 Apr 2024 00:27:36 +0200 >>> Mattias Rönnblom <hofors@lysator.liu.se> wrote: >>> >>>> On 2024-04-24 21:13, Stephen Hemminger wrote: >>>>> On Wed, 24 Apr 2024 18:50:50 +0100 >>>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote: >>>>> >>>>>>> I don't know how slow af_packet is, but if you care about >>>>>>> performance, >>>>>>> you don't want to use atomic add for statistics. >>>>>>> >>>>>> >>>>>> There are a few soft drivers already using atomics adds for >>>>>> updating stats. >>>>>> If we document expectations from 'rte_eth_stats_reset()', we can >>>>>> update >>>>>> those usages. >>>>> >>>>> Using atomic add is lots of extra overhead. The statistics are not >>>>> guaranteed >>>>> to be perfect. If nothing else, the bytes and packets can be skewed. >>>>> >>>> >>>> The sad thing here is that in case the counters are reset within the >>>> load-modify-store cycle of the lcore counter update, the reset may end >>>> up being a nop. So, it's not like you missed a packet or two, or suffer >>>> some transient inconsistency, but you completed and permanently ignored >>>> the reset request. >>> >>> That is one of the many reasons the Linux kernel intentionally did not >>> implement a reset statistics operation. >> >> DPDK should deprecate statistics reset, it seems to me. >> >> The current API is broken anyway, if you care about correctness. 
A reset >> function must return the current state of the counters, at the point >> them being reset. Otherwise, a higher layer may miss counter updates. >> >> The af_packet PMD should return -ENOTSUP on reset (which is allowed by >> the API). >> >> Maintaining an offset-since-last-reset for counters is a control plane >> thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed >> counters are to be expected from the PMDs, we should have some helper >> API to facilitate its efficient & correct-enough implementation.) >> > > statistics reset works for HW devices, instead of removing statics reset > I am for documenting API that it may be not reliable, I can send a patch > for it. > By API, you mean <rte_ethdev.h>? If rte_ethdev_stats_reset() sometimes resets the counters, and sometimes doesn't, it should also have a name that reflects those semantics. rte_ethdev_stats_reset_or_perhaps_not() rte_ethdev_stats_usually_reset() Rather than expecting the application to deal with unspecified non-determinism, it seems better to specify under which conditions the reset is reliable (i.e., it's not MT safe). A non-MT-safe reset will limit its usefulness though. Also, it will make having an MT safe reset in a PMD pretty useless, except if the app is programmed not toward the API, but toward some particular PMD. > With above change, we can be more relax on stats update specially for > soft drivers, and can convert atomic_add stats updates to "atomic load + > add + atomic store". > > Does this plan make sense? Not really. Short-term, the -ENOTSUP seems like the best option. The second-best option is to implement a proper MT safe reset. What is unfortunate is that the API is silent on MT safety. I've *assumed* that many users will have assumed it MT safe, but there's nothing in the documentation to support that. 
rte_event_dev_xstats_reset() and some of the event devs have the same problem. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-25 15:06 ` Mattias Rönnblom @ 2024-04-25 16:21 ` Ferruh Yigit 0 siblings, 0 replies; 32+ messages in thread From: Ferruh Yigit @ 2024-04-25 16:21 UTC (permalink / raw) To: Mattias Rönnblom, Stephen Hemminger Cc: Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, Honnappa Nagarahalli, Morten Brørup On 4/25/2024 4:06 PM, Mattias Rönnblom wrote: > On 2024-04-25 16:04, Ferruh Yigit wrote: >> On 4/25/2024 10:26 AM, Mattias Rönnblom wrote: >>> On 2024-04-25 01:55, Stephen Hemminger wrote: >>>> On Thu, 25 Apr 2024 00:27:36 +0200 >>>> Mattias Rönnblom <hofors@lysator.liu.se> wrote: >>>> >>>>> On 2024-04-24 21:13, Stephen Hemminger wrote: >>>>>> On Wed, 24 Apr 2024 18:50:50 +0100 >>>>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote: >>>>>> >>>>>>>> I don't know how slow af_packet is, but if you care about >>>>>>>> performance, >>>>>>>> you don't want to use atomic add for statistics. >>>>>>>> >>>>>>> >>>>>>> There are a few soft drivers already using atomics adds for >>>>>>> updating stats. >>>>>>> If we document expectations from 'rte_eth_stats_reset()', we can >>>>>>> update >>>>>>> those usages. >>>>>> >>>>>> Using atomic add is lots of extra overhead. The statistics are not >>>>>> guaranteed >>>>>> to be perfect. If nothing else, the bytes and packets can be skewed. >>>>>> >>>>> >>>>> The sad thing here is that in case the counters are reset within the >>>>> load-modify-store cycle of the lcore counter update, the reset may end >>>>> up being a nop. So, it's not like you missed a packet or two, or >>>>> suffer >>>>> some transient inconsistency, but you completed and permanently >>>>> ignored >>>>> the reset request. >>>> >>>> That is one of the many reasons the Linux kernel intentionally did not >>>> implement a reset statistics operation. >>> >>> DPDK should deprecate statistics reset, it seems to me. >>> >>> The current API is broken anyway, if you care about correctness. 
A reset >>> function must return the current state of the counters, at the point >>> them being reset. Otherwise, a higher layer may miss counter updates. >>> >>> The af_packet PMD should return -ENOTSUP on reset (which is allowed by >>> the API). >>> >>> Maintaining an offset-since-last-reset for counters is a control plane >>> thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed >>> counters are to be expected from the PMDs, we should have some helper >>> API to facilitate its efficient & correct-enough implementation.) >>> >> >> statistics reset works for HW devices, instead of removing statics reset >> I am for documenting API that it may be not reliable, I can send a patch >> for it. >> > > With API you mean <rte_ethdev.h>? > > If rte_ethdev_stats_reset() sometimes reset the counters, and sometimes > doesn't, it should also have a name that reflect those semantics. > rte_ethdev_stats_reset_or_perhaps_not() > rte_ethdev_stats_usually_reset() > :) point taken > Rather than expecting the application to deal with unspecified > non-determinism is seems better to specify under which conditions the > reset is reliable (i.e., it's not MT safe). A non-MT-safe reset will > limit it usefulness though. Also, it will make having an MT safe reset > in a PMD pretty useless, except if the app is programmed not toward the > API, but toward some particular PMD. > >> With above change, we can be more relax on stats update specially for >> soft drivers, and can convert atomic_add stats updates to "atomic load + >> add + atomic store". >> >> Does this plan make sense? > > Not really. > > Short-term the -ENOTSUP seems like the best option. Second best option > is to implement a proper MT safe reset. > > What is unfortunate is that the API is silent on MT safety. I've > *assumed* that many users will have assumed it MT safe, but there's > nothing in the documentation to support that. 
Rather the opposite, the > generic section of DPDK MT safety only mentions PMD RX and TX functions. > > This issue is not limited to this PMD or even to ethdev. > rte_event_dev_xstats_reset() and some of the event devs have the same > problem. > True, and we can at least document the multi-thread safety of the APIs; this is a small effort. For a reliable stats reset, implementing the offset logic is not that hard, and it enables more optimized stats updates in the datapath. We can go with this option instead of an unreliable stats reset. I can add this at least for af_packet as a sample. As far as I can see, with this update we can get rid of the volatile qualifier and the atomic set for the stats variables. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-25 14:04 ` Ferruh Yigit 2024-04-25 15:06 ` Mattias Rönnblom @ 2024-04-25 15:07 ` Stephen Hemminger 1 sibling, 0 replies; 32+ messages in thread From: Stephen Hemminger @ 2024-04-25 15:07 UTC (permalink / raw) To: Ferruh Yigit Cc: Mattias Rönnblom, Mattias Rönnblom, John W . Linville, dev, Tyler Retzlaff, Honnappa Nagarahalli, Morten Brørup On Thu, 25 Apr 2024 15:04:27 +0100 Ferruh Yigit <ferruh.yigit@amd.com> wrote: > > Maintaining an offset-since-last-reset for counters is a control plane > > thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed > > counters are to be expected from the PMDs, we should have some helper > > API to facilitate its efficient & correct-enough implementation.) > > > > statistics reset works for HW devices, instead of removing statics reset > I am for documenting API that it may be not reliable, I can send a patch > for it. > > With above change, we can be more relax on stats update specially for > soft drivers, and can convert atomic_add stats updates to "atomic load + > add + atomic store". > > Does this plan make sense? +1 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-23 11:15 ` Ferruh Yigit 2024-04-23 20:56 ` Mattias Rönnblom @ 2024-04-25 14:08 ` Ferruh Yigit 2024-04-25 15:08 ` Mattias Rönnblom 1 sibling, 1 reply; 32+ messages in thread From: Ferruh Yigit @ 2024-04-25 14:08 UTC (permalink / raw) To: Mattias Rönnblom, John W . Linville; +Cc: dev, hofors, Tyler Retzlaff On 4/23/2024 12:15 PM, Ferruh Yigit wrote: >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> >> --- >> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c >> index 397a32db58..28aeb7d08e 100644 >> --- a/drivers/net/af_packet/rte_eth_af_packet.c >> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >> @@ -6,6 +6,7 @@ >> * All rights reserved. >> */ >> >> +#include <rte_common.h> >> #include <rte_string_fns.h> >> #include <rte_mbuf.h> >> #include <ethdev_driver.h> >> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >> >> volatile unsigned long rx_pkts; >> volatile unsigned long rx_bytes; >> -}; >> +} __rte_cache_aligned; >> > Latest location for '__rte_cache_aligned' tag is at the beginning of the > struct [1], so something like: > `struct __rte_cache_aligned pkt_rx_queue {` > > [1] > https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both > Hi Mattias, We dived into side discussions but with above change I can proceed with the patch. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-25 14:08 ` Ferruh Yigit @ 2024-04-25 15:08 ` Mattias Rönnblom 2024-04-25 15:35 ` Ferruh Yigit 0 siblings, 1 reply; 32+ messages in thread From: Mattias Rönnblom @ 2024-04-25 15:08 UTC (permalink / raw) To: Ferruh Yigit, Mattias Rönnblom, John W . Linville Cc: dev, Tyler Retzlaff On 2024-04-25 16:08, Ferruh Yigit wrote: > On 4/23/2024 12:15 PM, Ferruh Yigit wrote: >>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> >>> --- >>> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c >>> index 397a32db58..28aeb7d08e 100644 >>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>> @@ -6,6 +6,7 @@ >>> * All rights reserved. >>> */ >>> >>> +#include <rte_common.h> >>> #include <rte_string_fns.h> >>> #include <rte_mbuf.h> >>> #include <ethdev_driver.h> >>> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >>> >>> volatile unsigned long rx_pkts; >>> volatile unsigned long rx_bytes; >>> -}; >>> +} __rte_cache_aligned; >>> >> Latest location for '__rte_cache_aligned' tag is at the beginning of the >> struct [1], so something like: >> `struct __rte_cache_aligned pkt_rx_queue {` >> >> [1] >> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both >> > > Hi Mattias, > > We dived into side discussions but with above change I can proceed with > the patch. OK. Should this go into some stable branch as well? ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-25 15:08 ` Mattias Rönnblom @ 2024-04-25 15:35 ` Ferruh Yigit 2024-04-26 7:25 ` Mattias Rönnblom 0 siblings, 1 reply; 32+ messages in thread From: Ferruh Yigit @ 2024-04-25 15:35 UTC (permalink / raw) To: Mattias Rönnblom, Mattias Rönnblom, John W . Linville Cc: dev, Tyler Retzlaff On 4/25/2024 4:08 PM, Mattias Rönnblom wrote: > On 2024-04-25 16:08, Ferruh Yigit wrote: >> On 4/23/2024 12:15 PM, Ferruh Yigit wrote: >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> >>>> --- >>>> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c >>>> b/drivers/net/af_packet/rte_eth_af_packet.c >>>> index 397a32db58..28aeb7d08e 100644 >>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>>> @@ -6,6 +6,7 @@ >>>> * All rights reserved. >>>> */ >>>> +#include <rte_common.h> >>>> #include <rte_string_fns.h> >>>> #include <rte_mbuf.h> >>>> #include <ethdev_driver.h> >>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >>>> volatile unsigned long rx_pkts; >>>> volatile unsigned long rx_bytes; >>>> -}; >>>> +} __rte_cache_aligned; >>>> >>> Latest location for '__rte_cache_aligned' tag is at the beginning of the >>> struct [1], so something like: >>> `struct __rte_cache_aligned pkt_rx_queue {` >>> >>> [1] >>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both >>> >> >> Hi Mattias, >> >> We dived into side discussions but with above change I can proceed with >> the patch. > > OK. > > Should this go into some stable branch as well? > I don't see any reason for not merging to stable trees. For this please add fixes and stable tags. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-25 15:35 ` Ferruh Yigit @ 2024-04-26 7:25 ` Mattias Rönnblom 0 siblings, 0 replies; 32+ messages in thread From: Mattias Rönnblom @ 2024-04-26 7:25 UTC (permalink / raw) To: Ferruh Yigit, Mattias Rönnblom, John W . Linville Cc: dev, Tyler Retzlaff On 2024-04-25 17:35, Ferruh Yigit wrote: > On 4/25/2024 4:08 PM, Mattias Rönnblom wrote: >> On 2024-04-25 16:08, Ferruh Yigit wrote: >>> On 4/23/2024 12:15 PM, Ferruh Yigit wrote: >>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> >>>>> --- >>>>> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c >>>>> b/drivers/net/af_packet/rte_eth_af_packet.c >>>>> index 397a32db58..28aeb7d08e 100644 >>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>>>> @@ -6,6 +6,7 @@ >>>>> * All rights reserved. >>>>> */ >>>>> +#include <rte_common.h> >>>>> #include <rte_string_fns.h> >>>>> #include <rte_mbuf.h> >>>>> #include <ethdev_driver.h> >>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >>>>> volatile unsigned long rx_pkts; >>>>> volatile unsigned long rx_bytes; >>>>> -}; >>>>> +} __rte_cache_aligned; >>>>> >>>> Latest location for '__rte_cache_aligned' tag is at the beginning of the >>>> struct [1], so something like: >>>> `struct __rte_cache_aligned pkt_rx_queue {` >>>> >>>> [1] >>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both >>>> >>> >>> Hi Mattias, >>> >>> We dived into side discussions but with above change I can proceed with >>> the patch. >> >> OK. >> >> Should this go into some stable branch as well? >> > > I don't see any reason for not merging to stable trees. For this please > add fixes and stable tags. OK, I'll submit a v2. Thanks. ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-23 9:08 [PATCH] net/af_packet: cache align Rx/Tx structs Mattias Rönnblom 2024-04-23 11:15 ` Ferruh Yigit @ 2024-04-26 7:38 ` Mattias Rönnblom 2024-04-26 8:27 ` Ferruh Yigit 2024-04-26 9:05 ` [PATCH v3] " Mattias Rönnblom 2024-04-26 21:27 ` [PATCH] " Patrick Robb 2 siblings, 2 replies; 32+ messages in thread From: Mattias Rönnblom @ 2024-04-26 7:38 UTC (permalink / raw) To: dev Cc: hofors, Morten Brørup, Stephen Hemminger, Thomas Monjalon, John W . Linville, Ferruh Yigit, Mattias Rönnblom, stable Cache align Rx and Tx queue struct to avoid false sharing. RX struct happens to be 64 bytes on x86_64 already, so cache alignment has no effect there, but it does on 32-bit ISAs. TX struct is 56 bytes on x86_64. Both structs keep counters, and in the RX case they are updated even for empty polls. Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices") Cc: stable@dpdk.org Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 397a32db58..28aeb7d08e 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -6,6 +6,7 @@ * All rights reserved. */ +#include <rte_common.h> #include <rte_string_fns.h> #include <rte_mbuf.h> #include <ethdev_driver.h> @@ -53,7 +54,7 @@ struct pkt_rx_queue { volatile unsigned long rx_pkts; volatile unsigned long rx_bytes; -}; +} __rte_cache_aligned; struct pkt_tx_queue { int sockfd; @@ -67,7 +68,7 @@ struct pkt_tx_queue { volatile unsigned long tx_pkts; volatile unsigned long err_pkts; volatile unsigned long tx_bytes; -}; +} __rte_cache_aligned; struct pmd_internals { unsigned nb_queues; -- 2.34.1 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs 2024-04-26 7:38 ` Mattias Rönnblom @ 2024-04-26 8:27 ` Ferruh Yigit 2024-04-26 10:20 ` Mattias Rönnblom 2024-04-26 9:05 ` [PATCH v3] " Mattias Rönnblom 1 sibling, 1 reply; 32+ messages in thread From: Ferruh Yigit @ 2024-04-26 8:27 UTC (permalink / raw) To: Mattias Rönnblom, dev Cc: hofors, Morten Brørup, Stephen Hemminger, Thomas Monjalon, John W . Linville, stable On 4/26/2024 8:38 AM, Mattias Rönnblom wrote: > Cache align Rx and Tx queue struct to avoid false sharing. > > RX struct happens to be 64 bytes on x86_64 already, so cache alignment > has no effect there, but it does on 32-bit ISAs. > > TX struct is 56 bytes on x86_64. > > Both structs keep counters, and in the RX case they are updated even > for empty polls. > > Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices") > Cc: stable@dpdk.org > > Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com> > --- > drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c > index 397a32db58..28aeb7d08e 100644 > --- a/drivers/net/af_packet/rte_eth_af_packet.c > +++ b/drivers/net/af_packet/rte_eth_af_packet.c > @@ -6,6 +6,7 @@ > * All rights reserved. 
> */ > > +#include <rte_common.h> > #include <rte_string_fns.h> > #include <rte_mbuf.h> > #include <ethdev_driver.h> > @@ -53,7 +54,7 @@ struct pkt_rx_queue { > > volatile unsigned long rx_pkts; > volatile unsigned long rx_bytes; > -}; > +} __rte_cache_aligned; > > struct pkt_tx_queue { > int sockfd; > @@ -67,7 +68,7 @@ struct pkt_tx_queue { > volatile unsigned long tx_pkts; > volatile unsigned long err_pkts; > volatile unsigned long tx_bytes; > -}; > +} __rte_cache_aligned; > > struct pmd_internals { > unsigned nb_queues; Hi Mattias, I guess my comment was missed; the location of the '__rte_cache_aligned' tag changed for MSVC support, so it should be like: ``` struct __rte_cache_aligned pkt_rx_queue { ... }; ``` ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-26  8:27   ` Ferruh Yigit
@ 2024-04-26 10:20     ` Mattias Rönnblom
  0 siblings, 0 replies; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-26 10:20 UTC (permalink / raw)
To: Ferruh Yigit, Mattias Rönnblom, dev
Cc: Morten Brørup, Stephen Hemminger, Thomas Monjalon,
	John W . Linville, stable

On 2024-04-26 10:27, Ferruh Yigit wrote:
> On 4/26/2024 8:38 AM, Mattias Rönnblom wrote:
>> Cache align Rx and Tx queue struct to avoid false sharing.
>>
>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>> has no effect there, but it does on 32-bit ISAs.
>>
>> TX struct is 56 bytes on x86_64.
>>
>> Both structs keep counters, and in the RX case they are updated even
>> for empty polls.
>>
>> Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>  drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>> index 397a32db58..28aeb7d08e 100644
>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>> @@ -6,6 +6,7 @@
>>   * All rights reserved.
>>   */
>>
>> +#include <rte_common.h>
>>  #include <rte_string_fns.h>
>>  #include <rte_mbuf.h>
>>  #include <ethdev_driver.h>
>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>
>>  	volatile unsigned long rx_pkts;
>>  	volatile unsigned long rx_bytes;
>> -};
>> +} __rte_cache_aligned;
>>
>>  struct pkt_tx_queue {
>>  	int sockfd;
>> @@ -67,7 +68,7 @@ struct pkt_tx_queue {
>>  	volatile unsigned long tx_pkts;
>>  	volatile unsigned long err_pkts;
>>  	volatile unsigned long tx_bytes;
>> -};
>> +} __rte_cache_aligned;
>>
>>  struct pmd_internals {
>>  	unsigned nb_queues;
>
> Hi Mattias,
>
> I guess my comment was missed: the location of the '__rte_cache_aligned'
> tag changed for MSVC support, so it should be like:
>
> ```
> struct __rte_cache_aligned pkt_rx_queue {
> 	...
> };
> ```

Right. Sorry, forgot about that. I've made a v3 (which is actually
marked v3).

^ permalink raw reply	[flat|nested] 32+ messages in thread
* [PATCH v3] net/af_packet: cache align Rx/Tx structs
  2024-04-26  7:38 ` Mattias Rönnblom
  2024-04-26  8:27   ` Ferruh Yigit
@ 2024-04-26  9:05   ` Mattias Rönnblom
  2024-04-26  9:22     ` Morten Brørup
                        ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Mattias Rönnblom @ 2024-04-26 9:05 UTC (permalink / raw)
To: dev
Cc: hofors, Morten Brørup, Stephen Hemminger, Thomas Monjalon,
	John W . Linville, Ferruh Yigit, Mattias Rönnblom, stable

Cache align Rx and Tx queue struct to avoid false sharing.

The RX struct happens to be 64 bytes on x86_64 already, so cache
alignment has no effect there, but it does on 32-bit ISAs.

The TX struct is 56 bytes on x86_64.

Both structs keep counters, and in the RX case they are updated even
for empty polls.

v3: Move __rte_cache_aligned directive to a MSVC-compatible location.

Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
Cc: stable@dpdk.org

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 397a32db58..6b7b16f348 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -6,6 +6,7 @@
  * All rights reserved.
  */

+#include <rte_common.h>
 #include <rte_string_fns.h>
 #include <rte_mbuf.h>
 #include <ethdev_driver.h>
@@ -39,7 +40,7 @@
 #define DFLT_FRAME_SIZE		(1 << 11)
 #define DFLT_FRAME_COUNT	(1 << 9)

-struct pkt_rx_queue {
+struct __rte_cache_aligned pkt_rx_queue {
 	int sockfd;

 	struct iovec *rd;
@@ -55,7 +56,7 @@ struct pkt_rx_queue {
 	volatile unsigned long rx_bytes;
 };

-struct pkt_tx_queue {
+struct __rte_cache_aligned pkt_tx_queue {
 	int sockfd;

 	unsigned int frame_data_size;
--
2.34.1

^ permalink raw reply	[flat|nested] 32+ messages in thread
* RE: [PATCH v3] net/af_packet: cache align Rx/Tx structs
  2024-04-26  9:05 ` [PATCH v3] " Mattias Rönnblom
@ 2024-04-26  9:22   ` Morten Brørup
  0 siblings, 0 replies; 32+ messages in thread
From: Morten Brørup @ 2024-04-26 9:22 UTC (permalink / raw)
To: Mattias Rönnblom, dev
Cc: hofors, Stephen Hemminger, Thomas Monjalon, John W . Linville,
	Ferruh Yigit, stable

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Friday, 26 April 2024 11.05
>
> Cache align Rx and Tx queue struct to avoid false sharing.

Should an RTE_CACHE_GUARD be added at the end of these two structs?

If not,
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH v3] net/af_packet: cache align Rx/Tx structs
  2024-04-26  9:05 ` [PATCH v3] " Mattias Rönnblom
  2024-04-26  9:22   ` Morten Brørup
@ 2024-04-26 15:10   ` Stephen Hemminger
  0 siblings, 0 replies; 32+ messages in thread
From: Stephen Hemminger @ 2024-04-26 15:10 UTC (permalink / raw)
To: Mattias Rönnblom
Cc: dev, hofors, Morten Brørup, Thomas Monjalon,
	John W . Linville, Ferruh Yigit, stable

On Fri, 26 Apr 2024 11:05:02 +0200
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> Cache align Rx and Tx queue struct to avoid false sharing.
>
> The RX struct happens to be 64 bytes on x86_64 already, so cache
> alignment has no effect there, but it does on 32-bit ISAs.
>
> The TX struct is 56 bytes on x86_64.
>
> Both structs keep counters, and in the RX case they are updated even
> for empty polls.
>
> v3: Move __rte_cache_aligned directive to a MSVC-compatible location.
>
> Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
> Cc: stable@dpdk.org
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Looks good.
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

If you want to go further with this:
- could rx and tx rings be next to each other so that any of the
  lookahead in the CPU helps, not hurts?
- what about secondary process support?
- would sendmmsg() and recvmmsg() help for bursting?

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH v3] net/af_packet: cache align Rx/Tx structs
  2024-04-26  9:05 ` [PATCH v3] " Mattias Rönnblom
  2024-04-26  9:22   ` Morten Brørup
  2024-04-26 15:10   ` Stephen Hemminger
@ 2024-04-26 15:41   ` Tyler Retzlaff
  2024-04-29  8:46     ` Ferruh Yigit
  2 siblings, 1 reply; 32+ messages in thread
From: Tyler Retzlaff @ 2024-04-26 15:41 UTC (permalink / raw)
To: Mattias Rönnblom
Cc: dev, hofors, Morten Brørup, Stephen Hemminger, Thomas Monjalon,
	John W . Linville, Ferruh Yigit, stable

On Fri, Apr 26, 2024 at 11:05:02AM +0200, Mattias Rönnblom wrote:
> Cache align Rx and Tx queue struct to avoid false sharing.
>
> The RX struct happens to be 64 bytes on x86_64 already, so cache
> alignment has no effect there, but it does on 32-bit ISAs.
>
> The TX struct is 56 bytes on x86_64.
>
> Both structs keep counters, and in the RX case they are updated even
> for empty polls.
>
> v3: Move __rte_cache_aligned directive to a MSVC-compatible location.
>
> Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
> Cc: stable@dpdk.org
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---

Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH v3] net/af_packet: cache align Rx/Tx structs
  2024-04-26 15:41   ` Tyler Retzlaff
@ 2024-04-29  8:46     ` Ferruh Yigit
  0 siblings, 0 replies; 32+ messages in thread
From: Ferruh Yigit @ 2024-04-29 8:46 UTC (permalink / raw)
To: Tyler Retzlaff, Mattias Rönnblom
Cc: dev, hofors, Morten Brørup, Stephen Hemminger, Thomas Monjalon,
	John W . Linville, stable

On 4/26/2024 4:41 PM, Tyler Retzlaff wrote:
> On Fri, Apr 26, 2024 at 11:05:02AM +0200, Mattias Rönnblom wrote:
>> Cache align Rx and Tx queue struct to avoid false sharing.
>>
>> The RX struct happens to be 64 bytes on x86_64 already, so cache
>> alignment has no effect there, but it does on 32-bit ISAs.
>>
>> The TX struct is 56 bytes on x86_64.
>>
>> Both structs keep counters, and in the RX case they are updated even
>> for empty polls.
>>
>> v3: Move __rte_cache_aligned directive to a MSVC-compatible location.
>>
>> Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Stephen Hemminger <stephen@networkplumber.org>
> Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>

Applied to dpdk-next-net/main, thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] net/af_packet: cache align Rx/Tx structs
  2024-04-23  9:08 [PATCH] net/af_packet: cache align Rx/Tx structs Mattias Rönnblom
  2024-04-23 11:15 ` Ferruh Yigit
  2024-04-26  7:38 ` Mattias Rönnblom
@ 2024-04-26 21:27 ` Patrick Robb
  2 siblings, 0 replies; 32+ messages in thread
From: Patrick Robb @ 2024-04-26 21:27 UTC (permalink / raw)
To: Mattias Rönnblom
Cc: John W . Linville, dev, hofors

Recheck-request: iol-compile-amd64-testing

The DPDK Community Lab updated to the latest Alpine image yesterday,
which resulted in all Alpine builds failing. The failure is unrelated to
your patch, and this recheck should remove the fail on Patchwork, as we
have disabled Alpine testing for now.

^ permalink raw reply	[flat|nested] 32+ messages in thread
end of thread, other threads: [~2024-04-29  8:46 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-23  9:08 [PATCH] net/af_packet: cache align Rx/Tx structs Mattias Rönnblom
2024-04-23 11:15 ` Ferruh Yigit
2024-04-23 20:56   ` Mattias Rönnblom
2024-04-24  0:27     ` Honnappa Nagarahalli
2024-04-24  6:28       ` Mattias Rönnblom
2024-04-24 10:21         ` Ferruh Yigit
2024-04-24 10:28           ` Bruce Richardson
2024-04-24 18:02             ` Ferruh Yigit
2024-04-24 11:57           ` Mattias Rönnblom
2024-04-24 17:50             ` Ferruh Yigit
2024-04-24 19:13               ` Stephen Hemminger
2024-04-24 22:27               ` Mattias Rönnblom
2024-04-24 23:55                 ` Stephen Hemminger
2024-04-25  9:26                   ` Mattias Rönnblom
2024-04-25  9:49                     ` Morten Brørup
2024-04-25 14:04                     ` Ferruh Yigit
2024-04-25 15:06                       ` Mattias Rönnblom
2024-04-25 16:21                         ` Ferruh Yigit
2024-04-25 15:07                     ` Stephen Hemminger
2024-04-25 14:08 ` Ferruh Yigit
2024-04-25 15:08   ` Mattias Rönnblom
2024-04-25 15:35     ` Ferruh Yigit
2024-04-26  7:25       ` Mattias Rönnblom
2024-04-26  7:38 ` Mattias Rönnblom
2024-04-26  8:27   ` Ferruh Yigit
2024-04-26 10:20     ` Mattias Rönnblom
2024-04-26  9:05   ` [PATCH v3] " Mattias Rönnblom
2024-04-26  9:22     ` Morten Brørup
2024-04-26 15:10     ` Stephen Hemminger
2024-04-26 15:41     ` Tyler Retzlaff
2024-04-29  8:46       ` Ferruh Yigit
2024-04-26 21:27 ` [PATCH] " Patrick Robb