From: Mattias Rönnblom
To: Honnappa Nagarahalli
Cc: Ferruh Yigit, Mattias Rönnblom, John W. Linville, dev@dpdk.org, Tyler Retzlaff, nd
Subject: Re: [PATCH] net/af_packet: cache align Rx/Tx structs
Date: Wed, 24 Apr 2024 08:28:45 +0200
Message-ID: <1cc4884d-515c-4bb3-90f3-bf0f03205ae7@lysator.liu.se>
In-Reply-To: <4E55A056-C269-4DEA-B702-1979BF66E574@arm.com>

On 2024-04-24 02:27, Honnappa Nagarahalli wrote:
>
>
>> On Apr 23, 2024, at 3:56 PM, Mattias Rönnblom wrote:
>>
>> On 2024-04-23 13:15, Ferruh Yigit wrote:
>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>>
>>>> The RX struct happens to be 64 bytes on x86_64 already, so cache
>>>> alignment makes no difference there, but it does on 32-bit ISAs.
>>>>
>>>> The TX struct is 56 bytes on x86_64.
>>>>
>>> Hi Mattias,
>>> No objection to the patch. Is the improvement theoretical, or did you
>>> measure an improvement in practice? If so, how large is it?
>>
>> I didn't run any benchmarks.
>>
>> Two cores storing to a (falsely) shared cache line on a per-packet
>> basis is going to be very expensive, at least for "light touch"
>> applications.
>>
>>>> Both structs keep counters, and in the RX case they are updated even
>>>> for empty polls.
>>>>
>>> Do you think it would help to move the 'rx_pkts' & 'rx_bytes' updates
>>> inside the loop?
>>
>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
>> are declared volatile, you are forcing a load-modify-store cycle for
>> every increment.
>>
>> I would drop "volatile", or replace it with an atomic (although *not*
>> using an atomic add for the increment, but rather an atomic load +
>> non-atomic add + atomic store).
> (Slightly unrelated discussion)
> Does the atomic load + increment + atomic store help in a non-contended
> case like this? Some platforms have optimizations for atomic increments
> as well, which would be missed.
>

Is it "far atomics" you have in mind? A C11-compliant compiler won't
generate STADD (even for relaxed-order ops), since it doesn't fit well
into the C11 memory model. In particular, the issue is that STADD isn't
ordered by the instructions generated for C11 fences, from what I
understand. GCC did generate STADD for a while, until this issue was
discovered.

On x86_64, ADD is generally much faster than LOCK ADD. No wonder, since
the latter is a full barrier.

If I worked for ARM, I would probably have proposed some extension to
the DPDK atomics API to allow the use of STADD (through inline
assembler). One way to integrate it might be to add a new memory model,
rte_memory_order_unordered or rte_memory_order_jumps_all_fences, where
loads and stores are totally unordered.

However, in this particular case it's a single-writer scenario, so
rte_memory_order_kangaroo wouldn't really help, since it would be
equivalent to rte_memory_order_relaxed on ISAs without far atomics,
which would be too expensive.

In the past I've argued for a single-writer version of atomic add/sub
etc. in the DPDK atomics API (whatever that is), for convenience and to
make the issue/pattern known.
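To make the pattern concrete, here's a rough sketch (illustrative only;
none of these names are an actual DPDK API, and the STADD variant is an
untested assumption of how such an extension could look):

#include <stdatomic.h>
#include <stdint.h>

/* Single-writer counter: only the owning lcore calls add(), but any
 * thread may call read(). */
static inline void
sw_counter_add(atomic_ulong *counter, unsigned long n)
{
	/* Atomic load + non-atomic add + atomic store: compiles to
	 * plain loads and stores (no LOCK ADD on x86_64), which is
	 * fine with a single writer, while the atomics still prevent
	 * torn reads and torn writes. */
	unsigned long v =
		atomic_load_explicit(counter, memory_order_relaxed);
	atomic_store_explicit(counter, v + n, memory_order_relaxed);
}

static inline unsigned long
sw_counter_read(atomic_ulong *counter)
{
	return atomic_load_explicit(counter, memory_order_relaxed);
}

/* What a hypothetical rte_memory_order_unordered add could map to on
 * an ARMv8.1+ (LSE) machine: a far atomic, with no ordering implied. */
static inline void
unordered_counter_add(uint64_t *counter, uint64_t n)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
	__asm__ volatile("stadd %x[n], %[c]"
			 : [c] "+Q" (*counter)
			 : [n] "r" (n));
#else
	/* Fallback: relaxed atomic RMW (e.g., LOCK ADD on x86_64). */
	__atomic_fetch_add(counter, n, __ATOMIC_RELAXED);
#endif
}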
>>
>>>> Signed-off-by: Mattias Rönnblom
>>>> ---
>>>>  drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> index 397a32db58..28aeb7d08e 100644
>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> @@ -6,6 +6,7 @@
>>>>   * All rights reserved.
>>>>   */
>>>>
>>>> +#include <rte_common.h>
>>>>  #include <rte_string_fns.h>
>>>>  #include <rte_mbuf.h>
>>>>  #include <ethdev_driver.h>
>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>  	volatile unsigned long rx_pkts;
>>>>  	volatile unsigned long rx_bytes;
>>>> -};
>>>> +} __rte_cache_aligned;
>>>>
>>> The latest location for the '__rte_cache_aligned' tag is at the
>>> beginning of the struct [1], so something like:
>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>>
>>> [1]
>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>
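For the record, with the tag in its new location, the quoted struct
would read something like this (other members elided; a sketch only,
reconstructed from the hunk above):

struct __rte_cache_aligned pkt_rx_queue {
	/* ... */
	volatile unsigned long rx_pkts;
	volatile unsigned long rx_bytes;
};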