Message-ID: <2371b1a8-bdc5-4184-8491-54e2e3a64211@lysator.liu.se>
Date: Thu, 25 Apr 2024 00:27:36 +0200
From: Mattias Rönnblom
Subject: Re: [PATCH] net/af_packet: cache align Rx/Tx structs
To: Stephen Hemminger, Ferruh Yigit
Cc: Mattias Rönnblom, "John W. Linville", dev@dpdk.org, Tyler Retzlaff, Honnappa Nagarahalli
In-Reply-To: <20240424121330.7547e290@hermes.local>
List-Id: DPDK patches and discussions

On 2024-04-24 21:13, Stephen Hemminger wrote:
> On Wed, 24 Apr 2024 18:50:50 +0100
> Ferruh Yigit wrote:
>
>>> I don't know how slow af_packet is, but if you care about performance,
>>> you don't want to use atomic add for statistics.
>>>
>>
>> There are a few soft drivers already using atomic adds for updating stats.
>> If we document the expectations on 'rte_eth_stats_reset()', we can update
>> those usages.
>
> Using atomic add brings a lot of extra overhead. The statistics are not
> guaranteed to be perfect. If nothing else, the bytes and packets can be
> skewed.
>

The sad thing here is that if the counters are reset within the
load-modify-store cycle of the lcore's counter update, the reset may end up
being a nop. So it's not that you missed a packet or two, or suffered some
transient inconsistency; the reset request was completed and then permanently
ignored.

> The performance of the soft drivers af_xdp, af_packet, and tun is dominated
> by the overhead of the kernel system calls and copies. Yes, alignment is
> good, but it won't be noticeable.

There aren't any syscalls in the RX path of the af_packet PMD.
I added the same statistics updates as the af_packet PMD uses into a benchmark app which consumes ~1000 cc between stats updates. If the equivalent of the RX queue struct was cache-aligned, the statistics overhead was so small it was difficult to measure: less than 3-4 cc per update. This was with volatile, but without atomics. If the RX queue struct wasn't cache-aligned, and was sized so that a cache line was generally shared by two (neighboring) cores, the stats incurred a cost of ~55 cc per update.

Shaving off 55 cc should translate into a couple of hundred percent higher performance for an empty af_packet poll. If your lcore has some other primary source of work than the af_packet RX queue, and the RX queue is polled often, then this may well be a noticeable gain.

The benchmark was run on 16 Gracemont cores, which in my experience seem to have somewhat shorter core-to-core latency than many other systems, provided the remote core (the cache line owner) is located in the same cluster.