Date: Mon, 29 Jul 2024 21:27:01 +0200
Subject: Re: [PATCH v5 6/6] vhost: optimize memcpy routines when cc memcpy is used
To: Morten Brørup, Mattias Rönnblom, dev@dpdk.org
Cc: Stephen Hemminger, David Marchand, Pavan Nikhilesh, Bruce Richardson
References: <20240620175731.420639-2-mattias.ronnblom@ericsson.com> <20240724075357.546248-1-mattias.ronnblom@ericsson.com> <20240724075357.546248-7-mattias.ronnblom@ericsson.com> <98CBD80474FA8B44BF855DF32C47DC35E9F5B8@smartserver.smartshare.dk>
From: Mattias Rönnblom
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F5B8@smartserver.smartshare.dk>

On 2024-07-29 13:00, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 24 July 2024 09.54
>
> Which packet mix was used for your tests? Synthetic IMIX, or some live data?
>

I used the same test as was being done when the performance regression
was demonstrated (i.e., 2x testpmd with fixed packet size).

>> +/* The code generated by GCC (and to a lesser extent, clang) with just
>> + * a straight memcpy() to copy packets is less than optimal on Intel
>> + * P-cores, for small packets. Thus the need of this specialized
>> + * memcpy() in builds where use_cc_memcpy is set to true.
>> + */
>> +#if defined(RTE_USE_CC_MEMCPY) && defined(RTE_ARCH_X86_64)
>> +static __rte_always_inline void
>> +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
>> +{
>> +	void *dst = __builtin_assume_aligned(in_dst, 16);
>> +	const void *src = __builtin_assume_aligned(in_src, 16);
>> +
>> +	if (len <= 256) {
>> +		size_t left;
>> +
>> +		for (left = len; left >= 32; left -= 32) {
>> +			memcpy(dst, src, 32);
>> +			dst = RTE_PTR_ADD(dst, 32);
>> +			src = RTE_PTR_ADD(src, 32);
>> +		}
>> +
>> +		memcpy(dst, src, left);
>> +	} else
>
> Although the packets within a burst often have similar size, I'm not sure you can rely on the dynamic branch predictor here.
>

I agree that the pktcpy() routine will likely often suffer a
size-related branch mispredict with real packet size mix.
A benchmark with a real packet mix would be much better than the tests
I've run.

This needs to be compared, of course, with the overhead imposed by
conditionals included in other pktcpy() implementations.

> Looking at the ethdev packet size counters at an ISP (at the core of their Layer 3 network), 71 % are 256 byte or larger [1].
>
> For static branch prediction, I would consider > 256 more likely and swap the two branches, i.e. compare (len > 256) instead of (len <= 256).
>

OK, I'll add likely() instead, to make it more explicit.

> But again: I don't know how the dynamic branch predictor behaves here. Perhaps my suggested change makes no difference.
>

I think it will make a difference, but a tiny one. From what I
understand, even when the branch predictor guesses correctly, one
receives a slight benefit if the branch is not taken.

>> +		memcpy(dst, src, len);
>> +}
>
> With or without suggested change,
> Acked-by: Morten Brørup
>
>
> [1]: Details (incl. one VLAN tag)
> tx_size_64_packets              1,1 %
> tx_size_65_to_127_packets      25,7 %
> tx_size_128_to_255_packets      2,6 %
> tx_size_256_to_511_packets      1,4 %
> tx_size_512_to_1023_packets     1,7 %
> tx_size_1024_to_1522_packets   67,6 %
>
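
For reference, the branch-hint change discussed above could look roughly like the sketch below. This is an illustration only, not the actual follow-up patch. The name pktcpy_sketch() is hypothetical. The likely() macro is defined locally as a stand-in for the one DPDK provides. Plain pointer arithmetic replaces RTE_PTR_ADD() so the snippet builds outside the DPDK tree with GCC or clang. The large-packet path is tested first and marked as the expected case, following the observation that most packets in the quoted ISP counters are 256 bytes or larger.

```c
#include <stddef.h>
#include <string.h>

/* Local stand-in for DPDK's likely(); an assumption made so that the
 * sketch compiles on its own. */
#define likely(x) __builtin_expect(!!(x), 1)

/* Hypothetical variant of pktcpy() with the two branches swapped and
 * the large-packet case hinted as the likely one. As in the quoted
 * patch, both pointers are assumed to be 16-byte aligned. */
static inline void
pktcpy_sketch(void *restrict in_dst, const void *restrict in_src, size_t len)
{
	unsigned char *dst = __builtin_assume_aligned(in_dst, 16);
	const unsigned char *src = __builtin_assume_aligned(in_src, 16);
	size_t left;

	if (likely(len > 256)) {
		/* Large packets: leave it to the compiler/libc memcpy(). */
		memcpy(dst, src, len);
		return;
	}

	/* Small packets: copy in fixed 32-byte chunks, which the compiler
	 * can expand inline, then copy the remaining 0-31 bytes. */
	for (left = len; left >= 32; left -= 32) {
		memcpy(dst, src, 32);
		dst += 32;
		src += 32;
	}

	memcpy(dst, src, left);
}
```

Whether the hint helps in practice would still have to be measured with a realistic packet size mix; the sketch only shows where the hint would go.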