From: Morten Brørup
To: Bruce Richardson
Cc: dev@dpdk.org, Zhihong Wang
Subject: x86 rte_memcpy_aligned possible optimization
Date: Mon, 27 Mar 2023 13:45:17 +0200

Hi Bruce,

I think one of the loops in rte_memcpy_aligned() takes one round too many in the case where the catch-up could replace the last round.

Consider e.g. n = 128: the loop condition n >= 64 holds for both n = 128 and n = 64, so the 64 bytes block copy takes two rounds, and the catch-up then copies the last 64 bytes once again.

I think the 64 bytes block copy could take only one round and let the catch-up copy the last 64 bytes.

I'm not sure whether my suggested method is generally faster than the current method, so I'm passing the ball.

PS: It looks like something similar can be done for the other block copy loops in this file. I haven't dug into the details.

static __rte_always_inline void *
rte_memcpy_aligned(void *dst, const void *src, size_t n)
{
	void *ret = dst;

	/* Copy size < 16 bytes */
	if (n < 16) {
		return rte_mov15_or_less(dst, src, n);
	}

	/* Copy 16 <= size <= 32 bytes */
	if (n <= 32) {
		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);

		return ret;
	}

	/* Copy 32 < size <= 64 bytes */
	if (n <= 64) {
		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
		rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);

		return ret;
	}

	/* Copy 64 bytes blocks */
-	for (; n >= 64; n -= 64) {
+	for (; n > 64; n -= 64) {
		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
		dst = (uint8_t *)dst + 64;
		src = (const uint8_t *)src + 64;
	}

	/* Copy whatever left */
	rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n);

	return ret;
}

Med venlig hilsen / Kind regards,
-Morten Brørup
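
For illustration, a minimal standalone sketch comparing the two loop bounds (assumptions: mov64() is a stand-in for rte_mov64() that simply copies 64 bytes; copy_old(), copy_new() and the round counter are invented for this comparison and are not DPDK code). It checks that both variants copy correctly and counts the 64 bytes rounds each one takes:

/* Standalone sketch: compares the current (n >= 64) and suggested
 * (n > 64) loop bounds. mov64() is a stand-in for rte_mov64(); it
 * copies exactly 64 bytes and counts its invocations.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static int rounds; /* number of mov64() calls in the last copy */

static void
mov64(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 64);
	rounds++;
}

/* Current loop bound: n >= 64 */
static void
copy_old(void *dst, const void *src, size_t n)
{
	for (; n >= 64; n -= 64) {
		mov64(dst, src);
		dst = (uint8_t *)dst + 64;
		src = (const uint8_t *)src + 64;
	}
	/* Catch-up: copy the last 64 bytes (may overlap the loop's work) */
	mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n);
}

/* Suggested loop bound: n > 64 */
static void
copy_new(void *dst, const void *src, size_t n)
{
	for (; n > 64; n -= 64) {
		mov64(dst, src);
		dst = (uint8_t *)dst + 64;
		src = (const uint8_t *)src + 64;
	}
	mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n);
}

int
main(void)
{
	uint8_t src[256], a[256], b[256];
	size_t i, n;

	for (i = 0; i < sizeof(src); i++)
		src[i] = (uint8_t)i;

	/* In rte_memcpy_aligned(), the loop is only reached for n > 64. */
	for (n = 96; n <= 256; n += 32) {
		int old_rounds, new_rounds;

		rounds = 0;
		copy_old(a, src, n);
		old_rounds = rounds;

		rounds = 0;
		copy_new(b, src, n);
		new_rounds = rounds;

		assert(memcmp(a, src, n) == 0);
		assert(memcmp(b, src, n) == 0);
		printf("n = %3zu: current %d rounds, suggested %d rounds\n",
		    n, old_rounds, new_rounds);
	}
	return 0;
}

With the n > 64 bound, sizes that are a multiple of 64 take one mov64() round less (e.g. n = 128 drops from three rounds to two); all other sizes take the same number of rounds.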