From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <stable-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id C6847A034E
	for <public@inbox.dpdk.org>; Sat, 15 Jan 2022 22:40:48 +0100 (CET)
Received: from [217.70.189.124] (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id B811C41150;
	Sat, 15 Jan 2022 22:40:48 +0100 (CET)
Received: from mail-qk1-f175.google.com (mail-qk1-f175.google.com
 [209.85.222.175])
 by mails.dpdk.org (Postfix) with ESMTP id 638CD4013F;
 Sat, 15 Jan 2022 22:40:46 +0100 (CET)
Received: by mail-qk1-f175.google.com with SMTP id u3so4729138qku.1;
 Sat, 15 Jan 2022 13:40:46 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112;
 h=from:to:cc:subject:date:message-id:in-reply-to:references
 :mime-version:content-transfer-encoding;
 bh=DKN5b91/LsO2GhDKf2EywNX+avjQv5jzNpTWqA99K2E=;
 b=fgv9wFesyzntZVYbGlJsQRiV2Buo4QYecPXWIkENGPktwy34i6HaB5CpnGmWXldryN
 CV7hJuh4Lyhq9P6plpCbsZAEtv7FsXYVt7/PSsl6l247HWcqxfRw1aGOkV4LQFJpFXXY
 m7oSh6Qnx3miUchnB3gR+MVLg7wHkL6J5jBgW9vLnBGLd426sAT3KCkdoVyr/6q7Nmqx
 czMEd2T3hUfFRIZG+GZxj6HwUr6kS8gY8l+QFn7OxQKoqExAy5cd0BfZ6n36/w/1fZYI
 L3S5JineAmiFTkb9GKN4BDJzDwnEwCsxOck7cxsAh5Kk1EOYpsZfaMq8dO4DSQ8IQzTM
 hz/A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references:mime-version:content-transfer-encoding;
 bh=DKN5b91/LsO2GhDKf2EywNX+avjQv5jzNpTWqA99K2E=;
 b=wz2BkKkqeUoYTQ7N7iFZTossHURggo31N8c4hWLAqFJiPGcua+aac9LiZXxyw3FGdq
 9Zp0QWEKaFVuI1XdqXxZLppBdnPO9ZETSEkIEbzL184czR9QOltH8D94igrj9gjRcKyE
 I7dEUjpUTbCc469pgXJv3FFqgyvuqKeIE3CKELKb3tmIFDpyFNQSW2vjkh8wjzx8ePqi
 qPzxtzaaJPNPIh0hx9IeREEkVrBJtArTs1zR3PJYKrLARTBgrk7DC5NGf3iS2HtXvX7/
 XqOgQ/NrtYJfJ7lxo4jnuKWXO/4AN73G9erFV7psXk6FiNuFGBnorimhPlEkJRqYfOwR
 7pAg==
X-Gm-Message-State: AOAM530owFmMOlbmi3lSD2Aky/IamPEvDwBFJrvE57NFiHwW/aQOX5kH
 +2sBR5659BG6zSdUbT14WRE=
X-Google-Smtp-Source: ABdhPJw4OXXPgRxvm6s46rz3M6/U8X16qptKWuGaA5GvAhEyeveT6R/fO8mBeg9FNtF9fjigJNlG+A==
X-Received: by 2002:a37:6947:: with SMTP id e68mr4290307qkc.26.1642282845726; 
 Sat, 15 Jan 2022 13:40:45 -0800 (PST)
Received: from localhost.localdomain
 (bras-base-hullpq2034w-grc-18-74-15-213-135.dsl.bell.ca. [74.15.213.135])
 by smtp.gmail.com with ESMTPSA id j22sm6403775qko.46.2022.01.15.13.40.44
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Sat, 15 Jan 2022 13:40:45 -0800 (PST)
From: Luc Pelletier <lucp.at.work@gmail.com>
To: bruce.richardson@intel.com,
	konstantin.ananyev@intel.com
Cc: dev@dpdk.org, Luc Pelletier <lucp.at.work@gmail.com>,
 Xiaoyun Li <xiaoyun.li@intel.com>, stable@dpdk.org
Subject: [PATCH v2] eal: fix unaligned loads/stores in rte_memcpy_generic
Date: Sat, 15 Jan 2022 16:39:50 -0500
Message-Id: <20220115213949.449313-1-lucp.at.work@gmail.com>
In-Reply-To: <20220115194102.444140-1-lucp.at.work@gmail.com>
References: <20220115194102.444140-1-lucp.at.work@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-BeenThere: stable@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: patches for DPDK stable branches <stable.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/stable>,
 <mailto:stable-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/stable/>
List-Post: <mailto:stable@dpdk.org>
List-Help: <mailto:stable-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/stable>,
 <mailto:stable-request@dpdk.org?subject=subscribe>
Errors-To: stable-bounces@dpdk.org

Calls to rte_memcpy_generic could result in unaligned loads/stores for
1 < n < 16. This is undefined behavior according to the C standard,
and it gets flagged by the clang undefined behavior sanitizer.

rte_memcpy_generic is called with unaligned src and dst addresses.
When 1 < n < 16, the code would cast both src and dst to a qword,
dword or word pointer, without verifying the alignment of src/dst. The
code was changed to use a for loop to copy the bytes one by one.
Experimentation on compiler explorer indicates that gcc 7+
(released in 2017) and clang 7+ (released in 2018) both replace the
for loop with the minimal number of memory loads and stores when n is
known at compile-time. When n is only known at run-time, gcc and clang
behave differently, but both still recognize that a memcpy is being
done. More recent versions of both compilers produce even more
optimized code.

Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")
Cc: Xiaoyun Li <xiaoyun.li@intel.com>
Cc: stable@dpdk.org

Signed-off-by: Luc Pelletier <lucp.at.work@gmail.com>
---

I forgot that code under x86 also needs to compile for 32-bit
(obviously). So, I did some more experimentation and replaced the
assembly code with a regular for loop. Explanations are in the updated
commit message. Experimentation was done on compiler explorer here:
https://godbolt.org/z/zK54rzPEn
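
For reference, here is a minimal standalone sketch (not part of the
patch; all names are made up) of the pattern that the sanitizer flags
and of the byte-by-byte replacement that, per the experimentation
above, compilers collapse into a single wide load/store at -O2 when
the size is a compile-time constant:

#include <stdint.h>
#include <stddef.h>

/* Old pattern: dereferencing a misaligned uint64_t pointer is UB and
 * is reported by clang's -fsanitize=alignment. */
void copy8_cast(void *dst, const void *src)
{
	*(uint64_t *)dst = *(const uint64_t *)src;
}

/* New pattern: byte-by-byte copy, no alignment assumption. */
void copy_n(void *dst, const void *src, size_t n)
{
	char *d = dst;
	const char *s = src;

	for (; n; n--)
		*d++ = *s++;
}

int main(void)
{
	char buf[16] = {0};
	uint64_t v = 42;

	copy_n(buf + 1, &v, sizeof(v));	/* unaligned dst: well-defined */
	/* copy8_cast(buf + 1, &v); */	/* would trip the sanitizer */
	return buf[1];
}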

 lib/eal/x86/include/rte_memcpy.h | 82 ++++++++------------------------
 1 file changed, 20 insertions(+), 62 deletions(-)

diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index 1b6c6e585f..e422397e49 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -45,6 +45,23 @@ extern "C" {
 static __rte_always_inline void *
 rte_memcpy(void *dst, const void *src, size_t n);
 
+/**
+ * Copy bytes from one location to another,
+ * locations should not overlap.
+ * Use with unaligned src/dst, and n <= 15.
+ */
+static __rte_always_inline void *
+rte_mov15_or_less_unaligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+	for (; n; n--) {
+		*((char *)dst) = *((const char *) src);
+		dst = ((char *)dst) + 1;
+		src = ((const char *)src) + 1;
+	}
+	return ret;
+}
+
 #if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
 
 #define ALIGNMENT_MASK 0x3F
@@ -171,8 +188,6 @@ rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
 static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t bits;
@@ -181,24 +196,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08)
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		return ret;
+		return rte_mov15_or_less_unaligned(dst, src, n);
 	}
 
 	/**
@@ -379,8 +377,6 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t bits;
@@ -389,25 +385,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
+		return rte_mov15_or_less_unaligned(dst, src, n);
 	}
 
 	/**
@@ -672,8 +650,6 @@ static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t srcofs;
@@ -682,25 +658,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
+		return rte_mov15_or_less_unaligned(dst, src, n);
 	}
 
 	/**
-- 
2.25.1