DPDK patches and discussions
* [PATCH v1 0/4] implementation of ML common code
@ 2022-12-08 19:35 Srikanth Yalavarthi
  2022-12-08 19:35 ` [PATCH v1 1/4] common/ml: add initial files for " Srikanth Yalavarthi
                   ` (7 more replies)
  0 siblings, 8 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-08 19:35 UTC (permalink / raw)
  Cc: dev, sshankarnara, jerinj, aprabhu, Srikanth Yalavarthi

Machine Learning common code
----------------------------

This patch series implements the common ML code that can be used by
ML drivers. The common code includes functions to convert an ML I/O
type to a string, an I/O format type to a string, a function to get
the size of an ML I/O type in bytes, and functions for converting
data between higher-precision and lower-precision types.

The data type conversion functions support float32, float16,
bfloat16, uint8, int8, uint16 and int16. Two versions of the
conversion functions are implemented in the series: a generic scalar
version and a vector version using Arm NEON intrinsics. When DPDK is
compiled for a platform that supports Arm NEON, only the NEON version
of the routines is enabled. Compilation falls back to the generic
scalar versions on platforms that don't support Arm NEON, such as
x86_64 and PowerPC.
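
As a rough sketch of the intended use by a driver (everything except
the ml_* helpers below is hypothetical):

    /* query element size for the model I/O type and size the staging
     * buffer, then quantize the float32 input with a model-given
     * scale; buffers, scale and nb_elements are hypothetical
     */
    int esize = ml_io_type_size_get(RTE_ML_IO_TYPE_FP32);    /* 4 */
    uint64_t buf_len = nb_elements * esize;                  /* bytes */

    ml_float32_to_int8(scale, nb_elements, f32_buf, i8_buf);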

Srikanth Yalavarthi (4):
  common/ml: add initial files for ML common code
  common/ml: add data type conversion routines
  common/ml: add generic type conversion functions
  common/ml: add Arm NEON type conversion routines

 MAINTAINERS                          |   8 +
 drivers/common/meson.build           |   1 +
 drivers/common/ml/meson.build        |  27 +
 drivers/common/ml/ml_utils.c         | 238 +++++++
 drivers/common/ml/ml_utils.h         | 283 ++++++++
 drivers/common/ml/ml_utils_generic.c | 716 ++++++++++++++++++++
 drivers/common/ml/ml_utils_generic.h |  23 +
 drivers/common/ml/ml_utils_neon.c    | 950 +++++++++++++++++++++++++++
 drivers/common/ml/ml_utils_neon.h    |  23 +
 drivers/common/ml/version.map        |  25 +
 10 files changed, 2294 insertions(+)
 create mode 100644 drivers/common/ml/meson.build
 create mode 100644 drivers/common/ml/ml_utils.c
 create mode 100644 drivers/common/ml/ml_utils.h
 create mode 100644 drivers/common/ml/ml_utils_generic.c
 create mode 100644 drivers/common/ml/ml_utils_generic.h
 create mode 100644 drivers/common/ml/ml_utils_neon.c
 create mode 100644 drivers/common/ml/ml_utils_neon.h
 create mode 100644 drivers/common/ml/version.map

--
2.17.1



* [PATCH v1 1/4] common/ml: add initial files for ML common code
  2022-12-08 19:35 [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
@ 2022-12-08 19:35 ` Srikanth Yalavarthi
  2022-12-08 19:35 ` [PATCH v1 2/4] common/ml: add data type conversion routines Srikanth Yalavarthi
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-08 19:35 UTC (permalink / raw)
  To: Thomas Monjalon, Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Added initial files for the common ML driver code. Implemented
utility functions for ML I/O type-to-size, type-to-string and
format-to-string conversion.
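
A minimal usage sketch of these utilities (illustrative only, not
part of the patch):

    char name[64];
    int size;

    size = ml_io_type_size_get(RTE_ML_IO_TYPE_FP16);            /* 2 */
    ml_io_type_to_str(RTE_ML_IO_TYPE_FP16, name, sizeof(name)); /* "float16" */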

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
Depends-on: series-26046 ("app/mldev: implement test framework for mldev")

 MAINTAINERS                   |   8 +++
 drivers/common/meson.build    |   1 +
 drivers/common/ml/meson.build |  20 +++++++
 drivers/common/ml/ml_utils.c  | 110 ++++++++++++++++++++++++++++++++++
 drivers/common/ml/ml_utils.h  |  50 ++++++++++++++++
 drivers/common/ml/version.map |   9 +++
 6 files changed, 198 insertions(+)
 create mode 100644 drivers/common/ml/meson.build
 create mode 100644 drivers/common/ml/ml_utils.c
 create mode 100644 drivers/common/ml/ml_utils.h
 create mode 100644 drivers/common/ml/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 5fa276fafa..6412209bff 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1431,6 +1431,14 @@ F: drivers/raw/dpaa2_cmdif/
 F: doc/guides/rawdevs/dpaa2_cmdif.rst


+ML Device Drivers
+-----------------
+
+ML common code
+M: Srikanth Yalavarthi <syalavarthi@marvell.com>
+F: drivers/common/ml/
+
+
 Packet processing
 -----------------

diff --git a/drivers/common/meson.build b/drivers/common/meson.build
index b63d899d50..0878dde0a0 100644
--- a/drivers/common/meson.build
+++ b/drivers/common/meson.build
@@ -9,4 +9,5 @@ drivers = [
         'idpf',
+        'ml',
         'mvep',
         'octeontx',
 ]
diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
new file mode 100644
index 0000000000..2749ab6c2e
--- /dev/null
+++ b/drivers/common/ml/meson.build
@@ -0,0 +1,20 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright (c) 2022 Marvell.
+
+if not is_linux or not dpdk_conf.get('RTE_ARCH_64')
+    build = false
+    reason = 'only supported on 64-bit Linux'
+    subdir_done()
+endif
+
+headers = files(
+        'ml_utils.h',
+)
+
+sources = files(
+        'ml_utils.c',
+)
+
+deps += ['mldev']
+
+pmd_supports_disable_iova_as_pa = true
diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
new file mode 100644
index 0000000000..45c1f76a54
--- /dev/null
+++ b/drivers/common/ml/ml_utils.c
@@ -0,0 +1,110 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <rte_mldev.h>
+
+#include "ml_utils.h"
+
+int
+ml_io_type_size_get(enum rte_ml_io_type type)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		return -EINVAL;
+	case RTE_ML_IO_TYPE_INT8:
+		return sizeof(int8_t);
+	case RTE_ML_IO_TYPE_UINT8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_INT16:
+		return sizeof(int16_t);
+	case RTE_ML_IO_TYPE_UINT16:
+		return sizeof(uint16_t);
+	case RTE_ML_IO_TYPE_INT32:
+		return sizeof(int32_t);
+	case RTE_ML_IO_TYPE_UINT32:
+		return sizeof(uint32_t);
+	case RTE_ML_IO_TYPE_FP8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_FP16:
+		return sizeof(uint8_t) * 2;
+	case RTE_ML_IO_TYPE_FP32:
+		return sizeof(uint8_t) * 4;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		return sizeof(uint8_t) * 2;
+	default:
+		return -EINVAL;
+	}
+}
+
+void
+ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		rte_strlcpy(str, "unknown", len);
+		break;
+	case RTE_ML_IO_TYPE_INT8:
+		rte_strlcpy(str, "int8", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT8:
+		rte_strlcpy(str, "uint8", len);
+		break;
+	case RTE_ML_IO_TYPE_INT16:
+		rte_strlcpy(str, "int16", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT16:
+		rte_strlcpy(str, "uint16", len);
+		break;
+	case RTE_ML_IO_TYPE_INT32:
+		rte_strlcpy(str, "int32", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT32:
+		rte_strlcpy(str, "uint32", len);
+		break;
+	case RTE_ML_IO_TYPE_FP8:
+		rte_strlcpy(str, "float8", len);
+		break;
+	case RTE_ML_IO_TYPE_FP16:
+		rte_strlcpy(str, "float16", len);
+		break;
+	case RTE_ML_IO_TYPE_FP32:
+		rte_strlcpy(str, "float32", len);
+		break;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		rte_strlcpy(str, "bfloat16", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
+
+void
+ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
+{
+	switch (format) {
+	case RTE_ML_IO_FORMAT_NCHW:
+		rte_strlcpy(str, "NCHW", len);
+		break;
+	case RTE_ML_IO_FORMAT_NHWC:
+		rte_strlcpy(str, "NHWC", len);
+		break;
+	case RTE_ML_IO_FORMAT_CHWN:
+		rte_strlcpy(str, "CHWN", len);
+		break;
+	case RTE_ML_IO_FORMAT_3D:
+		rte_strlcpy(str, "3D", len);
+		break;
+	case RTE_ML_IO_FORMAT_2D:
+		rte_strlcpy(str, "Matrix", len);
+		break;
+	case RTE_ML_IO_FORMAT_1D:
+		rte_strlcpy(str, "Vector", len);
+		break;
+	case RTE_ML_IO_FORMAT_SCALAR:
+		rte_strlcpy(str, "Scalar", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
diff --git a/drivers/common/ml/ml_utils.h b/drivers/common/ml/ml_utils.h
new file mode 100644
index 0000000000..b6adb98e04
--- /dev/null
+++ b/drivers/common/ml/ml_utils.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _ML_UTILS_H_
+#define _ML_UTILS_H_
+
+#include <rte_compat.h>
+#include <rte_mldev.h>
+
+/**
+ * Get the size of an ML IO type in bytes.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ *
+ * @return
+ *	- > 0, Size of the data type in bytes.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_io_type_size_get(enum rte_ml_io_type type);
+
+/**
+ * Get the name of an ML IO type.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ * @param[in] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len);
+
+/**
+ * Get the name of an ML IO format.
+ *
+ * @param[in] format
+ *	Enumeration of ML IO format.
+ * @param[in] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len);
+
+#endif /*_ML_UTILS_H_ */
diff --git a/drivers/common/ml/version.map b/drivers/common/ml/version.map
new file mode 100644
index 0000000000..7e33755f2f
--- /dev/null
+++ b/drivers/common/ml/version.map
@@ -0,0 +1,9 @@
+INTERNAL {
+	global:
+
+	ml_io_type_size_get;
+	ml_io_type_to_str;
+	ml_io_format_to_str;
+
+	local: *;
+};
--
2.17.1



* [PATCH v1 2/4] common/ml: add data type conversion routines
  2022-12-08 19:35 [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
  2022-12-08 19:35 ` [PATCH v1 1/4] common/ml: add initial files for " Srikanth Yalavarthi
@ 2022-12-08 19:35 ` Srikanth Yalavarthi
  2022-12-08 19:35 ` [PATCH v1 3/4] common/ml: add generic type conversion functions Srikanth Yalavarthi
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-08 19:35 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Type conversion routines transform data from higher-precision to
lower-precision data types, or vice-versa. ML driver implementations
can use these conversion functions for quantization and
de-quantization.

Added driver routines for type conversion. These driver routines are
wired to the architecture-specific implementations in the subsequent
patches; until then the stubs return -ENOTSUP.
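
For reference, a symmetric int8 quantize / de-quantize round trip
with these routines would look as below. The scale derivation and
buffer names are illustrative; the conversions compute
round(input * scale) and input * scale respectively:

    float max_abs = 4.0f;             /* hypothetical tensor range */
    float scale = 127.0f / max_abs;   /* float32 -> int8 multiplier */

    ml_float32_to_int8(scale, nb_elements, f32_in, i8_out);
    ml_int8_to_float32(1.0f / scale, nb_elements, i8_out, f32_back);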

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
 drivers/common/ml/ml_utils.c  | 132 +++++++++++++++++++
 drivers/common/ml/ml_utils.h  | 233 ++++++++++++++++++++++++++++++++++
 drivers/common/ml/version.map |  16 +++
 3 files changed, 381 insertions(+)

diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
index 45c1f76a54..553e906172 100644
--- a/drivers/common/ml/ml_utils.c
+++ b/drivers/common/ml/ml_utils.c
@@ -2,6 +2,10 @@
  * Copyright (c) 2022 Marvell.
  */
 
+#include <errno.h>
+#include <stdint.h>
+
+#include <rte_common.h>
 #include <rte_mldev.h>
 
 #include "ml_utils.h"
@@ -108,3 +112,131 @@ ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
 		rte_strlcpy(str, "invalid", len);
 	}
 }
+
+int
+ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(scale);
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
+
+int
+ml_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(scale);
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
+
+int
+ml_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(scale);
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
+
+int
+ml_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(scale);
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
+
+int
+ml_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(scale);
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
+
+int
+ml_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(scale);
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
+
+int
+ml_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(scale);
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
+
+int
+ml_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(scale);
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
+
+int
+ml_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
+
+int
+ml_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
+
+int
+ml_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
+
+int
+ml_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	RTE_SET_USED(nb_elements);
+	RTE_SET_USED(input);
+	RTE_SET_USED(output);
+
+	return -ENOTSUP;
+}
diff --git a/drivers/common/ml/ml_utils.h b/drivers/common/ml/ml_utils.h
index b6adb98e04..9726c6e3b5 100644
--- a/drivers/common/ml/ml_utils.h
+++ b/drivers/common/ml/ml_utils.h
@@ -47,4 +47,237 @@ void ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len);
 __rte_internal
 void ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len);
 
+/**
+ * Convert a buffer containing numbers in single precision floating point format (float32) to signed
+ * integer format (INT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in signed 8-bit integer format (INT8) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating point format (float32) to
+ * 8-bit integer format (UINT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in unsigned 8-bit integer format (UINT8) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating point format (float32) to
+ * 16-bit integer format (INT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in signed 16-bit integer format (INT16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating point format (float32) to
+ * 16-bit integer format (UINT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in unsigned 16-bit integer format (UINT16) to single
+ * precision floating point format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating point format (float32) to half
+ * precision floating point format (FP16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_float16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in half precision floating point format (FP16) to single
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating point format (float32) to brain
+ * floating point format (bfloat16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store bfloat16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in brain floating point format (bfloat16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing bfloat16 numbers. Size of buffer is equal to (nb_elements * 2)
+ * bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output);
+
 #endif /*_ML_UTILS_H_ */
diff --git a/drivers/common/ml/version.map b/drivers/common/ml/version.map
index 7e33755f2f..35f270f637 100644
--- a/drivers/common/ml/version.map
+++ b/drivers/common/ml/version.map
@@ -5,5 +5,21 @@ INTERNAL {
 	ml_io_type_to_str;
 	ml_io_format_to_str;
 
+	ml_float32_to_int8;
+	ml_int8_to_float32;
+	ml_float32_to_uint8;
+	ml_uint8_to_float32;
+
+	ml_float32_to_int16;
+	ml_int16_to_float32;
+	ml_float32_to_uint16;
+	ml_uint16_to_float32;
+
+	ml_float32_to_float16;
+	ml_float16_to_float32;
+
+	ml_float32_to_bfloat16;
+	ml_bfloat16_to_float32;
+
 	local: *;
 };
-- 
2.17.1



* [PATCH v1 3/4] common/ml: add generic type conversion functions
  2022-12-08 19:35 [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
  2022-12-08 19:35 ` [PATCH v1 1/4] common/ml: add initial files for " Srikanth Yalavarthi
  2022-12-08 19:35 ` [PATCH v1 2/4] common/ml: add data type conversion routines Srikanth Yalavarthi
@ 2022-12-08 19:35 ` Srikanth Yalavarthi
  2022-12-08 19:35 ` [PATCH v1 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-08 19:35 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Added generic implementations to support conversion of data types.
Supported types include int8, uint8, int16, uint16, float16, float32
and bfloat16.
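
The scalar conversions work by splitting the IEEE 754 encoding into
sign, exponent and mantissa fields and re-packing them for the target
type. A minimal sketch of the decomposition, using 1.5f (0x3fc00000)
as an example:

    union { float f; uint32_t u; } v = { .f = 1.5f };
    uint32_t s = v.u >> 31;           /* sign: 0 */
    uint32_t e = (v.u >> 23) & 0xff;  /* biased exponent: 127 */
    uint32_t m = v.u & 0x7fffff;      /* mantissa: 0x400000 */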

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
 drivers/common/ml/meson.build        |   2 +
 drivers/common/ml/ml_utils.c         |  86 +---
 drivers/common/ml/ml_utils_generic.c | 716 +++++++++++++++++++++++++++
 drivers/common/ml/ml_utils_generic.h |  23 +
 4 files changed, 758 insertions(+), 69 deletions(-)
 create mode 100644 drivers/common/ml/ml_utils_generic.c
 create mode 100644 drivers/common/ml/ml_utils_generic.h

diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
index 2749ab6c2e..84ae84ee4e 100644
--- a/drivers/common/ml/meson.build
+++ b/drivers/common/ml/meson.build
@@ -9,10 +9,12 @@ endif
 
 headers = files(
         'ml_utils.h',
+        'ml_utils_generic.h',
 )
 
 sources = files(
         'ml_utils.c',
+        'ml_utils_generic.c',
 )
 
 deps += ['mldev']
diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
index 553e906172..e2edef0904 100644
--- a/drivers/common/ml/ml_utils.c
+++ b/drivers/common/ml/ml_utils.c
@@ -5,10 +5,14 @@
 #include <errno.h>
 #include <stdint.h>
 
-#include <rte_common.h>
 #include <rte_mldev.h>
 
 #include "ml_utils.h"
+#include "ml_utils_generic.h"
+
+#if defined(__ARM_NEON__)
+#include "ml_utils_neon.h"
+#endif
 
 int
 ml_io_type_size_get(enum rte_ml_io_type type)
@@ -116,127 +120,71 @@ ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
 int
 ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(scale);
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_float32_to_int8_generic(scale, nb_elements, input, output);
 }
 
 int
 ml_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(scale);
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_int8_to_float32_generic(scale, nb_elements, input, output);
 }
 
 int
 ml_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(scale);
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_float32_to_uint8_generic(scale, nb_elements, input, output);
 }
 
 int
 ml_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(scale);
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_uint8_to_float32_generic(scale, nb_elements, input, output);
 }
 
 int
 ml_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(scale);
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_float32_to_int16_generic(scale, nb_elements, input, output);
 }
 
 int
 ml_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(scale);
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_int16_to_float32_generic(scale, nb_elements, input, output);
 }
 
 int
 ml_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(scale);
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_float32_to_uint16_generic(scale, nb_elements, input, output);
 }
 
 int
 ml_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(scale);
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_uint16_to_float32_generic(scale, nb_elements, input, output);
 }
 
 int
 ml_float32_to_float16(uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_float32_to_float16_generic(nb_elements, input, output);
 }
 
 int
 ml_float16_to_float32(uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_float16_to_float32_generic(nb_elements, input, output);
 }
 
 int
 ml_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_float32_to_bfloat16_generic(nb_elements, input, output);
 }
 
 int
 ml_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
 {
-	RTE_SET_USED(nb_elements);
-	RTE_SET_USED(input);
-	RTE_SET_USED(output);
-
-	return -ENOTSUP;
+	return ml_bfloat16_to_float32_generic(nb_elements, input, output);
 }
diff --git a/drivers/common/ml/ml_utils_generic.c b/drivers/common/ml/ml_utils_generic.c
new file mode 100644
index 0000000000..ab67a2ac7f
--- /dev/null
+++ b/drivers/common/ml/ml_utils_generic.c
@@ -0,0 +1,716 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <math.h>
+#include <stdint.h>
+
+#include "ml_utils.h"
+#include "ml_utils_generic.h"
+
+#ifndef BIT
+#define BIT(nr) (1UL << (nr))
+#endif
+
+#ifndef BITS_PER_LONG
+#define BITS_PER_LONG (__SIZEOF_LONG__ * 8)
+#endif
+
+#ifndef GENMASK_U32
+#define GENMASK_U32(h, l) (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+#endif
+
+/* float32: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP32_LSB_M 0
+#define FP32_MSB_M 22
+#define FP32_LSB_E 23
+#define FP32_MSB_E 30
+#define FP32_LSB_S 31
+#define FP32_MSB_S 31
+
+/* float32: bitmask for sign, exponent and mantissa */
+#define FP32_MASK_S GENMASK_U32(FP32_MSB_S, FP32_LSB_S)
+#define FP32_MASK_E GENMASK_U32(FP32_MSB_E, FP32_LSB_E)
+#define FP32_MASK_M GENMASK_U32(FP32_MSB_M, FP32_LSB_M)
+
+/* float16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP16_LSB_M 0
+#define FP16_MSB_M 9
+#define FP16_LSB_E 10
+#define FP16_MSB_E 14
+#define FP16_LSB_S 15
+#define FP16_MSB_S 15
+
+/* float16: bitmask for sign, exponent and mantissa */
+#define FP16_MASK_S GENMASK_U32(FP16_MSB_S, FP16_LSB_S)
+#define FP16_MASK_E GENMASK_U32(FP16_MSB_E, FP16_LSB_E)
+#define FP16_MASK_M GENMASK_U32(FP16_MSB_M, FP16_LSB_M)
+
+/* BFLOAT16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define BF16_LSB_M 0
+#define BF16_MSB_M 6
+#define BF16_LSB_E 7
+#define BF16_MSB_E 14
+#define BF16_LSB_S 15
+#define BF16_MSB_S 15
+
+/* BFLOAT16: bitmask for sign, exponent and mantissa */
+#define BF16_MASK_S GENMASK_U32(BF16_MSB_S, BF16_LSB_S)
+#define BF16_MASK_E GENMASK_U32(BF16_MSB_E, BF16_LSB_E)
+#define BF16_MASK_M GENMASK_U32(BF16_MSB_M, BF16_LSB_M)
+
+/* Exponent bias */
+#define FP32_BIAS_E 127
+#define FP16_BIAS_E 15
+#define BF16_BIAS_E 127
+
+#define FP32_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP32_LSB_S) | ((exponent) << FP32_LSB_E) | (mantissa))
+
+#define FP16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP16_LSB_S) | ((exponent) << FP16_LSB_E) | (mantissa))
+
+#define BF16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << BF16_LSB_S) | ((exponent) << BF16_LSB_E) | (mantissa))
+
+/* Represent float32 as float and uint32_t */
+union float32 {
+	float f;
+	uint32_t u;
+};
+
+int
+ml_float32_to_int8_generic(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t i;
+	int32_t i32;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT8_MIN)
+			i32 = INT8_MIN;
+
+		if (i32 > INT8_MAX)
+			i32 = INT8_MAX;
+
+		*output_buffer = (int8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+int
+ml_int8_to_float32_generic(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+int
+ml_float32_to_uint8_generic(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT8_MAX)
+			i32 = UINT8_MAX;
+
+		*output_buffer = (uint8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+int
+ml_uint8_to_float32_generic(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+int
+ml_float32_to_int16_generic(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT16_MIN)
+			i32 = INT16_MIN;
+
+		if (i32 > INT16_MAX)
+			i32 = INT16_MAX;
+
+		*output_buffer = (int16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+int
+ml_int16_to_float32_generic(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+int
+ml_float32_to_uint16_generic(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT16_MAX)
+			i32 = UINT16_MAX;
+
+		*output_buffer = (uint16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+int
+ml_uint16_to_float32_generic(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a half precision
+ * floating point number (float16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_float16_generic_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint32_t tmsb;	   /* MSB position of truncated bits */
+	uint32_t m_32;	   /* temporary float32 mantissa */
+	uint16_t m_16;	   /* temporary float16 mantissa */
+	uint16_t u16;	   /* float16 output */
+	int be_16;	   /* float16 biased exponent, signed */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	f16_s = f32_s;
+	f16_e = 0;
+	f16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		f16_e = 0;
+		if (f32_m == 0) /* zero */
+			f16_m = 0;
+		else /* subnormal number, convert to zero */
+			f16_m = 0;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		f16_e = FP16_MASK_E >> FP16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			f16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			f16_m = f32_m >> (FP32_MSB_M - FP16_MSB_M);
+			f16_m |= BIT(FP16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number */
+		/* compute biased exponent for float16 */
+		be_16 = (int)f32_e - FP32_BIAS_E + FP16_BIAS_E;
+
+		/* overflow, be_16 = [31-INF], set to infinity */
+		if (be_16 >= (int)(FP16_MASK_E >> FP16_LSB_E)) {
+			f16_e = FP16_MASK_E >> FP16_LSB_E;
+			f16_m = 0;
+		} else if ((be_16 >= 1) && (be_16 < (int)(FP16_MASK_E >> FP16_LSB_E))) {
+			/* normal float16, be_16 = [1:30]*/
+			f16_e = be_16;
+			m_16 = f32_m >> (FP32_LSB_E - FP16_LSB_E);
+			tmsb = FP32_MSB_M - FP16_MSB_M - 1;
+			if ((f32_m & GENMASK_U32(tmsb, 0)) > BIT(tmsb)) {
+				/* round: non-zero truncated bits except MSB */
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tmsb, 0)) == BIT(tmsb)) {
+				/* round: MSB of truncated bits and LSB of m_16 is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if ((be_16 >= -(int)(FP16_MSB_M)) && (be_16 < 1)) {
+			/* underflow: zero / subnormal, be_16 = [-9:0] */
+			f16_e = 0;
+
+			/* add implicit leading zero */
+			m_32 = f32_m | BIT(FP32_LSB_E);
+			tbits = FP32_LSB_E - FP16_LSB_E - be_16 + 1;
+			m_16 = m_32 >> tbits;
+
+			/* if non-leading truncated bits are set */
+			if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+				/* if leading truncated bit is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if (be_16 == -(int)(FP16_MSB_M + 1)) {
+			/* underflow: zero, be_16 = [-10] */
+			f16_e = 0;
+			if (f32_m != 0)
+				f16_m = 1;
+			else
+				f16_m = 0;
+		} else {
+			/* underflow: zero, be_16 = [-INF:-11] */
+			f16_e = 0;
+			f16_m = 0;
+		}
+
+		break;
+	}
+
+	u16 = FP16_PACK(f16_s, f16_e, f16_m);
+
+	return u16;
+}
+
+int
+ml_float32_to_float16_generic(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_float16_generic_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a half precision floating point number (float16) into a single precision
+ * floating point number (float32).
+ */
+static float
+__float16_to_float32_generic_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+	uint32_t clz;	   /* count of leading zeroes */
+	int e_16;	   /* float16 exponent unbiased */
+
+	f16_s = (f16 & FP16_MASK_S) >> FP16_LSB_S;
+	f16_e = (f16 & FP16_MASK_E) >> FP16_LSB_E;
+	f16_m = (f16 & FP16_MASK_M) >> FP16_LSB_M;
+
+	f32_s = f16_s;
+	switch (f16_e) {
+	case (FP16_MASK_E >> FP16_LSB_E): /* float16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (f16_m == 0x0) { /* infinity */
+			f32_m = f16_m;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = f16_m;
+			shift = FP32_MSB_M - FP16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* float16: zero or sub-normal */
+		f32_m = f16_m;
+		if (f16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			clz = __builtin_clz((uint32_t)f16_m) - sizeof(uint32_t) * 8 + FP16_LSB_E;
+			e_16 = (int)f16_e - clz;
+			f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+			shift = clz + (FP32_MSB_M - FP16_MSB_M) + 1;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+		}
+		break;
+	default: /* normal numbers */
+		f32_m = f16_m;
+		e_16 = (int)f16_e;
+		f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+		shift = (FP32_MSB_M - FP16_MSB_M);
+		f32_m = (f32_m << shift) & FP32_MASK_M;
+	}
+
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+int
+ml_float16_to_float32_generic(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float16_to_float32_generic_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a
+ * brain float number (bfloat16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_bfloat16_generic_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t b16_s;	   /* bfloat16 sign */
+	uint16_t b16_e;	   /* bfloat16 exponent */
+	uint16_t b16_m;	   /* bfloat16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint16_t u16;	   /* bfloat16 output */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	b16_s = f32_s;
+	b16_e = 0;
+	b16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		b16_e = 0;
+		if (f32_m == 0) /* zero */
+			b16_m = 0;
+		else /* subnormal float32 number, normal bfloat16 */
+			goto bf16_normal;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		b16_e = BF16_MASK_E >> BF16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			b16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			b16_m = f32_m >> (FP32_MSB_M - BF16_MSB_M);
+			b16_m |= BIT(BF16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number, normal bfloat16 */
+		goto bf16_normal;
+	}
+
+	goto bf16_pack;
+
+bf16_normal:
+	b16_e = f32_e;
+	tbits = FP32_MSB_M - BF16_MSB_M;
+	b16_m = f32_m >> tbits;
+
+	/* if non-leading truncated bits are set */
+	if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+		b16_m++;
+
+		/* if overflow into exponent */
+		if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+			b16_e++;
+	} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+		/* if only leading truncated bit is set */
+		if ((b16_m & 0x1) == 0x1) {
+			b16_m++;
+
+			/* if overflow into exponent */
+			if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+				b16_e++;
+		}
+	}
+	b16_m = b16_m & BF16_MASK_M;
+
+bf16_pack:
+	u16 = BF16_PACK(b16_s, b16_e, b16_m);
+
+	return u16;
+}
+
+int
+ml_float32_to_bfloat16_generic(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_bfloat16_generic_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a brain float number (bfloat16) into a
+ * single precision floating point number (float32).
+ */
+static float
+__bfloat16_to_float32_generic_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t b16_s;	   /* bfloat16 sign */
+	uint16_t b16_e;	   /* bfloat16 exponent */
+	uint16_t b16_m;	   /* bfloat16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+
+	b16_s = (f16 & BF16_MASK_S) >> BF16_LSB_S;
+	b16_e = (f16 & BF16_MASK_E) >> BF16_LSB_E;
+	b16_m = (f16 & BF16_MASK_M) >> BF16_LSB_M;
+
+	f32_s = b16_s;
+	switch (b16_e) {
+	case (BF16_MASK_E >> BF16_LSB_E): /* bfloat16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (b16_m == 0x0) { /* infinity */
+			f32_m = 0;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = b16_m;
+			shift = FP32_MSB_M - BF16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* bfloat16: zero or subnormal */
+		f32_m = b16_m;
+		if (b16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			goto fp32_normal;
+		}
+		break;
+	default: /* bfloat16: normal number */
+		goto fp32_normal;
+	}
+
+	goto fp32_pack;
+
+fp32_normal:
+	f32_m = b16_m;
+	f32_e = FP32_BIAS_E + b16_e - BF16_BIAS_E;
+
+	shift = (FP32_MSB_M - BF16_MSB_M);
+	f32_m = (f32_m << shift) & FP32_MASK_M;
+
+fp32_pack:
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+int
+ml_bfloat16_to_float32_generic(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __bfloat16_to_float32_generic_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
diff --git a/drivers/common/ml/ml_utils_generic.h b/drivers/common/ml/ml_utils_generic.h
new file mode 100644
index 0000000000..9d47d8466e
--- /dev/null
+++ b/drivers/common/ml/ml_utils_generic.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _ML_UTILS_GENERIC_H_
+#define _ML_UTILS_GENERIC_H_
+
+#include <stdint.h>
+
+int ml_float32_to_int8_generic(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_int8_to_float32_generic(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_uint8_generic(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_uint8_to_float32_generic(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_int16_generic(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_int16_to_float32_generic(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_uint16_generic(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_uint16_to_float32_generic(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_float16_generic(uint64_t nb_elements, void *input, void *output);
+int ml_float16_to_float32_generic(uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_bfloat16_generic(uint64_t nb_elements, void *input, void *output);
+int ml_bfloat16_to_float32_generic(uint64_t nb_elements, void *input, void *output);
+
+#endif /*_ML_UTILS_GENERIC_H_ */
-- 
2.17.1



* [PATCH v1 4/4] common/ml: add Arm NEON type conversion routines
  2022-12-08 19:35 [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
                   ` (2 preceding siblings ...)
  2022-12-08 19:35 ` [PATCH v1 3/4] common/ml: add generic type conversion functions Srikanth Yalavarthi
@ 2022-12-08 19:35 ` Srikanth Yalavarthi
  2022-12-12  7:16   ` Ruifeng Wang
  2022-12-12 17:21 ` [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-08 19:35 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Ruifeng Wang; +Cc: dev, sshankarnara, jerinj, aprabhu

Added Arm NEON intrinsic based implementations to support conversion
of data types. Supported types include int8, uint8, int16, uint16,
float16, float32 and bfloat16.
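
The NEON paths use the round-to-nearest-ties-away convert
instructions so that results match round() used in the scalar code; a
minimal sketch:

    #include <arm_neon.h>

    float32x4_t f = {-1.5f, -0.5f, 0.5f, 1.5f};
    int32x4_t r = vcvtaq_s32_f32(f);  /* {-2, -1, 1, 2} */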

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
 drivers/common/ml/meson.build     |   5 +
 drivers/common/ml/ml_utils.c      |  48 ++
 drivers/common/ml/ml_utils_neon.c | 950 ++++++++++++++++++++++++++++++
 drivers/common/ml/ml_utils_neon.h |  23 +
 4 files changed, 1026 insertions(+)
 create mode 100644 drivers/common/ml/ml_utils_neon.c
 create mode 100644 drivers/common/ml/ml_utils_neon.h

diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
index 84ae84ee4e..f7ce19b4b4 100644
--- a/drivers/common/ml/meson.build
+++ b/drivers/common/ml/meson.build
@@ -17,6 +17,11 @@ sources = files(
         'ml_utils_generic.c',
 )
 
+if arch_subdir == 'arm'
+    headers += files('ml_utils_neon.h')
+    sources += files('ml_utils_neon.c')
+endif
+
 deps += ['mldev']
 
 pmd_supports_disable_iova_as_pa = true
diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
index e2edef0904..3edcf09fde 100644
--- a/drivers/common/ml/ml_utils.c
+++ b/drivers/common/ml/ml_utils.c
@@ -120,71 +120,119 @@ ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
 int
 ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float32_to_int8_neon(scale, nb_elements, input, output);
+#else
 	return ml_float32_to_int8_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_int8_to_float32_neon(scale, nb_elements, input, output);
+#else
 	return ml_int8_to_float32_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float32_to_uint8_neon(scale, nb_elements, input, output);
+#else
 	return ml_float32_to_uint8_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_uint8_to_float32_neon(scale, nb_elements, input, output);
+#else
 	return ml_uint8_to_float32_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float32_to_int16_neon(scale, nb_elements, input, output);
+#else
 	return ml_float32_to_int16_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_int16_to_float32_neon(scale, nb_elements, input, output);
+#else
 	return ml_int16_to_float32_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float32_to_uint16_neon(scale, nb_elements, input, output);
+#else
 	return ml_float32_to_uint16_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_uint16_to_float32_neon(scale, nb_elements, input, output);
+#else
 	return ml_uint16_to_float32_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_float32_to_float16(uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float32_to_float16_neon(nb_elements, input, output);
+#else
 	return ml_float32_to_float16_generic(nb_elements, input, output);
+#endif
 }
 
 int
 ml_float16_to_float32(uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float16_to_float32_neon(nb_elements, input, output);
+#else
 	return ml_float16_to_float32_generic(nb_elements, input, output);
+#endif
 }
 
 int
 ml_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_FEATURE_BF16)
+	return ml_float32_to_bfloat16_neon(nb_elements, input, output);
+#else
 	return ml_float32_to_bfloat16_generic(nb_elements, input, output);
+#endif
 }
 
 int
 ml_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_FEATURE_BF16)
+	return ml_bfloat16_to_float32_neon(nb_elements, input, output);
+#else
 	return ml_bfloat16_to_float32_generic(nb_elements, input, output);
+#endif
 }
diff --git a/drivers/common/ml/ml_utils_neon.c b/drivers/common/ml/ml_utils_neon.c
new file mode 100644
index 0000000000..b660de07ec
--- /dev/null
+++ b/drivers/common/ml/ml_utils_neon.c
@@ -0,0 +1,950 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <math.h>
+#include <stdint.h>
+
+#include <rte_common.h>
+#include <rte_vect.h>
+
+#include "ml_utils.h"
+#include "ml_utils_neon.h"
+
+#include <arm_neon.h>
+
+static void
+__float32_to_int8_neon_s8x8(float scale, float *input, int8_t *output)
+{
+	int16x4_t s16x4_l;
+	int16x4_t s16x4_h;
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int32x4_t s32x4;
+	int32x4_t vmin;
+	int32x4_t vmax;
+	int8x8_t s8x8;
+
+	/* set constants */
+	vmin = vdupq_n_s32(INT8_MIN);
+	vmax = vdupq_n_s32(INT8_MAX);
+
+	/* load 4 float32 elements, scale, convert, update ranges and narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s32x4 = vminq_s32(s32x4, vmax);
+	s32x4 = vmaxq_s32(s32x4, vmin);
+	s16x4_l = vmovn_s32(s32x4);
+
+	/* load next 4 float32 elements, scale, convert, update ranges and narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s32x4 = vminq_s32(s32x4, vmax);
+	s32x4 = vmaxq_s32(s32x4, vmin);
+	s16x4_h = vmovn_s32(s32x4);
+
+	/* combine lower and higher int16x4_t to int16x8_t */
+	s16x8 = vcombine_s16(s16x4_l, s16x4_h);
+
+	/* narrow to int8_t */
+	s8x8 = vmovn_s16(s16x8);
+
+	/* store 8 elements */
+	vst1_s8(output, s8x8);
+}
+
+static void
+__float32_to_int8_neon_s8x1(float scale, float *input, int8_t *output)
+{
+	float32x2_t f32x2;
+	int32x2_t s32x2;
+	int32x2_t vmin;
+	int32x2_t vmax;
+	int8x8_t s8x8;
+
+	/* set constants */
+	vmin = vdup_n_s32(INT8_MIN);
+	vmax = vdup_n_s32(INT8_MAX);
+
+	/* load element to 2 lanes */
+	f32x2 = vld1_dup_f32(input);
+
+	/* scale */
+	f32x2 = vmul_n_f32(f32x2, scale);
+
+	/* convert using round to nearest with ties away rounding mode */
+	s32x2 = vcvta_s32_f32(f32x2);
+
+	/* update range [INT8_MIN:INT8_MAX] */
+	s32x2 = vmin_s32(s32x2, vmax);
+	s32x2 = vmax_s32(s32x2, vmin);
+
+	/* convert to int8_t */
+	s8x8 = vreinterpret_s8_s32(s32x2);
+
+	/* store lane 0, i.e. a single element */
+	vst1_lane_s8(output, s8x8, 0);
+}
+
+int
+ml_float32_to_int8_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint32_t batch_size;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+	batch_size = 2 * sizeof(float) / sizeof(int8_t);
+
+	/* convert batch_size elements in each iteration */
+	for (i = 0; i < (nb_elements / batch_size); i++) {
+		__float32_to_int8_neon_s8x8(scale, input_buffer, output_buffer);
+		input_buffer += batch_size;
+		output_buffer += batch_size;
+	}
+
+	/* convert leftover elements */
+	i = i * batch_size;
+	for (; i < nb_elements; i++) {
+		__float32_to_int8_neon_s8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__int8_to_float32_neon_f32x8(float scale, int8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 8 x int8_t elements */
+	s8x8 = vld1_s8(input);
+
+	/* widen int8_t to int16_t */
+	s16x8 = vmovl_s8(s8x8);
+
+	/* convert lower 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_low_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_high_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static void
+__int8_to_float32_neon_f32x1(float scale, int8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+ml_int8_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__int8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__float32_to_uint8_neon_u8x8(float scale, float *input, uint8_t *output)
+{
+	uint16x4_t u16x4_l;
+	uint16x4_t u16x4_h;
+	float32x4_t f32x4;
+	uint32x4_t u32x4;
+	uint16x8_t u16x8;
+	uint32x4_t vmax;
+	uint8x8_t u8x8;
+
+	/* set constants */
+	vmax = vdupq_n_u32(UINT8_MAX);
+
+	/* load 4 float elements, scale, convert, update range and narrow to uint16_t.
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u32x4 = vminq_u32(u32x4, vmax);
+	u16x4_l = vmovn_u32(u32x4);
+
+	/* load next 4 float elements, scale, convert, update range and narrow to uint16_t
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u32x4 = vminq_u32(u32x4, vmax);
+	u16x4_h = vmovn_u32(u32x4);
+
+	/* combine lower and higher uint16x4_t */
+	u16x8 = vcombine_u16(u16x4_l, u16x4_h);
+
+	/* narrow to uint8x8_t */
+	u8x8 = vmovn_u16(u16x8);
+
+	/* store 8 elements */
+	vst1_u8(output, u8x8);
+}
+
+static void
+__float32_to_uint8_neon_u8x1(float scale, float *input, uint8_t *output)
+{
+	float32x2_t f32x2;
+	uint32x2_t u32x2;
+	uint32x2_t vmax;
+	uint8x8_t u8x8;
+
+	/* set constants */
+	vmax = vdup_n_u32(UINT8_MAX);
+
+	/* load element to 2 lanes */
+	f32x2 = vld1_dup_f32(input);
+
+	/* scale */
+	f32x2 = vmul_n_f32(f32x2, scale);
+
+	/* convert to uint32_t using round to nearest with ties away rounding mode */
+	u32x2 = vcvta_u32_f32(f32x2);
+
+	/* update range [0:UINT8_MAX] */
+	u32x2 = vmin_u32(u32x2, vmax);
+
+	/* reinterpret as uint8x8_t, lane 0 holds the uint8 result */
+	u8x8 = vreinterpret_u8_u32(u32x2);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_u8(output, u8x8, 0);
+}
+
+int
+ml_float32_to_uint8_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float32_to_uint8_neon_u8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint8_neon_u8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__uint8_to_float32_neon_f32x8(float scale, uint8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x8_t u16x8;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+	uint8x8_t u8x8;
+
+	/* load 8 x uint8_t elements */
+	u8x8 = vld1_u8(input);
+
+	/* widen uint8_t to uint16_t */
+	u16x8 = vmovl_u8(u8x8);
+
+	/* convert lower 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_low_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_high_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static void
+__uint8_to_float32_neon_f32x1(float scale, uint8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+ml_uint8_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__uint8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__float32_to_int16_neon_s16x4(float scale, float *input, int16_t *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+	int32x4_t vmin;
+	int32x4_t vmax;
+
+	/* set constants */
+	vmin = vdupq_n_s32(INT16_MIN);
+	vmax = vdupq_n_s32(INT16_MAX);
+
+	/* load 4 x float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert to int32x4_t using round to nearest with ties away rounding mode */
+	s32x4 = vcvtaq_s32_f32(f32x4);
+
+	/* update range [INT16_MIN:INT16_MAX] */
+	s32x4 = vminq_s32(s32x4, vmax);
+	s32x4 = vmaxq_s32(s32x4, vmin);
+
+	/* narrow to int16x4_t */
+	s16x4 = vmovn_s32(s32x4);
+
+	/* store 4 elements */
+	vst1_s16(output, s16x4);
+}
+
+static void
+__float32_to_int16_neon_s16x1(float scale, float *input, int16_t *output)
+{
+	float32x2_t f32x2;
+	int32x2_t s32x2;
+	int16x4_t s16x4;
+	int32x2_t vmin;
+	int32x2_t vmax;
+
+	/* set constants */
+	vmin = vdup_n_s32(INT16_MIN);
+	vmax = vdup_n_s32(INT16_MAX);
+
+	/* load element to 2 lanes */
+	f32x2 = vld1_dup_f32(input);
+
+	/* scale */
+	f32x2 = vmul_n_f32(f32x2, scale);
+
+	/* convert using round to nearest with ties to away rounding mode */
+	s32x2 = vcvta_s32_f32(f32x2);
+
+	/* update range [INT16_MIN:INT16_MAX] */
+	s32x2 = vmin_s32(s32x2, vmax);
+	s32x2 = vmax_s32(s32x2, vmin);
+
+	/* reinterpret as int16x4_t, lane 0 holds the int16 result */
+	s16x4 = vreinterpret_s16_s32(s32x2);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_s16(output, s16x4, 0);
+}
+
+int
+ml_float32_to_int16_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float32_to_int16_neon_s16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int16_neon_s16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__int16_to_float32_neon_f32x4(float scale, int16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x int16_t elements */
+	s16x4 = vld1_s16(input);
+
+	/* widen int16_t to int32_t */
+	s32x4 = vmovl_s16(s16x4);
+
+	/* convert int32_t to float */
+	f32x4 = vcvtq_f32_s32(s32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static void
+__int16_to_float32_neon_f32x1(float scale, int16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+ml_int16_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__int16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__float32_to_uint16_neon_u16x4(float scale, float *input, uint16_t *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+	uint32x4_t vmax;
+
+	/* set constants */
+	vmax = vdupq_n_u32(UINT16_MAX);
+
+	/* load 4 float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert using round to nearest with ties to away rounding mode */
+	u32x4 = vcvtaq_u32_f32(f32x4);
+
+	/* update range [0:UINT16_MAX] */
+	u32x4 = vminq_u32(u32x4, vmax);
+
+	/* narrow */
+	u16x4 = vmovn_u32(u32x4);
+
+	/* store 4 elements */
+	vst1_u16(output, u16x4);
+}
+
+static void
+__float32_to_uint16_neon_u16x1(float scale, float *input, uint16_t *output)
+{
+	float32x2_t f32x2;
+	uint16x4_t u16x4;
+	uint32x2_t u32x2;
+	uint32x2_t vmax;
+
+	/* set constants */
+	vmax = vdup_n_u32(UINT16_MAX);
+
+	/* load element to 2 lanes */
+	f32x2 = vld1_dup_f32(input);
+
+	/* scale */
+	f32x2 = vmul_n_f32(f32x2, scale);
+
+	/* convert to uint32_t using round to nearest with ties to away, negatives saturate to zero */
+	u32x2 = vcvta_u32_f32(f32x2);
+
+	/* update range [0:UINT16_MAX] */
+	u32x2 = vmin_u32(u32x2, vmax);
+
+	/* reinterpret as uint16x4_t, lane 0 holds the uint16 result */
+	u16x4 = vreinterpret_u16_u32(u32x2);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_u16(output, u16x4, 0);
+}
+
+int
+ml_float32_to_uint16_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float32_to_uint16_neon_u16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint16_neon_u16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__uint16_to_float32_neon_f32x4(float scale, uint16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 x uint16_t elements */
+	u16x4 = vld1_u16(input);
+
+	/* widen uint16_t to uint32_t */
+	u32x4 = vmovl_u16(u16x4);
+
+	/* convert uint32_t to float */
+	f32x4 = vcvtq_f32_u32(u32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static void
+__uint16_to_float32_neon_f32x1(float scale, uint16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+ml_uint16_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__uint16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__float32_to_float16_neon_f16x4(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert to float16x4_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store float16x4_t */
+	vst1_f16(output, f16x4);
+}
+
+static void
+__float32_to_float16_neon_f16x1(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to float16_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_f16(output, f16x4, 0);
+}
+
+int
+ml_float32_to_float16_neon(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	float16_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (float16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float32_to_float16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_float16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__float16_to_float32_neon_f32x4(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x float16_t elements */
+	f16x4 = vld1_f16(input);
+
+	/* convert float16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static void
+__float16_to_float32_neon_f32x1(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	f16x4 = vld1_dup_f16(input);
+
+	/* convert float16_t to float32_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+ml_float16_to_float32_neon(uint64_t nb_elements, void *input, void *output)
+{
+	float16_t *input_buffer;
+	float32_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#ifdef __ARM_FEATURE_BF16
+
+static void
+__float32_to_bfloat16_neon_f16x4(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert float32x4_t to bfloat16x4_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store bfloat16x4_t */
+	vst1_bf16(output, bf16x4);
+}
+
+static void
+__float32_to_bfloat16_neon_f16x1(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to bfloat16_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_bf16(output, bf16x4, 0);
+}
+
+int
+ml_float32_to_bfloat16_neon(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	bfloat16_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (bfloat16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float32_to_bfloat16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_bfloat16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__bfloat16_to_float32_neon_f32x4(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x bfloat16_t elements */
+	bf16x4 = vld1_bf16(input);
+
+	/* convert bfloat16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static void
+__bfloat16_to_float32_neon_f32x1(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	bf16x4 = vld1_dup_bf16(input);
+
+	/* convert bfloat16_t to float32_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store lane 0 / 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+ml_bfloat16_to_float32_neon(uint64_t nb_elements, void *input, void *output)
+{
+	bfloat16_t *input_buffer;
+	float32_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (bfloat16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__bfloat16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__bfloat16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#endif /* __ARM_FEATURE_BF16 */
diff --git a/drivers/common/ml/ml_utils_neon.h b/drivers/common/ml/ml_utils_neon.h
new file mode 100644
index 0000000000..d912049779
--- /dev/null
+++ b/drivers/common/ml/ml_utils_neon.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _ML_UTILS_NEON_H_
+#define _ML_UTILS_NEON_H_
+
+#include <stdint.h>
+
+int ml_float32_to_int8_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_int8_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_uint8_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_uint8_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_int16_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_int16_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_uint16_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_uint16_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_float16_neon(uint64_t nb_elements, void *input, void *output);
+int ml_float16_to_float32_neon(uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_bfloat16_neon(uint64_t nb_elements, void *input, void *output);
+int ml_bfloat16_to_float32_neon(uint64_t nb_elements, void *input, void *output);
+
+#endif /* _ML_UTILS_NEON_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v1 4/4] common/ml: add Arm NEON type conversion routines
  2022-12-08 19:35 ` [PATCH v1 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
@ 2022-12-12  7:16   ` Ruifeng Wang
  2022-12-12 17:25     ` Srikanth Yalavarthi
  0 siblings, 1 reply; 59+ messages in thread
From: Ruifeng Wang @ 2022-12-12  7:16 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu, nd

> -----Original Message-----
> From: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Sent: Friday, December 9, 2022 3:36 AM
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; sshankarnara@marvell.com; jerinj@marvell.com; aprabhu@marvell.com
> Subject: [PATCH v1 4/4] common/ml: add Arm NEON type conversion routines
> 
> Added ARM NEON intrinsic based implementations to support conversion of data types.
> Support is enabled to handle int8, uint8, int16, uint16, float16, float32 and bfloat16
> types.
> 
> Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
> ---
>  drivers/common/ml/meson.build     |   5 +
>  drivers/common/ml/ml_utils.c      |  48 ++
>  drivers/common/ml/ml_utils_neon.c | 950 ++++++++++++++++++++++++++++++
>  drivers/common/ml/ml_utils_neon.h |  23 +
>  4 files changed, 1026 insertions(+)
>  create mode 100644 drivers/common/ml/ml_utils_neon.c
>  create mode 100644 drivers/common/ml/ml_utils_neon.h
> 
> diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
> index 84ae84ee4e..f7ce19b4b4 100644
> --- a/drivers/common/ml/meson.build
> +++ b/drivers/common/ml/meson.build
> @@ -17,6 +17,11 @@ sources = files(
>          'ml_utils_generic.c',
>  )
> 
> +if arch_subdir == 'arm'
> +    headers += files('ml_utils_neon.h')
> +    sources += files('ml_utils_neon.c')
> +endif
> +
>  deps += ['mldev']
> 
>  pmd_supports_disable_iova_as_pa = true
> diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
> index e2edef0904..3edcf09fde 100644
> --- a/drivers/common/ml/ml_utils.c
> +++ b/drivers/common/ml/ml_utils.c
> @@ -120,71 +120,119 @@ ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
>  int
>  ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
>  {
> +#if defined(__ARM_NEON__)
> +	return ml_float32_to_int8_neon(scale, nb_elements, input, output);
> +#else
>  	return ml_float32_to_int8_generic(scale, nb_elements, input, output);
> +#endif
>  }
> 
Maybe __rte_weak can be used to remove the ifdef clutter.

Something like:
ml_utils.c:
__rte_weak int ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
{
	return ml_float32_to_int8_generic(scale, nb_elements, input, output);
}

ml_utils_neon.c:
int ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
{
	return ml_float32_to_int8_neon(scale, nb_elements, input, output);
}
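
(A strong definition of the same symbol in ml_utils_neon.c, which is only
compiled when arch_subdir == 'arm', overrides the __rte_weak generic one
at link time, so the #if/#else blocks in ml_utils.c become unnecessary.)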

<snip>
> diff --git a/drivers/common/ml/ml_utils_neon.c b/drivers/common/ml/ml_utils_neon.c
> new file mode 100644
> index 0000000000..b660de07ec
> --- /dev/null
> +++ b/drivers/common/ml/ml_utils_neon.c
> @@ -0,0 +1,950 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2022 Marvell.
> + */
> +
> +#include <errno.h>
> +#include <math.h>
> +#include <stdint.h>
> +
> +#include <rte_common.h>
> +#include <rte_vect.h>
> +
> +#include "ml_utils.h"
> +#include "ml_utils_neon.h"
> +
> +#include <arm_neon.h>
This line can be removed. It is already included by rte_vect.h.

Thanks.
<snip>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v1 0/4] implementation of ML common code
  2022-12-08 19:35 [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
                   ` (3 preceding siblings ...)
  2022-12-08 19:35 ` [PATCH v1 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
@ 2022-12-12 17:21 ` Srikanth Yalavarthi
  2022-12-12 17:21   ` [PATCH v2 1/4] common/ml: add initial files for " Srikanth Yalavarthi
                     ` (4 more replies)
  2023-02-01  9:04 ` [PATCH v4 0/4] Implementation " Srikanth Yalavarthi
                   ` (2 subsequent siblings)
  7 siblings, 5 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-12 17:21 UTC (permalink / raw)
  Cc: dev, sshankarnara, jerinj, aprabhu, Srikanth Yalavarthi

Machine Learning common code
----------------------------

This patch series implements the common ML code that can be used by
ML drivers. The common code includes functions to convert ML IO type
to string, IO format type to string, a function to get the size of an
ML IO type, and functions for converting data types from higher
precision to lower precision and vice-versa.

Data type conversion functions support handling float32, float16,
bfloat16, uint8, int8, uint16 and int16. Two versions of the conversion
functions are implemented in the series, a generic scalar version and a
vector version using Arm NEON intrinsics. When compiling DPDK for a
platform supporting Arm NEON, the vector NEON version of the routines
is enabled. Compilation falls back to the generic scalar versions on
platforms like x86_64 / PowerPC etc., that don't support Arm NEON.
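
As an illustration (a minimal sketch, not part of the series; the
function name is hypothetical), the float32 to int8 routines implement
a scale, round to nearest with ties away from zero, then saturate rule:

	#include <math.h>
	#include <stdint.h>

	static int8_t
	quantize_f32_to_i8(float x, float scale)
	{
		/* round() rounds halfway cases away from zero, matching
		 * the behaviour of the NEON vcvta* instructions used in
		 * the vector path.
		 */
		int32_t i32 = (int32_t)round(x * scale);

		/* saturate to the int8 range */
		if (i32 < INT8_MIN)
			i32 = INT8_MIN;
		if (i32 > INT8_MAX)
			i32 = INT8_MAX;

		return (int8_t)i32;
	}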


Srikanth Yalavarthi (4):
  common/ml: add initial files for ML common code
  common/ml: add common utility functions
  common/ml: add scalar type conversion functions
  common/ml: add Arm NEON type conversion routines

 MAINTAINERS                         |   8 +
 drivers/common/meson.build          |   1 +
 drivers/common/ml/meson.build       |  25 +
 drivers/common/ml/ml_utils.c        | 118 ++++
 drivers/common/ml/ml_utils.h        | 283 +++++++++
 drivers/common/ml/ml_utils_neon.c   | 873 ++++++++++++++++++++++++++++
 drivers/common/ml/ml_utils_scalar.c | 720 +++++++++++++++++++++++
 drivers/common/ml/version.map       |  25 +
 8 files changed, 2053 insertions(+)
 create mode 100644 drivers/common/ml/meson.build
 create mode 100644 drivers/common/ml/ml_utils.c
 create mode 100644 drivers/common/ml/ml_utils.h
 create mode 100644 drivers/common/ml/ml_utils_neon.c
 create mode 100644 drivers/common/ml/ml_utils_scalar.c
 create mode 100644 drivers/common/ml/version.map

--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 1/4] common/ml: add initial files for ML common code
  2022-12-12 17:21 ` [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
@ 2022-12-12 17:21   ` Srikanth Yalavarthi
  2022-12-12 17:21   ` [PATCH v2 2/4] common/ml: add common utility functions Srikanth Yalavarthi
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-12 17:21 UTC (permalink / raw)
  To: Thomas Monjalon, Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Added ML common header files and skeleton code. Common ML code
includes utility routines to convert ML IO type and format to
string, IO type to size and routines to convert data types.
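
For example (illustrative only; the buffer length is an arbitrary
assumption), a driver could use these routines as:

	char str[64];
	int size;

	ml_io_type_to_str(RTE_ML_IO_TYPE_FP16, str, sizeof(str)); /* str = "float16" */
	size = ml_io_type_size_get(RTE_ML_IO_TYPE_FP16);          /* size = 2 */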

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
Depends-on: series-26046 ("app/mldev: implement test framework for mldev")

v2:
* Moved implementation out of patch. Only headers are included.

 MAINTAINERS                   |   8 +
 drivers/common/meson.build    |   1 +
 drivers/common/ml/meson.build |  20 +++
 drivers/common/ml/ml_utils.c  |   5 +
 drivers/common/ml/ml_utils.h  | 283 ++++++++++++++++++++++++++++++++++
 5 files changed, 317 insertions(+)
 create mode 100644 drivers/common/ml/meson.build
 create mode 100644 drivers/common/ml/ml_utils.c
 create mode 100644 drivers/common/ml/ml_utils.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 5fa276fafa..6412209bff 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1431,6 +1431,14 @@ F: drivers/raw/dpaa2_cmdif/
 F: doc/guides/rawdevs/dpaa2_cmdif.rst


+ML Device Drivers
+------------------------
+
+ML common code
+M: Srikanth Yalavarthi <syalavarthi@marvell.com>
+F: drivers/common/ml/
+
+
 Packet processing
 -----------------

diff --git a/drivers/common/meson.build b/drivers/common/meson.build
index b63d899d50..0878dde0a0 100644
--- a/drivers/common/meson.build
+++ b/drivers/common/meson.build
@@ -9,4 +9,5 @@ drivers = [
         'idpf',
         'mvep',
         'octeontx',
+        'ml',
 ]
diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
new file mode 100644
index 0000000000..2749ab6c2e
--- /dev/null
+++ b/drivers/common/ml/meson.build
@@ -0,0 +1,20 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright (c) 2022 Marvell.
+
+if not is_linux or not dpdk_conf.get('RTE_ARCH_64')
+    build = false
+    reason = 'only supported on 64-bit Linux'
+    subdir_done()
+endif
+
+headers = files(
+        'ml_utils.h',
+)
+
+sources = files(
+        'ml_utils.c',
+)
+
+deps += ['mldev']
+
+pmd_supports_disable_iova_as_pa = true
diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
new file mode 100644
index 0000000000..90bc280e4b
--- /dev/null
+++ b/drivers/common/ml/ml_utils.c
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include "ml_utils.h"
diff --git a/drivers/common/ml/ml_utils.h b/drivers/common/ml/ml_utils.h
new file mode 100644
index 0000000000..9726c6e3b5
--- /dev/null
+++ b/drivers/common/ml/ml_utils.h
@@ -0,0 +1,283 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _ML_UTILS_H_
+#define _ML_UTILS_H_
+
+#include <rte_compat.h>
+#include <rte_mldev.h>
+
+/**
+ * Get the size of an ML IO type in bytes.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ *
+ * @return
+ *	- > 0, Size of the data type in bytes.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_io_type_size_get(enum rte_ml_io_type type);
+
+/**
+ * Get the name of an ML IO type.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ * @param[out] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len);
+
+/**
+ * Get the name of an ML IO format.
+ *
+ * @param[in] format
+ *	Enumeration of ML IO format.
+ * @param[out] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed 8-bit
+ * integer format (INT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in signed 8-bit integer format (INT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 8-bit integer format (UINT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in unsigned 8-bit integer format (UINT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed
+ * 16-bit integer format (INT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in signed 16-bit integer format (INT16) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 16-bit integer format (UINT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in unsigned 16-bit integer format (UINT16) to single
+ * precision floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to half
+ * precision floating point format (FP16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_float16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in half precision floating format (FP16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to brain
+ * floating point format (bfloat16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store bfloat16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in brain floating point format (bfloat16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing bfloat16 numbers. Size of buffer is equal to (nb_elements * 2)
+ * bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+#endif /* _ML_UTILS_H_ */
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 2/4] common/ml: add common utility functions
  2022-12-12 17:21 ` [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
  2022-12-12 17:21   ` [PATCH v2 1/4] common/ml: add initial files for " Srikanth Yalavarthi
@ 2022-12-12 17:21   ` Srikanth Yalavarthi
  2022-12-12 17:21   ` [PATCH v2 3/4] common/ml: add scalar type conversion functions Srikanth Yalavarthi
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-12 17:21 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Implemented ML common utility functions to convert IO data type to
name, IO format to name and routine to get the size of an IO data
type in bytes.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v2:
* Implemented common utility functions as part of the patch
* Dropped use of driver routines for data conversion functions

 drivers/common/ml/ml_utils.c  | 113 ++++++++++++++++++++++++++++++++++
 drivers/common/ml/version.map |   9 +++
 2 files changed, 122 insertions(+)
 create mode 100644 drivers/common/ml/version.map

diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
index 90bc280e4b..59753c5468 100644
--- a/drivers/common/ml/ml_utils.c
+++ b/drivers/common/ml/ml_utils.c
@@ -2,4 +2,117 @@
  * Copyright (c) 2022 Marvell.
  */

+#include <errno.h>
+#include <stdint.h>
+
+#include <rte_mldev.h>
+#include <rte_string_fns.h>
+
 #include "ml_utils.h"
+
+/* Description:
+ * This file implements Machine Learning utility routines, except type conversion routines.
+ */
+
+int
+ml_io_type_size_get(enum rte_ml_io_type type)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		return -EINVAL;
+	case RTE_ML_IO_TYPE_INT8:
+		return sizeof(int8_t);
+	case RTE_ML_IO_TYPE_UINT8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_INT16:
+		return sizeof(int16_t);
+	case RTE_ML_IO_TYPE_UINT16:
+		return sizeof(uint16_t);
+	case RTE_ML_IO_TYPE_INT32:
+		return sizeof(int32_t);
+	case RTE_ML_IO_TYPE_UINT32:
+		return sizeof(uint32_t);
+	case RTE_ML_IO_TYPE_FP8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_FP16:
+		return sizeof(uint8_t) * 2;
+	case RTE_ML_IO_TYPE_FP32:
+		return sizeof(uint8_t) * 4;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		return sizeof(uint8_t) * 2;
+	default:
+		return -EINVAL;
+	}
+}
+
+void
+ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		rte_strlcpy(str, "unknown", len);
+		break;
+	case RTE_ML_IO_TYPE_INT8:
+		rte_strlcpy(str, "int8", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT8:
+		rte_strlcpy(str, "uint8", len);
+		break;
+	case RTE_ML_IO_TYPE_INT16:
+		rte_strlcpy(str, "int16", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT16:
+		rte_strlcpy(str, "uint16", len);
+		break;
+	case RTE_ML_IO_TYPE_INT32:
+		rte_strlcpy(str, "int32", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT32:
+		rte_strlcpy(str, "uint32", len);
+		break;
+	case RTE_ML_IO_TYPE_FP8:
+		rte_strlcpy(str, "float8", len);
+		break;
+	case RTE_ML_IO_TYPE_FP16:
+		rte_strlcpy(str, "float16", len);
+		break;
+	case RTE_ML_IO_TYPE_FP32:
+		rte_strlcpy(str, "float32", len);
+		break;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		rte_strlcpy(str, "bfloat16", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
+
+void
+ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
+{
+	switch (format) {
+	case RTE_ML_IO_FORMAT_NCHW:
+		rte_strlcpy(str, "NCHW", len);
+		break;
+	case RTE_ML_IO_FORMAT_NHWC:
+		rte_strlcpy(str, "NHWC", len);
+		break;
+	case RTE_ML_IO_FORMAT_CHWN:
+		rte_strlcpy(str, "CHWN", len);
+		break;
+	case RTE_ML_IO_FORMAT_3D:
+		rte_strlcpy(str, "3D", len);
+		break;
+	case RTE_ML_IO_FORMAT_2D:
+		rte_strlcpy(str, "Matrix", len);
+		break;
+	case RTE_ML_IO_FORMAT_1D:
+		rte_strlcpy(str, "Vector", len);
+		break;
+	case RTE_ML_IO_FORMAT_SCALAR:
+		rte_strlcpy(str, "Scalar", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
diff --git a/drivers/common/ml/version.map b/drivers/common/ml/version.map
new file mode 100644
index 0000000000..7e33755f2f
--- /dev/null
+++ b/drivers/common/ml/version.map
@@ -0,0 +1,9 @@
+INTERNAL {
+	global:
+
+	ml_io_type_size_get;
+	ml_io_type_to_str;
+	ml_io_format_to_str;
+
+	local: *;
+};
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 3/4] common/ml: add scalar type conversion functions
  2022-12-12 17:21 ` [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
  2022-12-12 17:21   ` [PATCH v2 1/4] common/ml: add initial files for " Srikanth Yalavarthi
  2022-12-12 17:21   ` [PATCH v2 2/4] common/ml: add common utility functions Srikanth Yalavarthi
@ 2022-12-12 17:21   ` Srikanth Yalavarthi
  2022-12-12 17:21   ` [PATCH v2 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
  2022-12-20 17:52   ` [PATCH v3 0/4] implementation of ML common code Srikanth Yalavarthi
  4 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-12 17:21 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Added scalar implementations to support conversion of data types.
Support is enabled to handle int8, uint8, int16, uint16, float16,
float32 and bfloat16 types.
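
As a worked example of the float32 to float16 rule (illustrative, not
from the patch): 1.0f is encoded as 0x3F800000 (sign 0, biased exponent
127, mantissa 0); re-biasing the exponent (127 - 127 + 15 = 15) and
keeping the top 10 mantissa bits yields the float16 encoding 0x3C00.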

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v2:
* Updated internal function names
* Updated function attributes to __rte_weak

 drivers/common/ml/meson.build       |   1 +
 drivers/common/ml/ml_utils_scalar.c | 720 ++++++++++++++++++++++++++++
 drivers/common/ml/version.map       |  16 +
 3 files changed, 737 insertions(+)
 create mode 100644 drivers/common/ml/ml_utils_scalar.c

diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
index 2749ab6c2e..59b58b95b4 100644
--- a/drivers/common/ml/meson.build
+++ b/drivers/common/ml/meson.build
@@ -13,6 +13,7 @@ headers = files(

 sources = files(
         'ml_utils.c',
+        'ml_utils_scalar.c',
 )

 deps += ['mldev']
diff --git a/drivers/common/ml/ml_utils_scalar.c b/drivers/common/ml/ml_utils_scalar.c
new file mode 100644
index 0000000000..1272d67593
--- /dev/null
+++ b/drivers/common/ml/ml_utils_scalar.c
@@ -0,0 +1,720 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <math.h>
+#include <stdint.h>
+
+#include "ml_utils.h"
+
+/* Description:
+ * This file implements scalar versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa.
+ */
+
+#ifndef BIT
+#define BIT(nr) (1UL << (nr))
+#endif
+
+#ifndef BITS_PER_LONG
+#define BITS_PER_LONG (__SIZEOF_LONG__ * 8)
+#endif
+
+#ifndef GENMASK_U32
+#define GENMASK_U32(h, l) (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+#endif
+
+/* float32: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP32_LSB_M 0
+#define FP32_MSB_M 22
+#define FP32_LSB_E 23
+#define FP32_MSB_E 30
+#define FP32_LSB_S 31
+#define FP32_MSB_S 31
+
+/* float32: bitmask for sign, exponent and mantissa */
+#define FP32_MASK_S GENMASK_U32(FP32_MSB_S, FP32_LSB_S)
+#define FP32_MASK_E GENMASK_U32(FP32_MSB_E, FP32_LSB_E)
+#define FP32_MASK_M GENMASK_U32(FP32_MSB_M, FP32_LSB_M)
+
+/* float16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP16_LSB_M 0
+#define FP16_MSB_M 9
+#define FP16_LSB_E 10
+#define FP16_MSB_E 14
+#define FP16_LSB_S 15
+#define FP16_MSB_S 15
+
+/* float16: bitmask for sign, exponent and mantissa */
+#define FP16_MASK_S GENMASK_U32(FP16_MSB_S, FP16_LSB_S)
+#define FP16_MASK_E GENMASK_U32(FP16_MSB_E, FP16_LSB_E)
+#define FP16_MASK_M GENMASK_U32(FP16_MSB_M, FP16_LSB_M)
+
+/* bfloat16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define BF16_LSB_M 0
+#define BF16_MSB_M 6
+#define BF16_LSB_E 7
+#define BF16_MSB_E 14
+#define BF16_LSB_S 15
+#define BF16_MSB_S 15
+
+/* bfloat16: bitmask for sign, exponent and mantissa */
+#define BF16_MASK_S GENMASK_U32(BF16_MSB_S, BF16_LSB_S)
+#define BF16_MASK_E GENMASK_U32(BF16_MSB_E, BF16_LSB_E)
+#define BF16_MASK_M GENMASK_U32(BF16_MSB_M, BF16_LSB_M)
+
+/* Exponent bias */
+#define FP32_BIAS_E 127
+#define FP16_BIAS_E 15
+#define BF16_BIAS_E 127
+
+#define FP32_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP32_LSB_S) | ((exponent) << FP32_LSB_E) | (mantissa))
+
+#define FP16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP16_LSB_S) | ((exponent) << FP16_LSB_E) | (mantissa))
+
+#define BF16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << BF16_LSB_S) | ((exponent) << BF16_LSB_E) | (mantissa))
+
+/* Represent float32 as float and uint32_t */
+union float32 {
+	float f;
+	uint32_t u;
+};
+
+__rte_weak int
+ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t i;
+	int32_t i32;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT8_MIN)
+			i32 = INT8_MIN;
+
+		if (i32 > INT8_MAX)
+			i32 = INT8_MAX;
+
+		*output_buffer = (int8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT8_MAX)
+			i32 = UINT8_MAX;
+
+		*output_buffer = (uint8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT16_MIN)
+			i32 = INT16_MIN;
+
+		if (i32 > INT16_MAX)
+			i32 = INT16_MAX;
+
+		*output_buffer = (int16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT16_MAX)
+			i32 = UINT16_MAX;
+
+		*output_buffer = (uint16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a half precision
+ * floating point number (float16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_float16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint32_t tmsb;	   /* MSB position of truncated bits */
+	uint32_t m_32;	   /* temporary float32 mantissa */
+	uint16_t m_16;	   /* temporary float16 mantissa */
+	uint16_t u16;	   /* float16 output */
+	int be_16;	   /* float16 biased exponent, signed */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	f16_s = f32_s;
+	f16_e = 0;
+	f16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		f16_e = 0;
+		if (f32_m == 0) /* zero */
+			f16_m = 0;
+		else /* subnormal number, convert to zero */
+			f16_m = 0;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		f16_e = FP16_MASK_E >> FP16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			f16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			f16_m = f32_m >> (FP32_MSB_M - FP16_MSB_M);
+			f16_m |= BIT(FP16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number */
+		/* compute biased exponent for float16 */
+		be_16 = (int)f32_e - FP32_BIAS_E + FP16_BIAS_E;
+
+		/* overflow, be_16 = [31-INF], set to infinity */
+		if (be_16 >= (int)(FP16_MASK_E >> FP16_LSB_E)) {
+			f16_e = FP16_MASK_E >> FP16_LSB_E;
+			f16_m = 0;
+		} else if ((be_16 >= 1) && (be_16 < (int)(FP16_MASK_E >> FP16_LSB_E))) {
+			/* normal float16, be_16 = [1:30]*/
+			f16_e = be_16;
+			m_16 = f32_m >> (FP32_LSB_E - FP16_LSB_E);
+			tmsb = FP32_MSB_M - FP16_MSB_M - 1;
+			if ((f32_m & GENMASK_U32(tmsb, 0)) > BIT(tmsb)) {
+				/* round: non-zero truncated bits except MSB */
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tmsb, 0)) == BIT(tmsb)) {
+				/* round: MSB of truncated bits and LSB of m_16 is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if ((be_16 >= -(int)(FP16_MSB_M)) && (be_16 < 1)) {
+			/* underflow: zero / subnormal, be_16 = [-9:0] */
+			f16_e = 0;
+
+			/* add the implicit leading one bit */
+			m_32 = f32_m | BIT(FP32_LSB_E);
+			tbits = FP32_LSB_E - FP16_LSB_E - be_16 + 1;
+			m_16 = m_32 >> tbits;
+
+			/* if non-leading truncated bits are set */
+			if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+				/* if leading truncated bit is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if (be_16 == -(int)(FP16_MSB_M + 1)) {
+			/* underflow: zero, be_16 = [-10] */
+			f16_e = 0;
+			if (f32_m != 0)
+				f16_m = 1;
+			else
+				f16_m = 0;
+		} else {
+			/* underflow: zero, be_16 = [-INF:-11] */
+			f16_e = 0;
+			f16_m = 0;
+		}
+
+		break;
+	}
+
+	u16 = FP16_PACK(f16_s, f16_e, f16_m);
+
+	return u16;
+}
+
+__rte_weak int
+ml_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_float16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a half precision floating point number (float16) into a single precision
+ * floating point number (float32).
+ */
+static float
+__float16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+	uint32_t clz;	   /* count of leading zeroes */
+	int e_16;	   /* float16 exponent unbiased */
+
+	f16_s = (f16 & FP16_MASK_S) >> FP16_LSB_S;
+	f16_e = (f16 & FP16_MASK_E) >> FP16_LSB_E;
+	f16_m = (f16 & FP16_MASK_M) >> FP16_LSB_M;
+
+	f32_s = f16_s;
+	switch (f16_e) {
+	case (FP16_MASK_E >> FP16_LSB_E): /* float16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (f16_m == 0x0) { /* infinity */
+			f32_m = f16_m;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = f16_m;
+			shift = FP32_MSB_M - FP16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* float16: zero or sub-normal */
+		f32_m = f16_m;
+		if (f16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			clz = __builtin_clz((uint32_t)f16_m) - sizeof(uint32_t) * 8 + FP16_LSB_E;
+			e_16 = (int)f16_e - clz;
+			f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+			shift = clz + (FP32_MSB_M - FP16_MSB_M) + 1;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+		}
+		break;
+	default: /* normal numbers */
+		f32_m = f16_m;
+		e_16 = (int)f16_e;
+		f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+		shift = (FP32_MSB_M - FP16_MSB_M);
+		f32_m = (f32_m << shift) & FP32_MASK_M;
+	}
+
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+ml_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a
+ * brain float number (bfloat16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_bfloat16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t b16_s;	   /* bfloat16 sign */
+	uint16_t b16_e;	   /* bfloat16 exponent */
+	uint16_t b16_m;	   /* bfloat16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint16_t u16;	   /* bfloat16 output */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	b16_s = f32_s;
+	b16_e = 0;
+	b16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		b16_e = 0;
+		if (f32_m == 0) /* zero */
+			b16_m = 0;
+		else /* subnormal float32 number, normal bfloat16 */
+			goto bf16_normal;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		b16_e = BF16_MASK_E >> BF16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			b16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			b16_m = f32_m >> (FP32_MSB_M - BF16_MSB_M);
+			b16_m |= BIT(BF16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number, normal bfloat16 */
+		goto bf16_normal;
+	}
+
+	goto bf16_pack;
+
+bf16_normal:
+	b16_e = f32_e;
+	tbits = FP32_MSB_M - BF16_MSB_M;
+	b16_m = f32_m >> tbits;
+
+	/* if non-leading truncated bits are set */
+	if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+		b16_m++;
+
+		/* if overflow into exponent */
+		if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+			b16_e++;
+	} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+		/* if only leading truncated bit is set */
+		if ((b16_m & 0x1) == 0x1) {
+			b16_m++;
+
+			/* if overflow into exponent */
+			if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+				b16_e++;
+		}
+	}
+	b16_m = b16_m & BF16_MASK_M;
+
+bf16_pack:
+	u16 = BF16_PACK(b16_s, b16_e, b16_m);
+
+	return u16;
+}
+
+__rte_weak int
+ml_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_bfloat16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a brain float number (bfloat16) into a
+ * single precision floating point number (float32).
+ */
+static float
+__bfloat16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t b16_s;	   /* bfloat16 sign */
+	uint16_t b16_e;	   /* bfloat16 exponent */
+	uint16_t b16_m;	   /* bfloat16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+
+	b16_s = (f16 & BF16_MASK_S) >> BF16_LSB_S;
+	b16_e = (f16 & BF16_MASK_E) >> BF16_LSB_E;
+	b16_m = (f16 & BF16_MASK_M) >> BF16_LSB_M;
+
+	f32_s = b16_s;
+	switch (b16_e) {
+	case (BF16_MASK_E >> BF16_LSB_E): /* bfloat16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (b16_m == 0x0) { /* infinity */
+			f32_m = 0;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = b16_m;
+			shift = FP32_MSB_M - BF16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* bfloat16: zero or subnormal */
+		f32_m = b16_m;
+		if (b16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			goto fp32_normal;
+		}
+		break;
+	default: /* bfloat16: normal number */
+		goto fp32_normal;
+	}
+
+	goto fp32_pack;
+
+fp32_normal:
+	f32_m = b16_m;
+	f32_e = FP32_BIAS_E + b16_e - BF16_BIAS_E;
+
+	shift = (FP32_MSB_M - BF16_MSB_M);
+	f32_m = (f32_m << shift) & FP32_MASK_M;
+
+fp32_pack:
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+ml_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __bfloat16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
diff --git a/drivers/common/ml/version.map b/drivers/common/ml/version.map
index 7e33755f2f..35f270f637 100644
--- a/drivers/common/ml/version.map
+++ b/drivers/common/ml/version.map
@@ -5,5 +5,21 @@ INTERNAL {
 	ml_io_type_to_str;
 	ml_io_format_to_str;

+	ml_float32_to_int8;
+	ml_int8_to_float32;
+	ml_float32_to_uint8;
+	ml_uint8_to_float32;
+
+	ml_float32_to_int16;
+	ml_int16_to_float32;
+	ml_float32_to_uint16;
+	ml_uint16_to_float32;
+
+	ml_float32_to_float16;
+	ml_float16_to_float32;
+
+	ml_float32_to_bfloat16;
+	ml_bfloat16_to_float32;
+
 	local: *;
 };
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread
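
A pair of worked tie cases makes the round-to-nearest-even path in
__float32_to_float16_scalar_rtn() above concrete. With tmsb = 12, a
truncated-bits value of BIT(12) = 0x1000 is exactly half an ULP; the
input values below are illustrative and not part of the patch:

	x = 1.0f + 0x1p-11f;	/* f32_m = 0x001000, m_16 = 0, exact tie; */
				/* LSB of m_16 is 0 -> keep, f16 = 1.0    */

	x = 1.0f + 0x3p-11f;	/* f32_m = 0x003000, m_16 = 1, exact tie; */
				/* LSB of m_16 is 1 -> round up to 2,     */
				/* f16 = 1.001953125                      */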

* [PATCH v2 4/4] common/ml: add Arm NEON type conversion routines
  2022-12-12 17:21 ` [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
                     ` (2 preceding siblings ...)
  2022-12-12 17:21   ` [PATCH v2 3/4] common/ml: add scalar type conversion functions Srikanth Yalavarthi
@ 2022-12-12 17:21   ` Srikanth Yalavarthi
  2022-12-13  9:04     ` Ruifeng Wang
  2022-12-20 17:52   ` [PATCH v3 0/4] implementation of ML common code Srikanth Yalavarthi
  4 siblings, 1 reply; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-12 17:21 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Ruifeng Wang; +Cc: dev, sshankarnara, jerinj, aprabhu

Added Arm NEON intrinsic-based implementations to support conversion
of data types. Support is enabled to handle int8, uint8, int16, uint16,
float16, float32 and bfloat16 types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v2:
* Dropped use of driver routines to call neon functions
* Optimized neon functions to reduce the number of intrinsic calls.

 drivers/common/ml/meson.build     |   4 +
 drivers/common/ml/ml_utils_neon.c | 873 ++++++++++++++++++++++++++++++
 2 files changed, 877 insertions(+)
 create mode 100644 drivers/common/ml/ml_utils_neon.c

diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
index 59b58b95b4..7939cb7a64 100644
--- a/drivers/common/ml/meson.build
+++ b/drivers/common/ml/meson.build
@@ -16,6 +16,10 @@ sources = files(
         'ml_utils_scalar.c',
 )

+if arch_subdir == 'arm'
+    sources += files('ml_utils_neon.c')
+endif
+
 deps += ['mldev']

 pmd_supports_disable_iova_as_pa = true
diff --git a/drivers/common/ml/ml_utils_neon.c b/drivers/common/ml/ml_utils_neon.c
new file mode 100644
index 0000000000..4acf13123c
--- /dev/null
+++ b/drivers/common/ml/ml_utils_neon.c
@@ -0,0 +1,873 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+#include "ml_utils.h"
+
+#include <arm_neon.h>
+
+/* Description:
+ * This file implements vector versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa. Implementation is based on Arm
+ * Neon intrinsics.
+ */
+
+static inline void
+__float32_to_int8_neon_s8x8(float scale, float *input, int8_t *output)
+{
+	int16x4_t s16x4_l;
+	int16x4_t s16x4_h;
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_l = vqmovn_s32(s32x4);
+
+	/* load next 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_h = vqmovn_s32(s32x4);
+
+	/* combine lower and higher int16x4_t to int16x8_t */
+	s16x8 = vcombine_s16(s16x4_l, s16x4_h);
+
+	/* narrow to int8_t */
+	s8x8 = vqmovn_s16(s16x8);
+
+	/* store 8 elements */
+	vst1_s8(output, s8x8);
+}
+
+static inline void
+__float32_to_int8_neon_s8x1(float scale, float *input, int8_t *output)
+{
+	int32_t s32;
+	int16_t s16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	s16 = vqmovns_s32(s32);
+
+	/* convert to int8_t */
+	*output = vqmovnh_s16(s16);
+}
+
+int
+ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int8_neon_s8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int8_neon_s8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__int8_to_float32_neon_f32x8(float scale, int8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 8 x int8_t elements */
+	s8x8 = vld1_s8(input);
+
+	/* widen int8_t to int16_t */
+	s16x8 = vmovl_s8(s8x8);
+
+	/* convert lower 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_low_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_high_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__int8_to_float32_neon_f32x1(float scale, int8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+ml_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_uint8_neon_u8x8(float scale, float *input, uint8_t *output)
+{
+	uint16x4_t u16x4_l;
+	uint16x4_t u16x4_h;
+	float32x4_t f32x4;
+	uint32x4_t u32x4;
+	uint16x8_t u16x8;
+	uint8x8_t u8x8;
+
+	/* load 4 float elements, scale, convert, saturate narrow to uint16_t.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_l = vqmovn_u32(u32x4);
+
+	/* load next 4 float elements, scale, convert, saturate narrow to uint16_t.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_h = vqmovn_u32(u32x4);
+
+	/* combine lower and higher uint16x4_t */
+	u16x8 = vcombine_u16(u16x4_l, u16x4_h);
+
+	/* narrow to uint8x8_t */
+	u8x8 = vqmovn_u16(u16x8);
+
+	/* store 8 elements */
+	vst1_u8(output, u8x8);
+}
+
+static inline void
+__float32_to_uint8_neon_u8x1(float scale, float *input, uint8_t *output)
+{
+	uint32_t u32;
+	uint16_t u16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	u16 = vqmovns_u32(u32);
+
+	/* convert to uint8_t */
+	*output = vqmovnh_u16(u16);
+}
+
+int
+ml_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint8_neon_u8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint8_neon_u8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint8_to_float32_neon_f32x8(float scale, uint8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x8_t u16x8;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+	uint8x8_t u8x8;
+
+	/* load 8 x uint8_t elements */
+	u8x8 = vld1_u8(input);
+
+	/* widen uint8_t to uint16_t */
+	u16x8 = vmovl_u8(u8x8);
+
+	/* convert lower 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_low_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_high_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__uint8_to_float32_neon_f32x1(float scale, uint8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+ml_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_int16_neon_s16x4(float scale, float *input, int16_t *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert to int32x4_t using round to nearest with ties away rounding mode */
+	s32x4 = vcvtaq_s32_f32(f32x4);
+
+	/* saturate narrow to int16x4_t */
+	s16x4 = vqmovn_s32(s32x4);
+
+	/* store 4 elements */
+	vst1_s16(output, s16x4);
+}
+
+static inline void
+__float32_to_int16_neon_s16x1(float scale, float *input, int16_t *output)
+{
+	int32_t s32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_s32(s32);
+}
+
+int
+ml_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int16_neon_s16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int16_neon_s16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__int16_to_float32_neon_f32x4(float scale, int16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x int16_t elements */
+	s16x4 = vld1_s16(input);
+
+	/* widen int16_t to int32_t */
+	s32x4 = vmovl_s16(s16x4);
+
+	/* convert int32_t to float */
+	f32x4 = vcvtq_f32_s32(s32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__int16_to_float32_neon_f32x1(float scale, int16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+ml_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_uint16_neon_u16x4(float scale, float *input, uint16_t *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert using round to nearest with ties to away rounding mode */
+	u32x4 = vcvtaq_u32_f32(f32x4);
+
+	/* saturate narrow */
+	u16x4 = vqmovn_u32(u32x4);
+
+	/* store 4 elements */
+	vst1_u16(output, u16x4);
+}
+
+static inline void
+__float32_to_uint16_neon_u16x1(float scale, float *input, uint16_t *output)
+{
+	uint32_t u32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_u32(u32);
+}
+
+int
+ml_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint16_neon_u16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint16_neon_u16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint16_to_float32_neon_f32x4(float scale, uint16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 x uint16_t elements */
+	u16x4 = vld1_u16(input);
+
+	/* widen uint16_t to uint32_t */
+	u32x4 = vmovl_u16(u16x4);
+
+	/* convert uint32_t to float */
+	f32x4 = vcvtq_f32_u32(u32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__uint16_to_float32_neon_f32x1(float scale, uint16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+ml_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_float16_neon_f16x4(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert to float16x4_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store float16x4_t */
+	vst1_f16(output, f16x4);
+}
+
+static inline void
+__float32_to_float16_neon_f16x1(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to float16_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store 1 element from lane 0 */
+	vst1_lane_f16(output, f16x4, 0);
+}
+
+int
+ml_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	float16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (float16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_float16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_float16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float16_to_float32_neon_f32x4(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x float16_t elements */
+	f16x4 = vld1_f16(input);
+
+	/* convert float16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__float16_to_float32_neon_f32x1(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	f16x4 = vld1_dup_f16(input);
+
+	/* convert float16_t to float32_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+ml_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	float16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#ifdef __ARM_FEATURE_BF16
+
+static inline void
+__float32_to_bfloat16_neon_f16x4(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert float32x4_t to bfloat16x4_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store bfloat16x4_t */
+	vst1_bf16(output, bf16x4);
+}
+
+static inline void
+__float32_to_bfloat16_neon_f16x1(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to bfloat16_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store 1 element from lane 0 */
+	vst1_lane_bf16(output, bf16x4, 0);
+}
+
+int
+ml_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	bfloat16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (bfloat16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_bfloat16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_bfloat16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x4(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x bfloat16_t elements */
+	bf16x4 = vld1_bf16(input);
+
+	/* convert bfloat16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x1(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	bf16x4 = vld1_dup_bf16(input);
+
+	/* convert bfloat16_t to float32_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store 1 element from lane 0 */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+ml_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	bfloat16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (bfloat16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__bfloat16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__bfloat16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#endif /* __ARM_FEATURE_BF16 */
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread
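
A concrete check of the vector / leftover split in the loops above
(the element count is illustrative): for ml_float32_to_int8(),
vlen = 2 * sizeof(float) / sizeof(int8_t) = 8, so nb_elements = 19
gives nb_iterations = 19 / 8 = 2. The __float32_to_int8_neon_s8x8()
kernel converts 16 elements, i is then advanced to 16, and the last
3 elements go through the __float32_to_int8_neon_s8x1() scalar tail.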

* RE: [PATCH v1 4/4] common/ml: add Arm NEON type conversion routines
  2022-12-12  7:16   ` Ruifeng Wang
@ 2022-12-12 17:25     ` Srikanth Yalavarthi
  0 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-12 17:25 UTC (permalink / raw)
  To: Ruifeng Wang
  Cc: dev, Shivah Shankar Shankar Narayan Rao,
	Jerin Jacob Kollanukkaran, Anup Prabhu, nd, Srikanth Yalavarthi

> -----Original Message-----
> From: Ruifeng Wang <Ruifeng.Wang@arm.com>
> Sent: 12 December 2022 12:46
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Cc: dev@dpdk.org; Shivah Shankar Shankar Narayan Rao
> <sshankarnara@marvell.com>; Jerin Jacob Kollanukkaran
> <jerinj@marvell.com>; Anup Prabhu <aprabhu@marvell.com>; nd
> <nd@arm.com>; Srikanth Yalavarthi <syalavarthi@marvell.com>
> Subject: [EXT] RE: [PATCH v1 4/4] common/ml: add Arm NEON type
> conversion routines
> 
> External Email
> 
> ----------------------------------------------------------------------
> > -----Original Message-----
> > From: Srikanth Yalavarthi <syalavarthi@marvell.com>
> > Sent: Friday, December 9, 2022 3:36 AM
> > To: Srikanth Yalavarthi <syalavarthi@marvell.com>; Ruifeng Wang
> > <Ruifeng.Wang@arm.com>
> > Cc: dev@dpdk.org; sshankarnara@marvell.com; jerinj@marvell.com;
> > aprabhu@marvell.com
> > Subject: [PATCH v1 4/4] common/ml: add Arm NEON type conversion
> > routines
> >
> > Added Arm NEON intrinsic-based implementations to support conversion
> > of data types. Support is enabled to handle int8, uint8, int16,
> > uint16, float16, float32 and bfloat16 types.
> >
> > Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
> > ---
> >  drivers/common/ml/meson.build     |   5 +
> >  drivers/common/ml/ml_utils.c      |  48 ++
> >  drivers/common/ml/ml_utils_neon.c | 950 ++++++++++++++++++++++++++++++
> >  drivers/common/ml/ml_utils_neon.h |  23 +
> >  4 files changed, 1026 insertions(+)
> >  create mode 100644 drivers/common/ml/ml_utils_neon.c  create mode
> > 100644 drivers/common/ml/ml_utils_neon.h
> >
> > diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
> > index 84ae84ee4e..f7ce19b4b4 100644
> > --- a/drivers/common/ml/meson.build
> > +++ b/drivers/common/ml/meson.build
> > @@ -17,6 +17,11 @@ sources = files(
> >          'ml_utils_generic.c',
> >  )
> >
> > +if arch_subdir == 'arm'
> > +    headers += files('ml_utils_neon.h')
> > +    sources += files('ml_utils_neon.c')
> > +endif
> > +
> >  deps += ['mldev']
> >
> >  pmd_supports_disable_iova_as_pa = true
> >
> > diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
> > index e2edef0904..3edcf09fde 100644
> > --- a/drivers/common/ml/ml_utils.c
> > +++ b/drivers/common/ml/ml_utils.c
> > @@ -120,71 +120,119 @@ ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
> >  int
> >  ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
> >  {
> > +#if defined(__ARM_NEON__)
> > +	return ml_float32_to_int8_neon(scale, nb_elements, input, output);
> > +#else
> >  	return ml_float32_to_int8_generic(scale, nb_elements, input, output);
> > +#endif
> >  }
> >
> Maybe __rte_weak can be used to remove the ifdef clutter.
> 
> Something like:
> ml_utils.c:
> __rte_weak int
> ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
> {
> 	return ml_float32_to_int8_generic(scale, nb_elements, input, output);
> }
>
> ml_utils_neon.c:
> int
> ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
> {
> 	return ml_float32_to_int8_neon(scale, nb_elements, input, output);
> }
> 
Updated the implementation in the common/ml series. The scalar / generic routines are now weak symbols.

> <snip>
> > diff --git a/drivers/common/ml/ml_utils_neon.c
> > b/drivers/common/ml/ml_utils_neon.c
> > new file mode 100644
> > index 0000000000..b660de07ec
> > --- /dev/null
> > +++ b/drivers/common/ml/ml_utils_neon.c
> > @@ -0,0 +1,950 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright (c) 2022 Marvell.
> > + */
> > +
> > +#include <errno.h>
> > +#include <math.h>
> > +#include <stdint.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_vect.h>
> > +
> > +#include "ml_utils.h"
> > +#include "ml_utils_neon.h"
> > +
> > +#include <arm_neon.h>
> This line can be removed. arm_neon.h is already included by rte_vect.h.
Done
> 
> Thanks.
> <snip>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v2 4/4] common/ml: add Arm NEON type conversion routines
  2022-12-12 17:21   ` [PATCH v2 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
@ 2022-12-13  9:04     ` Ruifeng Wang
  0 siblings, 0 replies; 59+ messages in thread
From: Ruifeng Wang @ 2022-12-13  9:04 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu, nd

> -----Original Message-----
> From: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Sent: Tuesday, December 13, 2022 1:21 AM
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; sshankarnara@marvell.com; jerinj@marvell.com; aprabhu@marvell.com
> Subject: [PATCH v2 4/4] common/ml: add Arm NEON type conversion routines
> 
> Added Arm NEON intrinsic-based implementations to support conversion of data types.
> Support is enabled to handle int8, uint8, int16, uint16, float16, float32 and bfloat16
> types.
> 
> Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
> ---
> v2:
> * Dropped use of driver routines to call neon functions
> * Optimized neon functions to reduce the number of intrinsic calls.
> 
>  drivers/common/ml/meson.build     |   4 +
>  drivers/common/ml/ml_utils_neon.c | 873 ++++++++++++++++++++++++++++++
>  2 files changed, 877 insertions(+)
>  create mode 100644 drivers/common/ml/ml_utils_neon.c
> 
Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 0/4] implementation of ML common code
  2022-12-12 17:21 ` [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
                     ` (3 preceding siblings ...)
  2022-12-12 17:21   ` [PATCH v2 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
@ 2022-12-20 17:52   ` Srikanth Yalavarthi
  2022-12-20 17:52     ` [PATCH v3 1/4] common/ml: add initial files for " Srikanth Yalavarthi
                       ` (5 more replies)
  4 siblings, 6 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-20 17:52 UTC (permalink / raw)
  Cc: dev, sshankarnara, jerinj, aprabhu, Srikanth Yalavarthi

Machine Learning common code
----------------------------

This patch series implements the common ML code that can be used by
ML drivers. The common code includes functions to convert an ML IO
type or IO format type to a string, a function to get the size of an
ML IO type, and functions for converting data types from higher
precision to lower precision and vice-versa.

The data type conversion functions support float32, float16, bfloat16,
uint8, int8, uint16 and int16. Two versions of the conversion functions
are implemented in the series: a generic scalar version and a vector
version using Arm NEON intrinsics. When compiling DPDK for a platform
that supports Arm NEON, the NEON version of the routines is enabled.
Compilation falls back to the generic scalar versions on platforms
such as x86_64 and PowerPC that don't support Arm NEON.
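
As a usage reference, a minimal quantize / dequantize round trip
through these routines could look as below. The scale value, buffer
sizes and printing are illustrative assumptions; only the function
signatures come from this series, and the symbols are internal, meant
to be called from driver code:

#include <stdint.h>
#include <stdio.h>

#include "ml_utils.h"

int
main(void)
{
	float in[8] = {-1.5f, -0.5f, 0.0f, 0.25f, 0.5f, 0.75f, 1.0f, 1.5f};
	int8_t q[8];
	float out[8];

	/* quantize: int8 = saturate(round(float32 * scale)) */
	if (ml_float32_to_int8(64.0f, 8, in, q) != 0)
		return 1;

	/* dequantize: float32 = int8 * scale, here with scale = 1/64 */
	if (ml_int8_to_float32(1.0f / 64.0f, 8, q, out) != 0)
		return 1;

	printf("%.4f -> %d -> %.4f\n", (double)in[7], q[7], (double)out[7]);
	/* prints: 1.5000 -> 96 -> 1.5000 */
	return 0;
}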


Srikanth Yalavarthi (4):
  common/ml: add initial files for ML common code
  common/ml: add common utility functions
  common/ml: add scalar type conversion functions
  common/ml: add Arm NEON type conversion routines

 MAINTAINERS                         |   8 +
 drivers/common/meson.build          |   1 +
 drivers/common/ml/meson.build       |  25 +
 drivers/common/ml/ml_utils.c        | 118 ++++
 drivers/common/ml/ml_utils.h        | 283 +++++++++
 drivers/common/ml/ml_utils_neon.c   | 873 ++++++++++++++++++++++++++++
 drivers/common/ml/ml_utils_scalar.c | 720 +++++++++++++++++++++++
 drivers/common/ml/version.map       |  25 +
 8 files changed, 2053 insertions(+)
 create mode 100644 drivers/common/ml/meson.build
 create mode 100644 drivers/common/ml/ml_utils.c
 create mode 100644 drivers/common/ml/ml_utils.h
 create mode 100644 drivers/common/ml/ml_utils_neon.c
 create mode 100644 drivers/common/ml/ml_utils_scalar.c
 create mode 100644 drivers/common/ml/version.map

--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 1/4] common/ml: add initial files for ML common code
  2022-12-20 17:52   ` [PATCH v3 0/4] implementation of ML common code Srikanth Yalavarthi
@ 2022-12-20 17:52     ` Srikanth Yalavarthi
  2022-12-20 19:04       ` Stephen Hemminger
  2022-12-20 17:52     ` [PATCH v3 2/4] common/ml: add common utility functions Srikanth Yalavarthi
                       ` (4 subsequent siblings)
  5 siblings, 1 reply; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-20 17:52 UTC (permalink / raw)
  To: Thomas Monjalon, Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Added ML common header files and skeleton code. The common ML code
includes utility routines to convert an ML IO type or format to a
string, to get the size of an IO type, and to convert data types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
Depends-on: series-26046 ("app/mldev: implement test framework for mldev")

v3:
* Skip installation of internal common/ml headers

v2:
* Moved implementation out of patch. Only headers are included.

 MAINTAINERS                   |   8 +
 drivers/common/meson.build    |   1 +
 drivers/common/ml/meson.build |  20 +++
 drivers/common/ml/ml_utils.c  |   5 +
 drivers/common/ml/ml_utils.h  | 283 ++++++++++++++++++++++++++++++++++
 5 files changed, 317 insertions(+)
 create mode 100644 drivers/common/ml/meson.build
 create mode 100644 drivers/common/ml/ml_utils.c
 create mode 100644 drivers/common/ml/ml_utils.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 5fa276fafa..6412209bff 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1431,6 +1431,14 @@ F: drivers/raw/dpaa2_cmdif/
 F: doc/guides/rawdevs/dpaa2_cmdif.rst


+ML Device Drivers
+------------------------
+
+ML common code
+M: Srikanth Yalavarthi <syalavarthi@marvell.com>
+F: drivers/common/ml/
+
+
 Packet processing
 -----------------

diff --git a/drivers/common/meson.build b/drivers/common/meson.build
index b63d899d50..0878dde0a0 100644
--- a/drivers/common/meson.build
+++ b/drivers/common/meson.build
@@ -9,4 +9,5 @@ drivers = [
         'idpf',
         'mvep',
         'octeontx',
+        'ml',
 ]
diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
new file mode 100644
index 0000000000..b0ecc42668
--- /dev/null
+++ b/drivers/common/ml/meson.build
@@ -0,0 +1,20 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright (c) 2022 Marvell.
+
+if not is_linux or not dpdk_conf.get('RTE_ARCH_64')
+    build = false
+    reason = 'only supported on 64-bit Linux'
+    subdir_done()
+endif
+
+driver_sdk_headers = files(
+        'ml_utils.h',
+)
+
+sources = files(
+        'ml_utils.c',
+)
+
+deps += ['mldev']
+
+pmd_supports_disable_iova_as_pa = true
diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
new file mode 100644
index 0000000000..90bc280e4b
--- /dev/null
+++ b/drivers/common/ml/ml_utils.c
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include "ml_utils.h"
diff --git a/drivers/common/ml/ml_utils.h b/drivers/common/ml/ml_utils.h
new file mode 100644
index 0000000000..9726c6e3b5
--- /dev/null
+++ b/drivers/common/ml/ml_utils.h
@@ -0,0 +1,283 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _ML_UTILS_H_
+#define _ML_UTILS_H_
+
+#include <rte_compat.h>
+#include <rte_mldev.h>
+
+/**
+ * Get the size of an ML IO type in bytes.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ *
+ * @return
+ *	- > 0, Size of the data type in bytes.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_io_type_size_get(enum rte_ml_io_type type);
+
+/**
+ * Get the name of an ML IO type.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ * @param[out] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len);
+
+/**
+ * Get the name of an ML IO format.
+ *
+ * @param[in] format
+ *	Enumeration of ML IO format.
+ * @param[out] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed 8-bit
+ * integer format (INT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in signed 8-bit integer format (INT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 8-bit integer format (UINT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in unsigned 8-bit integer format (UINT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed
+ * 16-bit integer format (INT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in signed 16-bit integer format (INT16) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 16-bit integer format (UINT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in unsigned 16-bit integer format (UINT16) to single
+ * precision floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to half
+ * precision floating point format (FP16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_float16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in half precision floating format (FP16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in single precision floating format (float32) to brain
+ * floating point format (bfloat16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store bfloat16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * Convert a buffer containing numbers in brain floating point format (bfloat16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing bfloat16 numbers. Size of buffer is equal to (nb_elements * 2)
+ * bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int ml_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+#endif /* _ML_UTILS_H_ */
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 2/4] common/ml: add common utility functions
  2022-12-20 17:52   ` [PATCH v3 0/4] implementation of ML common code Srikanth Yalavarthi
  2022-12-20 17:52     ` [PATCH v3 1/4] common/ml: add initial files for " Srikanth Yalavarthi
@ 2022-12-20 17:52     ` Srikanth Yalavarthi
  2022-12-20 17:52     ` [PATCH v3 3/4] common/ml: add scalar type conversion functions Srikanth Yalavarthi
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-20 17:52 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Implemented common ML utility functions to convert an IO data type
to its name, an IO format to its name, and a routine to get the size
of an IO data type in bytes.
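
For example (a minimal sketch: the buffer size is arbitrary, only the
helper signatures and return values come from this patch):

	char name[64];
	int size;

	ml_io_type_to_str(RTE_ML_IO_TYPE_FP16, name, sizeof(name));
	size = ml_io_type_size_get(RTE_ML_IO_TYPE_FP16);
	/* name = "float16", size = 2 */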

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v2:
* Implemented common utility functions as part of the patch
* Dropped use of driver routines for data conversion functions

 drivers/common/ml/ml_utils.c  | 113 ++++++++++++++++++++++++++++++++++
 drivers/common/ml/version.map |   9 +++
 2 files changed, 122 insertions(+)
 create mode 100644 drivers/common/ml/version.map

diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
index 90bc280e4b..59753c5468 100644
--- a/drivers/common/ml/ml_utils.c
+++ b/drivers/common/ml/ml_utils.c
@@ -2,4 +2,117 @@
  * Copyright (c) 2022 Marvell.
  */

+#include <errno.h>
+#include <stdint.h>
+
+#include <rte_mldev.h>
+#include <rte_string_fns.h>
+
 #include "ml_utils.h"
+
+/* Description:
+ * This file implements Machine Learning utility routines, except type conversion routines.
+ */
+
+int
+ml_io_type_size_get(enum rte_ml_io_type type)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		return -EINVAL;
+	case RTE_ML_IO_TYPE_INT8:
+		return sizeof(int8_t);
+	case RTE_ML_IO_TYPE_UINT8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_INT16:
+		return sizeof(int16_t);
+	case RTE_ML_IO_TYPE_UINT16:
+		return sizeof(uint16_t);
+	case RTE_ML_IO_TYPE_INT32:
+		return sizeof(int32_t);
+	case RTE_ML_IO_TYPE_UINT32:
+		return sizeof(uint32_t);
+	case RTE_ML_IO_TYPE_FP8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_FP16:
+		return sizeof(uint8_t) * 2;
+	case RTE_ML_IO_TYPE_FP32:
+		return sizeof(uint8_t) * 4;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		return sizeof(uint8_t) * 2;
+	default:
+		return -EINVAL;
+	}
+}
+
+void
+ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		rte_strlcpy(str, "unknown", len);
+		break;
+	case RTE_ML_IO_TYPE_INT8:
+		rte_strlcpy(str, "int8", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT8:
+		rte_strlcpy(str, "uint8", len);
+		break;
+	case RTE_ML_IO_TYPE_INT16:
+		rte_strlcpy(str, "int16", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT16:
+		rte_strlcpy(str, "uint16", len);
+		break;
+	case RTE_ML_IO_TYPE_INT32:
+		rte_strlcpy(str, "int32", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT32:
+		rte_strlcpy(str, "uint32", len);
+		break;
+	case RTE_ML_IO_TYPE_FP8:
+		rte_strlcpy(str, "float8", len);
+		break;
+	case RTE_ML_IO_TYPE_FP16:
+		rte_strlcpy(str, "float16", len);
+		break;
+	case RTE_ML_IO_TYPE_FP32:
+		rte_strlcpy(str, "float32", len);
+		break;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		rte_strlcpy(str, "bfloat16", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
+
+void
+ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
+{
+	switch (format) {
+	case RTE_ML_IO_FORMAT_NCHW:
+		rte_strlcpy(str, "NCHW", len);
+		break;
+	case RTE_ML_IO_FORMAT_NHWC:
+		rte_strlcpy(str, "NHWC", len);
+		break;
+	case RTE_ML_IO_FORMAT_CHWN:
+		rte_strlcpy(str, "CHWN", len);
+		break;
+	case RTE_ML_IO_FORMAT_3D:
+		rte_strlcpy(str, "3D", len);
+		break;
+	case RTE_ML_IO_FORMAT_2D:
+		rte_strlcpy(str, "Matrix", len);
+		break;
+	case RTE_ML_IO_FORMAT_1D:
+		rte_strlcpy(str, "Vector", len);
+		break;
+	case RTE_ML_IO_FORMAT_SCALAR:
+		rte_strlcpy(str, "Scalar", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
diff --git a/drivers/common/ml/version.map b/drivers/common/ml/version.map
new file mode 100644
index 0000000000..7e33755f2f
--- /dev/null
+++ b/drivers/common/ml/version.map
@@ -0,0 +1,9 @@
+INTERNAL {
+	global:
+
+	ml_io_type_size_get;
+	ml_io_type_to_str;
+	ml_io_format_to_str;
+
+	local: *;
+};
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 3/4] common/ml: add scalar type conversion functions
  2022-12-20 17:52   ` [PATCH v3 0/4] implementation of ML common code Srikanth Yalavarthi
  2022-12-20 17:52     ` [PATCH v3 1/4] common/ml: add initial files for " Srikanth Yalavarthi
  2022-12-20 17:52     ` [PATCH v3 2/4] common/ml: add common utility functions Srikanth Yalavarthi
@ 2022-12-20 17:52     ` Srikanth Yalavarthi
  2022-12-20 17:52     ` [PATCH v3 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-20 17:52 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Added scalar implementations to support conversion of data types.
Support is enabled to handle int8, uint8, int16, uint16, float16,
float32 and bfloat16 types.
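
For reference, a minimal usage sketch of the quantize/dequantize pairs
(illustrative caller code, not part of this patch; the scale and buffer
values are made up, the function names are those exported in version.map):

	#include <stdint.h>

	#include "ml_utils.h"

	static int
	quantize_roundtrip(void)
	{
		float in[8] = {-1.5f, -1.0f, -0.5f, 0.0f, 0.5f, 1.0f, 1.5f, 2.0f};
		float out[8];
		int8_t q[8];
		float scale = 64.0f; /* int8 steps per unit */
		int ret;

		/* q[i] = saturate(round(in[i] * scale)) */
		ret = ml_float32_to_int8(scale, 8, in, q);
		if (ret != 0)
			return ret;

		/* out[i] = (1 / scale) * q[i], recovers in[i] approximately */
		return ml_int8_to_float32(1.0f / scale, 8, q, out);
	}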

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v2:
* Updated internal function names
* Updated function attributes to __rte_weak

 drivers/common/ml/meson.build       |   1 +
 drivers/common/ml/ml_utils_scalar.c | 720 ++++++++++++++++++++++++++++
 drivers/common/ml/version.map       |  16 +
 3 files changed, 737 insertions(+)
 create mode 100644 drivers/common/ml/ml_utils_scalar.c

diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
index b0ecc42668..271aa9c33a 100644
--- a/drivers/common/ml/meson.build
+++ b/drivers/common/ml/meson.build
@@ -13,6 +13,7 @@ driver_sdk_headers = files(

 sources = files(
         'ml_utils.c',
+        'ml_utils_scalar.c',
 )

 deps += ['mldev']
diff --git a/drivers/common/ml/ml_utils_scalar.c b/drivers/common/ml/ml_utils_scalar.c
new file mode 100644
index 0000000000..1272d67593
--- /dev/null
+++ b/drivers/common/ml/ml_utils_scalar.c
@@ -0,0 +1,720 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <math.h>
+#include <stdint.h>
+
+#include "ml_utils.h"
+
+/* Description:
+ * This file implements scalar versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa.
+ */
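+
+/* For example (illustrative): with scale = 100, ml_float32_to_int8() maps
+ * 0.005f to (int8_t)round(0.5) = 1 (ties round away from zero) and 2.0f to
+ * 127 (saturated at INT8_MAX).
+ */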
+
+#ifndef BIT
+#define BIT(nr) (1UL << (nr))
+#endif
+
+#ifndef BITS_PER_LONG
+#define BITS_PER_LONG (__SIZEOF_LONG__ * 8)
+#endif
+
+#ifndef GENMASK_U32
+#define GENMASK_U32(h, l) (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+#endif
+
+/* float32: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP32_LSB_M 0
+#define FP32_MSB_M 22
+#define FP32_LSB_E 23
+#define FP32_MSB_E 30
+#define FP32_LSB_S 31
+#define FP32_MSB_S 31
+
+/* float32: bitmask for sign, exponent and mantissa */
+#define FP32_MASK_S GENMASK_U32(FP32_MSB_S, FP32_LSB_S)
+#define FP32_MASK_E GENMASK_U32(FP32_MSB_E, FP32_LSB_E)
+#define FP32_MASK_M GENMASK_U32(FP32_MSB_M, FP32_LSB_M)
+
+/* float16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP16_LSB_M 0
+#define FP16_MSB_M 9
+#define FP16_LSB_E 10
+#define FP16_MSB_E 14
+#define FP16_LSB_S 15
+#define FP16_MSB_S 15
+
+/* float16: bitmask for sign, exponent and mantissa */
+#define FP16_MASK_S GENMASK_U32(FP16_MSB_S, FP16_LSB_S)
+#define FP16_MASK_E GENMASK_U32(FP16_MSB_E, FP16_LSB_E)
+#define FP16_MASK_M GENMASK_U32(FP16_MSB_M, FP16_LSB_M)
+
+/* bfloat16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define BF16_LSB_M 0
+#define BF16_MSB_M 6
+#define BF16_LSB_E 7
+#define BF16_MSB_E 14
+#define BF16_LSB_S 15
+#define BF16_MSB_S 15
+
+/* bfloat16: bitmask for sign, exponent and mantissa */
+#define BF16_MASK_S GENMASK_U32(BF16_MSB_S, BF16_LSB_S)
+#define BF16_MASK_E GENMASK_U32(BF16_MSB_E, BF16_LSB_E)
+#define BF16_MASK_M GENMASK_U32(BF16_MSB_M, BF16_LSB_M)
+
+/* Exponent bias */
+#define FP32_BIAS_E 127
+#define FP16_BIAS_E 15
+#define BF16_BIAS_E 127
+
+#define FP32_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP32_LSB_S) | ((exponent) << FP32_LSB_E) | (mantissa))
+
+#define FP16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP16_LSB_S) | ((exponent) << FP16_LSB_E) | (mantissa))
+
+#define BF16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << BF16_LSB_S) | ((exponent) << BF16_LSB_E) | (mantissa))
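+
+/* Example (illustrative): 1.0 in float16 is FP16_PACK(0, FP16_BIAS_E, 0) = 0x3c00;
+ * -2.0 is FP16_PACK(1, 16, 0) = 0xc000.
+ */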
+
+/* Represent float32 as float and uint32_t */
+union float32 {
+	float f;
+	uint32_t u;
+};
+
+__rte_weak int
+ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t i;
+	int i32;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT8_MIN)
+			i32 = INT8_MIN;
+
+		if (i32 > INT8_MAX)
+			i32 = INT8_MAX;
+
+		*output_buffer = (int8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT8_MAX)
+			i32 = UINT8_MAX;
+
+		*output_buffer = (uint8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT16_MIN)
+			i32 = INT16_MIN;
+
+		if (i32 > INT16_MAX)
+			i32 = INT16_MAX;
+
+		*output_buffer = (int16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT16_MAX)
+			i32 = UINT16_MAX;
+
+		*output_buffer = (uint16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+ml_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a half precision
+ * floating point number (float16) using round to nearest rounding mode.
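+ *
+ * For example (illustrative): 65536.0f has biased float16 exponent
+ * be_16 = (127 + 16) - 127 + 15 = 31, which overflows the float16
+ * exponent range and saturates to infinity (0x7c00).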
+ */
+static uint16_t
+__float32_to_float16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint32_t tmsb;	   /* MSB position of truncated bits */
+	uint32_t m_32;	   /* temporary float32 mantissa */
+	uint16_t m_16;	   /* temporary float16 mantissa */
+	uint16_t u16;	   /* float16 output */
+	int be_16;	   /* float16 biased exponent, signed */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	f16_s = f32_s;
+	f16_e = 0;
+	f16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		f16_e = 0;
+		f16_m = 0; /* zero and subnormal numbers both map to zero */
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		f16_e = FP16_MASK_E >> FP16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			f16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			f16_m = f32_m >> (FP32_MSB_M - FP16_MSB_M);
+			f16_m |= BIT(FP16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number */
+		/* compute biased exponent for float16 */
+		be_16 = (int)f32_e - FP32_BIAS_E + FP16_BIAS_E;
+
+		/* overflow, be_16 = [31-INF], set to infinity */
+		if (be_16 >= (int)(FP16_MASK_E >> FP16_LSB_E)) {
+			f16_e = FP16_MASK_E >> FP16_LSB_E;
+			f16_m = 0;
+		} else if ((be_16 >= 1) && (be_16 < (int)(FP16_MASK_E >> FP16_LSB_E))) {
+			/* normal float16, be_16 = [1:30] */
+			f16_e = be_16;
+			m_16 = f32_m >> (FP32_LSB_E - FP16_LSB_E);
+			tmsb = FP32_MSB_M - FP16_MSB_M - 1;
+			if ((f32_m & GENMASK_U32(tmsb, 0)) > BIT(tmsb)) {
+				/* round: non-zero truncated bits except MSB */
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tmsb, 0)) == BIT(tmsb)) {
+				/* round: MSB of truncated bits and LSB of m_16 is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if ((be_16 >= -(int)(FP16_MSB_M)) && (be_16 < 1)) {
+			/* underflow: zero / subnormal, be_16 = [-9:0] */
+			f16_e = 0;
+
+			/* add implicit leading zero */
+			m_32 = f32_m | BIT(FP32_LSB_E);
+			tbits = FP32_LSB_E - FP16_LSB_E - be_16 + 1;
+			m_16 = m_32 >> tbits;
+
+			/* if non-leading truncated bits are set */
+			if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+				/* if leading truncated bit is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if (be_16 == -(int)(FP16_MSB_M + 1)) {
+			/* underflow: zero, be_16 = [-10] */
+			f16_e = 0;
+			if (f32_m != 0)
+				f16_m = 1;
+			else
+				f16_m = 0;
+		} else {
+			/* underflow: zero, be_16 = [-INF:-11] */
+			f16_e = 0;
+			f16_m = 0;
+		}
+
+		break;
+	}
+
+	u16 = FP16_PACK(f16_s, f16_e, f16_m);
+
+	return u16;
+}
+
+__rte_weak int
+ml_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_float16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a half precision floating point number (float16) into a single precision
+ * floating point number (float32).
+ */
+static float
+__float16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+	uint32_t clz;	   /* count of leading zeroes */
+	int e_16;	   /* float16 exponent unbiased */
+
+	f16_s = (f16 & FP16_MASK_S) >> FP16_LSB_S;
+	f16_e = (f16 & FP16_MASK_E) >> FP16_LSB_E;
+	f16_m = (f16 & FP16_MASK_M) >> FP16_LSB_M;
+
+	f32_s = f16_s;
+	switch (f16_e) {
+	case (FP16_MASK_E >> FP16_LSB_E): /* float16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (f16_m == 0x0) { /* infinity */
+			f32_m = f16_m;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = f16_m;
+			shift = FP32_MSB_M - FP16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* float16: zero or sub-normal */
+		f32_m = f16_m;
+		if (f16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			clz = __builtin_clz((uint32_t)f16_m) - sizeof(uint32_t) * 8 + FP16_LSB_E;
+			e_16 = (int)f16_e - clz;
+			f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+			shift = clz + (FP32_MSB_M - FP16_MSB_M) + 1;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+		}
+		break;
+	default: /* normal numbers */
+		f32_m = f16_m;
+		e_16 = (int)f16_e;
+		f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+		shift = (FP32_MSB_M - FP16_MSB_M);
+		f32_m = (f32_m << shift) & FP32_MASK_M;
+	}
+
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+ml_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a
+ * brain float number (bfloat16) using round to nearest rounding mode.
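+ * Since bfloat16 has the same exponent width and bias as float32, only
+ * the mantissa is rounded from 23 to 7 bits; e.g. 1.0f (0x3f800000)
+ * maps to 0x3f80 (illustrative).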
+ */
+static uint16_t
+__float32_to_bfloat16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t b16_s;	   /* bfloat16 sign */
+	uint16_t b16_e;	   /* bfloat16 exponent */
+	uint16_t b16_m;	   /* bfloat16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint16_t u16;	   /* bfloat16 output */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	b16_s = f32_s;
+	b16_e = 0;
+	b16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		b16_e = 0;
+		if (f32_m == 0) /* zero */
+			b16_m = 0;
+		else /* subnormal float32 number, round to bfloat16 */
+			goto bf16_normal;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		b16_e = BF16_MASK_E >> BF16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			b16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			b16_m = f32_m >> (FP32_MSB_M - BF16_MSB_M);
+			b16_m |= BIT(BF16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number, normal bfloat16 */
+		goto bf16_normal;
+	}
+
+	goto bf16_pack;
+
+bf16_normal:
+	b16_e = f32_e;
+	tbits = FP32_MSB_M - BF16_MSB_M;
+	b16_m = f32_m >> tbits;
+
+	/* if non-leading truncated bits are set */
+	if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+		b16_m++;
+
+		/* if overflow into exponent */
+		if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+			b16_e++;
+	} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+		/* if only leading truncated bit is set */
+		if ((b16_m & 0x1) == 0x1) {
+			b16_m++;
+
+			/* if overflow into exponent */
+			if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+				b16_e++;
+		}
+	}
+	b16_m = b16_m & BF16_MASK_M;
+
+bf16_pack:
+	u16 = BF16_PACK(b16_s, b16_e, b16_m);
+
+	return u16;
+}
+
+__rte_weak int
+ml_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_bfloat16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a brain float number (bfloat16) into a
+ * single precision floating point number (float32).
+ */
+static float
+__bfloat16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t b16_s;	   /* bfloat16 sign */
+	uint16_t b16_e;	   /* bfloat16 exponent */
+	uint16_t b16_m;	   /* bfloat16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+
+	b16_s = (f16 & BF16_MASK_S) >> BF16_LSB_S;
+	b16_e = (f16 & BF16_MASK_E) >> BF16_LSB_E;
+	b16_m = (f16 & BF16_MASK_M) >> BF16_LSB_M;
+
+	f32_s = b16_s;
+	switch (b16_e) {
+	case (BF16_MASK_E >> BF16_LSB_E): /* bfloat16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (b16_m == 0x0) { /* infinity */
+			f32_m = 0;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = b16_m;
+			shift = FP32_MSB_M - BF16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* bfloat16: zero or subnormal */
+		f32_m = b16_m;
+		if (b16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			goto fp32_normal;
+		}
+		break;
+	default: /* bfloat16: normal number */
+		goto fp32_normal;
+	}
+
+	goto fp32_pack;
+
+fp32_normal:
+	f32_m = b16_m;
+	f32_e = FP32_BIAS_E + b16_e - BF16_BIAS_E;
+
+	shift = (FP32_MSB_M - BF16_MSB_M);
+	f32_m = (f32_m << shift) & FP32_MASK_M;
+
+fp32_pack:
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+ml_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __bfloat16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
diff --git a/drivers/common/ml/version.map b/drivers/common/ml/version.map
index 7e33755f2f..35f270f637 100644
--- a/drivers/common/ml/version.map
+++ b/drivers/common/ml/version.map
@@ -5,5 +5,21 @@ INTERNAL {
 	ml_io_type_to_str;
 	ml_io_format_to_str;

+	ml_float32_to_int8;
+	ml_int8_to_float32;
+	ml_float32_to_uint8;
+	ml_uint8_to_float32;
+
+	ml_float32_to_int16;
+	ml_int16_to_float32;
+	ml_float32_to_uint16;
+	ml_uint16_to_float32;
+
+	ml_float32_to_float16;
+	ml_float16_to_float32;
+
+	ml_float32_to_bfloat16;
+	ml_bfloat16_to_float32;
+
 	local: *;
 };
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 4/4] common/ml: add Arm NEON type conversion routines
  2022-12-20 17:52   ` [PATCH v3 0/4] implementation of ML common code Srikanth Yalavarthi
                       ` (2 preceding siblings ...)
  2022-12-20 17:52     ` [PATCH v3 3/4] common/ml: add scalar type conversion functions Srikanth Yalavarthi
@ 2022-12-20 17:52     ` Srikanth Yalavarthi
  2022-12-21  3:08       ` Ruifeng Wang
  2022-12-20 19:06     ` [PATCH v3 0/4] implementation of ML common code Stephen Hemminger
  2023-01-25 13:18     ` Thomas Monjalon
  5 siblings, 1 reply; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-20 17:52 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Ruifeng Wang; +Cc: dev, sshankarnara, jerinj, aprabhu

Added ARM NEON intrinsic based implementations to support conversion
of data types. Support is enabled to handle int8, uint8, int16, uint16,
float16, float32 and bfloat16 types.
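
Each routine converts vlen elements per NEON iteration and handles the
leftover tail with single-element intrinsics. For example, for float32
to int8, vlen = 2 * sizeof(float) / sizeof(int8_t) = 8, i.e. two
float32x4 loads are narrowed into one int8x8 store per iteration.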

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v2:
* Dropped use of driver routines to call neon functions
* Optimization of neon functions. Reduce the number of intrinsic calls.

 drivers/common/ml/meson.build     |   4 +
 drivers/common/ml/ml_utils_neon.c | 873 ++++++++++++++++++++++++++++++
 2 files changed, 877 insertions(+)
 create mode 100644 drivers/common/ml/ml_utils_neon.c

diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
index 271aa9c33a..22139bb8ae 100644
--- a/drivers/common/ml/meson.build
+++ b/drivers/common/ml/meson.build
@@ -16,6 +16,10 @@ sources = files(
         'ml_utils_scalar.c',
 )

+if arch_subdir == 'arm'
+    sources += files('ml_utils_neon.c')
+endif
+
 deps += ['mldev']

 pmd_supports_disable_iova_as_pa = true
diff --git a/drivers/common/ml/ml_utils_neon.c b/drivers/common/ml/ml_utils_neon.c
new file mode 100644
index 0000000000..4acf13123c
--- /dev/null
+++ b/drivers/common/ml/ml_utils_neon.c
@@ -0,0 +1,873 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+#include "ml_utils.h"
+
+#include <arm_neon.h>
+
+/* Description:
+ * This file implements vector versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa. Implementation is based on Arm
+ * Neon intrinsics.
+ */
+
+static inline void
+__float32_to_int8_neon_s8x8(float scale, float *input, int8_t *output)
+{
+	int16x4_t s16x4_l;
+	int16x4_t s16x4_h;
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_l = vqmovn_s32(s32x4);
+
+	/* load next 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_h = vqmovn_s32(s32x4);
+
+	/* combine lower and higher int16x4_t to int16x8_t */
+	s16x8 = vcombine_s16(s16x4_l, s16x4_h);
+
+	/* narrow to int8_t */
+	s8x8 = vqmovn_s16(s16x8);
+
+	/* store 8 elements */
+	vst1_s8(output, s8x8);
+}
+
+static inline void
+__float32_to_int8_neon_s8x1(float scale, float *input, int8_t *output)
+{
+	int32_t s32;
+	int16_t s16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	s16 = vqmovns_s32(s32);
+
+	/* convert to int8_t */
+	*output = vqmovnh_s16(s16);
+}
+
+int
+ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int8_neon_s8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int8_neon_s8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__int8_to_float32_neon_f32x8(float scale, int8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 8 x int8_t elements */
+	s8x8 = vld1_s8(input);
+
+	/* widen int8_t to int16_t */
+	s16x8 = vmovl_s8(s8x8);
+
+	/* convert lower 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_low_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_high_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__int8_to_float32_neon_f32x1(float scale, int8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+ml_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_uint8_neon_u8x8(float scale, float *input, uint8_t *output)
+{
+	uint16x4_t u16x4_l;
+	uint16x4_t u16x4_h;
+	float32x4_t f32x4;
+	uint32x4_t u32x4;
+	uint16x8_t u16x8;
+	uint8x8_t u8x8;
+
+	/* load 4 float elements, scale, convert, saturate narrow to uint16_t.
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_l = vqmovn_u32(u32x4);
+
+	/* load next 4 float elements, scale, convert, saturate narrow to uint16_t
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_h = vqmovn_u32(u32x4);
+
+	/* combine lower and higher uint16x4_t */
+	u16x8 = vcombine_u16(u16x4_l, u16x4_h);
+
+	/* narrow to uint8x8_t */
+	u8x8 = vqmovn_u16(u16x8);
+
+	/* store 8 elements */
+	vst1_u8(output, u8x8);
+}
+
+static inline void
+__float32_to_uint8_neon_u8x1(float scale, float *input, uint8_t *output)
+{
+	uint32_t u32;
+	uint16_t u16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	u16 = vqmovns_u32(u32);
+
+	/* convert to uint8_t */
+	*output = vqmovnh_u16(u16);
+}
+
+int
+ml_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint8_neon_u8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint8_neon_u8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint8_to_float32_neon_f32x8(float scale, uint8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x8_t u16x8;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+	uint8x8_t u8x8;
+
+	/* load 8 x uint8_t elements */
+	u8x8 = vld1_u8(input);
+
+	/* widen uint8_t to uint16_t */
+	u16x8 = vmovl_u8(u8x8);
+
+	/* convert lower 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_low_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_high_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__uint8_to_float32_neon_f32x1(float scale, uint8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+ml_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_int16_neon_s16x4(float scale, float *input, int16_t *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert to int32x4_t using round to nearest with ties away rounding mode */
+	s32x4 = vcvtaq_s32_f32(f32x4);
+
+	/* saturate narrow to int16x4_t */
+	s16x4 = vqmovn_s32(s32x4);
+
+	/* store 4 elements */
+	vst1_s16(output, s16x4);
+}
+
+static inline void
+__float32_to_int16_neon_s16x1(float scale, float *input, int16_t *output)
+{
+	int32_t s32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_s32(s32);
+}
+
+int
+ml_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int16_neon_s16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int16_neon_s16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__int16_to_float32_neon_f32x4(float scale, int16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x int16_t elements */
+	s16x4 = vld1_s16(input);
+
+	/* widen int16_t to int32_t */
+	s32x4 = vmovl_s16(s16x4);
+
+	/* convert int32_t to float */
+	f32x4 = vcvtq_f32_s32(s32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__int16_to_float32_neon_f32x1(float scale, int16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+ml_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_uint16_neon_u16x4(float scale, float *input, uint16_t *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert using round to nearest with ties to away rounding mode */
+	u32x4 = vcvtaq_u32_f32(f32x4);
+
+	/* saturate narrow */
+	u16x4 = vqmovn_u32(u32x4);
+
+	/* store 4 elements */
+	vst1_u16(output, u16x4);
+}
+
+static inline void
+__float32_to_uint16_neon_u16x1(float scale, float *input, uint16_t *output)
+{
+	uint32_t u32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_u32(u32);
+}
+
+int
+ml_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint16_neon_u16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint16_neon_u16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint16_to_float32_neon_f32x4(float scale, uint16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 x uint16_t elements */
+	u16x4 = vld1_u16(input);
+
+	/* widen uint16_t to uint32_t */
+	u32x4 = vmovl_u16(u16x4);
+
+	/* convert uint32_t to float */
+	f32x4 = vcvtq_f32_u32(u32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__uint16_to_float32_neon_f32x1(float scale, uint16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+ml_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_float16_neon_f16x4(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert to float16x4_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store float16x4_t */
+	vst1_f16(output, f16x4);
+}
+
+static inline void
+__float32_to_float16_neon_f16x1(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to float16_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store lane 0 (one element) */
+	vst1_lane_f16(output, f16x4, 0);
+}
+
+int
+ml_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	float16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (float16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_float16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_float16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float16_to_float32_neon_f32x4(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x float16_t elements */
+	f16x4 = vld1_f16(input);
+
+	/* convert float16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__float16_to_float32_neon_f32x1(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	f16x4 = vld1_dup_f16(input);
+
+	/* convert float16_t to float32_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+ml_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	float16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#ifdef __ARM_FEATURE_BF16
+
+static inline void
+__float32_to_bfloat16_neon_f16x4(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert float32x4_t to bfloat16x4_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store bfloat16x4_t */
+	vst1_bf16(output, bf16x4);
+}
+
+static inline void
+__float32_to_bfloat16_neon_f16x1(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to bfloat16_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store lane 0 (one element) */
+	vst1_lane_bf16(output, bf16x4, 0);
+}
+
+int
+ml_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	bfloat16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (bfloat16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_bfloat16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_bfloat16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x4(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x bfloat16_t elements */
+	bf16x4 = vld1_bf16(input);
+
+	/* convert bfloat16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x1(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	bf16x4 = vld1_dup_bf16(input);
+
+	/* convert bfloat16_t to float32_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store lane 0 (one element) */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+ml_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	bfloat16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (bfloat16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__bfloat16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__bfloat16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#endif /* __ARM_FEATURE_BF16 */
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 1/4] common/ml: add initial files for ML common code
  2022-12-20 17:52     ` [PATCH v3 1/4] common/ml: add initial files for " Srikanth Yalavarthi
@ 2022-12-20 19:04       ` Stephen Hemminger
  2022-12-20 19:19         ` [EXT] " Srikanth Yalavarthi
  0 siblings, 1 reply; 59+ messages in thread
From: Stephen Hemminger @ 2022-12-20 19:04 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: Thomas Monjalon, dev, sshankarnara, jerinj, aprabhu

On Tue, 20 Dec 2022 09:52:53 -0800
Srikanth Yalavarthi <syalavarthi@marvell.com> wrote:

> diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
> new file mode 100644
> index 0000000000..b0ecc42668
> --- /dev/null
> +++ b/drivers/common/ml/meson.build
> @@ -0,0 +1,20 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright (c) 2022 Marvell.
> +
> +if not is_linux or not dpdk_conf.get('RTE_ARCH_64')
> +    build = false
> +    reason = 'only supported on 64-bit Linux'
> +    subdir_done()
> +endif
> +

Why only x86? And why only Linux?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 0/4] implementation of ML common code
  2022-12-20 17:52   ` [PATCH v3 0/4] implementation of ML common code Srikanth Yalavarthi
                       ` (3 preceding siblings ...)
  2022-12-20 17:52     ` [PATCH v3 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
@ 2022-12-20 19:06     ` Stephen Hemminger
  2022-12-20 19:17       ` [EXT] " Srikanth Yalavarthi
  2023-01-25 13:18     ` Thomas Monjalon
  5 siblings, 1 reply; 59+ messages in thread
From: Stephen Hemminger @ 2022-12-20 19:06 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

On Tue, 20 Dec 2022 09:52:52 -0800
Srikanth Yalavarthi <syalavarthi@marvell.com> wrote:

> Machine Learning common code
> ----------------------------
> 
> This patch series implements the common ML code that can be used by
> ML drivers. Common code include functions to convert ML IO type to
> string, IO format type to string, function get size of ML IO type,
> and functions for converting data types from higher precision to
> lower precision and vice-versa.
> 
> Data type conversion functions support handling float32, float16,
> bfloat16, uint8, int8, uint16 and int16. Two versions of conversion
> functions are implemented in the series, generic scalar version and
> vector version using Arm NEON intrinsics. When compiling DPDK for
> platform supporting Arm NEON, vector NEON version of the routines would
> be enabled. Compilation would fall back to generic scalar versions on
> platform like x86_64 / PowerPC etc., that don't support Arm NEON.
> 
> 
> Srikanth Yalavarthi (4):
>   common/ml: add initial files for ML common code
>   common/ml: add common utility functions
>   common/ml: add scalar type conversion functions
>   common/ml: add Arm NEON type conversion routines


Ok, but much more is needed.

Where is the documentation?
Where are the tests?
Where is an example?

Need a driver that uses it

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2022-12-20 19:06     ` [PATCH v3 0/4] implementation of ML common code Stephen Hemminger
@ 2022-12-20 19:17       ` Srikanth Yalavarthi
  0 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-20 19:17 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, Shivah Shankar Shankar Narayan Rao,
	Jerin Jacob Kollanukkaran, Anup Prabhu, Srikanth Yalavarthi

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: 21 December 2022 00:37
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Cc: dev@dpdk.org; Shivah Shankar Shankar Narayan Rao
> <sshankarnara@marvell.com>; Jerin Jacob Kollanukkaran
> <jerinj@marvell.com>; Anup Prabhu <aprabhu@marvell.com>
> Subject: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
> 
> External Email
> 
> ----------------------------------------------------------------------
> On Tue, 20 Dec 2022 09:52:52 -0800
> Srikanth Yalavarthi <syalavarthi@marvell.com> wrote:
> 
> > Machine Learning common code
> > ----------------------------
> >
> > This patch series implements the common ML code that can be used by ML
> > drivers. Common code include functions to convert ML IO type to
> > string, IO format type to string, function get size of ML IO type, and
> > functions for converting data types from higher precision to lower
> > precision and vice-versa.
> >
> > Data type conversion functions support handling float32, float16,
> > bfloat16, uint8, int8, uint16 and int16. Two versions of conversion
> > functions are implemented in the series, generic scalar version and
> > vector version using Arm NEON intrinsics. When compiling DPDK for
> > platform supporting Arm NEON, vector NEON version of the routines
> > would be enabled. Compilation would fall back to generic scalar
> > versions on platform like x86_64 / PowerPC etc., that don't support Arm
> NEON.
> >
> >
> > Srikanth Yalavarthi (4):
> >   common/ml: add initial files for ML common code
> >   common/ml: add common utility functions
> >   common/ml: add scalar type conversion functions
> >   common/ml: add Arm NEON type conversion routines
> 
> 
> Ok, but much more is needed.
> 
> Where is the documentation?
Documentation for the functions is part of the header files. Doxygen documentation is not added, as the functions and macros in common/ml are non-RTE code.

> Where are the tests?
> Where is an example?
We do not plan to implement unit tests or examples for the common/ml code, as it is intended for driver use only.

> 
> Need a driver that uses it
The ml/cnxk driver uses the functions that are part of the common/ml code. The driver is submitted as a separate patch series:
http://patches.dpdk.org/project/dpdk/list/?series=26050
http://patches.dpdk.org/project/dpdk/patch/20221208201806.21893-23-syalavarthi@marvell.com/

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [EXT] Re: [PATCH v3 1/4] common/ml: add initial files for ML common code
  2022-12-20 19:04       ` Stephen Hemminger
@ 2022-12-20 19:19         ` Srikanth Yalavarthi
  0 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2022-12-20 19:19 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Thomas Monjalon, dev, Shivah Shankar Shankar Narayan Rao,
	Jerin Jacob Kollanukkaran, Anup Prabhu, Srikanth Yalavarthi

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: 21 December 2022 00:35
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Cc: Thomas Monjalon <thomas@monjalon.net>; dev@dpdk.org; Shivah
> Shankar Shankar Narayan Rao <sshankarnara@marvell.com>; Jerin Jacob
> Kollanukkaran <jerinj@marvell.com>; Anup Prabhu <aprabhu@marvell.com>
> Subject: [EXT] Re: [PATCH v3 1/4] common/ml: add initial files for ML
> common code
> 
> External Email
> 
> ----------------------------------------------------------------------
> On Tue, 20 Dec 2022 09:52:53 -0800
> Srikanth Yalavarthi <syalavarthi@marvell.com> wrote:
> 
> > diff --git a/drivers/common/ml/meson.build
> > b/drivers/common/ml/meson.build new file mode 100644 index
> > 0000000000..b0ecc42668
> > --- /dev/null
> > +++ b/drivers/common/ml/meson.build
> > @@ -0,0 +1,20 @@
> > +# SPDX-License-Identifier: BSD-3-Clause # Copyright (c) 2022 Marvell.
> > +
> > +if not is_linux or not dpdk_conf.get('RTE_ARCH_64')
> > +    build = false
> > +    reason = 'only supported on 64-bit Linux'
> > +    subdir_done()
> > +endif
> > +
> 
> Why only x86? and why only linux?
common/ml is added as a dependency of the ml/cnxk driver, which is currently supported on Linux only.
We can enable Windows support at a later stage, if needed.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v3 4/4] common/ml: add Arm NEON type conversion routines
  2022-12-20 17:52     ` [PATCH v3 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
@ 2022-12-21  3:08       ` Ruifeng Wang
  0 siblings, 0 replies; 59+ messages in thread
From: Ruifeng Wang @ 2022-12-21  3:08 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu, nd

> -----Original Message-----
> From: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Sent: Wednesday, December 21, 2022 1:53 AM
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; sshankarnara@marvell.com; jerinj@marvell.com; aprabhu@marvell.com
> Subject: [PATCH v3 4/4] common/ml: add Arm NEON type conversion routines
> 
> Added ARM NEON intrinsic based implementations to support conversion of data types.
> Support is enabled to handle int8, uint8, int16, uint16, float16, float32 and bfloat16
> types.
> 
> Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
> ---
> v2:
> * Dropped use of driver routines to call neon functions
> * Optimization of neon functions. Reduce the number of intrinsic calls.
> 
>  drivers/common/ml/meson.build     |   4 +
>  drivers/common/ml/ml_utils_neon.c | 873 ++++++++++++++++++++++++++++++
>  2 files changed, 877 insertions(+)
>  create mode 100644 drivers/common/ml/ml_utils_neon.c
> 
Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 0/4] implementation of ML common code
  2022-12-20 17:52   ` [PATCH v3 0/4] implementation of ML common code Srikanth Yalavarthi
                       ` (4 preceding siblings ...)
  2022-12-20 19:06     ` [PATCH v3 0/4] implementation of ML common code Stephen Hemminger
@ 2023-01-25 13:18     ` Thomas Monjalon
  2023-01-25 13:25       ` [EXT] " Srikanth Yalavarthi
  5 siblings, 1 reply; 59+ messages in thread
From: Thomas Monjalon @ 2023-01-25 13:18 UTC (permalink / raw)
  To: sshankarnara
  Cc: dev, jerinj, aprabhu, Srikanth Yalavarthi, ferruh.yigit,
	bruce.richardson, david.marchand

20/12/2022 18:52, Srikanth Yalavarthi:
> Machine Learning common code
> ----------------------------
> 
> This patch series implements the common ML code that can be used by
> ML drivers. Common code include functions to convert ML IO type to
> string, IO format type to string, function get size of ML IO type,
> and functions for converting data types from higher precision to
> lower precision and vice-versa.

I'm not sure about the path of this code.
In general we implement driver helpers in the same directory as the driver
and mark them as internal.
Would it work here?

>  drivers/common/meson.build          |   1 +
>  drivers/common/ml/meson.build       |  25 +
>  drivers/common/ml/ml_utils.c        | 118 ++++
>  drivers/common/ml/ml_utils.h        | 283 +++++++++
>  drivers/common/ml/ml_utils_neon.c   | 873 ++++++++++++++++++++++++++++
>  drivers/common/ml/ml_utils_scalar.c | 720 +++++++++++++++++++++++
>  drivers/common/ml/version.map       |  25 +




^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2023-01-25 13:18     ` Thomas Monjalon
@ 2023-01-25 13:25       ` Srikanth Yalavarthi
  2023-01-25 13:55         ` Thomas Monjalon
  0 siblings, 1 reply; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-01-25 13:25 UTC (permalink / raw)
  To: Thomas Monjalon, Shivah Shankar Shankar Narayan Rao
  Cc: dev, Jerin Jacob Kollanukkaran, Anup Prabhu, ferruh.yigit,
	bruce.richardson, david.marchand, Srikanth Yalavarthi

Hi,


> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: 25 January 2023 18:48
> To: Shivah Shankar Shankar Narayan Rao <sshankarnara@marvell.com>
> Cc: dev@dpdk.org; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Anup
> Prabhu <aprabhu@marvell.com>; Srikanth Yalavarthi
> <syalavarthi@marvell.com>; ferruh.yigit@amd.com;
> bruce.richardson@intel.com; david.marchand@redhat.com
> Subject: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
> 
> External Email
> 
> ----------------------------------------------------------------------
> 20/12/2022 18:52, Srikanth Yalavarthi:
> > Machine Learning common code
> > ----------------------------
> >
> > This patch series implements the common ML code that can be used by ML
> > drivers. Common code include functions to convert ML IO type to
> > string, IO format type to string, function get size of ML IO type, and
> > functions for converting data types from higher precision to lower
> > precision and vice-versa.
> 
> I'm not sure about the path of this code.
> In general we implement drivers helper in the same directory as the driver
> and mark it as internal.
> Would it work here?

We are planning to implement two different ML drivers: the ml/cnxk driver (submitted for review) and a software-only driver (part of the ML roadmap and currently WIP). Both drivers would use these common functions for quantization and dequantization. Hence, the files are placed in the common/ml directory.

Moreover, these functions are used to convert data from higher to lower precision or vice versa, and can also be used by future ML drivers for other platforms.

> 
> >  drivers/common/meson.build          |   1 +
> >  drivers/common/ml/meson.build       |  25 +
> >  drivers/common/ml/ml_utils.c        | 118 ++++
> >  drivers/common/ml/ml_utils.h        | 283 +++++++++
> >  drivers/common/ml/ml_utils_neon.c   | 873
> ++++++++++++++++++++++++++++
> >  drivers/common/ml/ml_utils_scalar.c | 720 +++++++++++++++++++++++
> >  drivers/common/ml/version.map       |  25 +
> 
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2023-01-25 13:25       ` [EXT] " Srikanth Yalavarthi
@ 2023-01-25 13:55         ` Thomas Monjalon
  2023-01-25 14:59           ` Srikanth Yalavarthi
  0 siblings, 1 reply; 59+ messages in thread
From: Thomas Monjalon @ 2023-01-25 13:55 UTC (permalink / raw)
  To: Shivah Shankar Shankar Narayan Rao, Srikanth Yalavarthi
  Cc: dev, Jerin Jacob Kollanukkaran, Anup Prabhu, ferruh.yigit,
	bruce.richardson, david.marchand, Srikanth Yalavarthi

25/01/2023 14:25, Srikanth Yalavarthi:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 20/12/2022 18:52, Srikanth Yalavarthi:
> > > This patch series implements the common ML code that can be used by ML
> > > drivers. Common code include functions to convert ML IO type to
> > > string, IO format type to string, function get size of ML IO type, and
> > > functions for converting data types from higher precision to lower
> > > precision and vice-versa.
> > 
> > I'm not sure about the path of this code.
> > In general we implement drivers helper in the same directory as the driver
> > and mark it as internal.
> > Would it work here?
> 
> We are planning to implement two different ML drivers, ml/cnxk driver (submitted for review) and a software only driver (part of ML roadmap and currently WIP). Both the drivers would be using these common functions for quantization and dequantization. Hence, placed the files in common/ml directory.
> 
> Moreover, these functions are used to convert data from higher to lower precision or vice-versa and  can also be used by future ML drivers for other platforms.

I understand, and what you say does not contradict having
this code in lib/mldev/.
So would you agree to move it?



^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2023-01-25 13:55         ` Thomas Monjalon
@ 2023-01-25 14:59           ` Srikanth Yalavarthi
  2023-01-26 10:57             ` Thomas Monjalon
  0 siblings, 1 reply; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-01-25 14:59 UTC (permalink / raw)
  To: Thomas Monjalon, Shivah Shankar Shankar Narayan Rao
  Cc: dev, Jerin Jacob Kollanukkaran, Anup Prabhu, ferruh.yigit,
	bruce.richardson, david.marchand, Srikanth Yalavarthi

> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: 25 January 2023 19:25
> To: Shivah Shankar Shankar Narayan Rao <sshankarnara@marvell.com>;
> Srikanth Yalavarthi <syalavarthi@marvell.com>
> Cc: dev@dpdk.org; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Anup
> Prabhu <aprabhu@marvell.com>; ferruh.yigit@amd.com;
> bruce.richardson@intel.com; david.marchand@redhat.com; Srikanth
> Yalavarthi <syalavarthi@marvell.com>
> Subject: Re: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
> 
> 25/01/2023 14:25, Srikanth Yalavarthi:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > > 20/12/2022 18:52, Srikanth Yalavarthi:
> > > > This patch series implements the common ML code that can be used
> > > > by ML drivers. Common code include functions to convert ML IO type
> > > > to string, IO format type to string, function get size of ML IO
> > > > type, and functions for converting data types from higher
> > > > precision to lower precision and vice-versa.
> > >
> > > I'm not sure about the path of this code.
> > > In general we implement drivers helper in the same directory as the
> > > driver and mark it as internal.
> > > Would it work here?
> >
> > We are planning to implement two different ML drivers, ml/cnxk driver
> (submitted for review) and a software only driver (part of ML roadmap and
> currently WIP). Both the drivers would be using these common functions for
> quantization and dequantization. Hence, placed the files in common/ml
> directory.
> >
> > Moreover, these functions are used to convert data from higher to lower
> precision or vice-versa and  can also be used by future ML drivers for other
> platforms.
> 
> I understand, and what you say does not contradict with having this code in
> lib/mldev/.
> So would you agree to move?
> 

These common functions do not have an rte_ml_dev_ prefix.
Is it OK to have non-RTE code in lib/mldev? If yes, we can move it to lib/mldev.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2023-01-25 14:59           ` Srikanth Yalavarthi
@ 2023-01-26 10:57             ` Thomas Monjalon
  2023-01-27  6:40               ` Jerin Jacob
  0 siblings, 1 reply; 59+ messages in thread
From: Thomas Monjalon @ 2023-01-26 10:57 UTC (permalink / raw)
  To: Shivah Shankar Shankar Narayan Rao, Srikanth Yalavarthi
  Cc: dev, Jerin Jacob Kollanukkaran, Anup Prabhu, ferruh.yigit,
	bruce.richardson, david.marchand, Srikanth Yalavarthi

25/01/2023 15:59, Srikanth Yalavarthi:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 25/01/2023 14:25, Srikanth Yalavarthi:
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > 20/12/2022 18:52, Srikanth Yalavarthi:
> > > > > This patch series implements the common ML code that can be used
> > > > > by ML drivers. Common code include functions to convert ML IO type
> > > > > to string, IO format type to string, function get size of ML IO
> > > > > type, and functions for converting data types from higher
> > > > > precision to lower precision and vice-versa.
> > > >
> > > > I'm not sure about the path of this code.
> > > > In general we implement drivers helper in the same directory as the
> > > > driver and mark it as internal.
> > > > Would it work here?
> > >
> > > We are planning to implement two different ML drivers, ml/cnxk driver
> > (submitted for review) and a software only driver (part of ML roadmap and
> > currently WIP). Both the drivers would be using these common functions for
> > quantization and dequantization. Hence, placed the files in common/ml
> > directory.
> > >
> > > Moreover, these functions are used to convert data from higher to lower
> > precision or vice-versa and  can also be used by future ML drivers for other
> > platforms.
> > 
> > I understand, and what you say does not contradict with having this code in
> > lib/mldev/.
> > So would you agree to move?
> 
> These common functions do not have an rte_ml_dev_ prefix.

As it is exported, it should have the rte_ prefix.

> Is it ok to have non-RTE code in lib/mldev. If yes, we can move to lib/mldev.

Look at lib/ethdev/ethdev_driver.h; it should be similar.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2023-01-26 10:57             ` Thomas Monjalon
@ 2023-01-27  6:40               ` Jerin Jacob
  2023-01-27  8:50                 ` Thomas Monjalon
  0 siblings, 1 reply; 59+ messages in thread
From: Jerin Jacob @ 2023-01-27  6:40 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Shivah Shankar Shankar Narayan Rao, Srikanth Yalavarthi, dev,
	Jerin Jacob Kollanukkaran, Anup Prabhu, ferruh.yigit,
	bruce.richardson, david.marchand

On Thu, Jan 26, 2023 at 4:27 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> 25/01/2023 15:59, Srikanth Yalavarthi:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > > 25/01/2023 14:25, Srikanth Yalavarthi:
> > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > 20/12/2022 18:52, Srikanth Yalavarthi:
> > > > > > This patch series implements the common ML code that can be used
> > > > > > by ML drivers. Common code include functions to convert ML IO type
> > > > > > to string, IO format type to string, function get size of ML IO
> > > > > > type, and functions for converting data types from higher
> > > > > > precision to lower precision and vice-versa.
> > > > >
> > > > > I'm not sure about the path of this code.
> > > > > In general we implement drivers helper in the same directory as the
> > > > > driver and mark it as internal.
> > > > > Would it work here?
> > > >
> > > > We are planning to implement two different ML drivers, ml/cnxk driver
> > > (submitted for review) and a software only driver (part of ML roadmap and
> > > currently WIP). Both the drivers would be using these common functions for
> > > quantization and dequantization. Hence, placed the files in common/ml
> > > directory.
> > > >
> > > > Moreover, these functions are used to convert data from higher to lower
> > > precision or vice-versa and  can also be used by future ML drivers for other
> > > platforms.
> > >
> > > I understand, and what you say does not contradict with having this code in
> > > lib/mldev/.
> > > So would you agree to move?
> >
> > These common functions do not have an rte_ml_dev_ prefix.
>
> As it is exported, it should have rte_ prefix.

The exposed functions are similar to lib/ethdev/sff_*, which multiple
drivers can use but which applications do not call directly.
If so, what is the recommendation:
a) Keeping them in drivers/common/ml without the rte_ prefix
b) Keeping them in lib/mldev/ with an rte_mldev_pmd_ prefix?

I prefer (a) as it will not pollute lib/mldev, but I have no strong
opinion either way. Let me know your view or any other suggestion.

>
> > Is it ok to have non-RTE code in lib/mldev. If yes, we can move to lib/mldev.
>
> Look at lib/ethdev/ethdev_driver.h, it should be similar.

Here the scope is different. See above.

>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2023-01-27  6:40               ` Jerin Jacob
@ 2023-01-27  8:50                 ` Thomas Monjalon
  2023-01-27  9:02                   ` Jerin Jacob
  0 siblings, 1 reply; 59+ messages in thread
From: Thomas Monjalon @ 2023-01-27  8:50 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Shivah Shankar Shankar Narayan Rao, Srikanth Yalavarthi, dev,
	Jerin Jacob Kollanukkaran, Anup Prabhu, ferruh.yigit,
	bruce.richardson, david.marchand

27/01/2023 07:40, Jerin Jacob:
> On Thu, Jan 26, 2023 at 4:27 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > 25/01/2023 15:59, Srikanth Yalavarthi:
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > 25/01/2023 14:25, Srikanth Yalavarthi:
> > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > 20/12/2022 18:52, Srikanth Yalavarthi:
> > > > > > > This patch series implements the common ML code that can be used
> > > > > > > by ML drivers. Common code include functions to convert ML IO type
> > > > > > > to string, IO format type to string, function get size of ML IO
> > > > > > > type, and functions for converting data types from higher
> > > > > > > precision to lower precision and vice-versa.
> > > > > >
> > > > > > I'm not sure about the path of this code.
> > > > > > In general we implement drivers helper in the same directory as the
> > > > > > driver and mark it as internal.
> > > > > > Would it work here?
> > > > >
> > > > > We are planning to implement two different ML drivers, ml/cnxk driver
> > > > (submitted for review) and a software only driver (part of ML roadmap and
> > > > currently WIP). Both the drivers would be using these common functions for
> > > > quantization and dequantization. Hence, placed the files in common/ml
> > > > directory.
> > > > >
> > > > > Moreover, these functions are used to convert data from higher to lower
> > > > precision or vice-versa and  can also be used by future ML drivers for other
> > > > platforms.
> > > >
> > > > I understand, and what you say does not contradict with having this code in
> > > > lib/mldev/.
> > > > So would you agree to move?
> > >
> > > These common functions do not have an rte_ml_dev_ prefix.
> >
> > As it is exported, it should have rte_ prefix.
> 
> The exposed functions are similar to lib/ethdev/sff_* where multiple
> driver can "use" it
> but not by application directly.
> If so, What is the recommendation
> a) Keeping driver/common/ml without rte_prefix
> b) Keeping in lib/mldev/ with rte_mldev_pmd_ prefix?
> 
> I prefer (a) as it will not pollute lib/mldev. No strong opinion,
> either. Let me know your view or any other suggestion?

I don't see it as pollution; it comes with the library,
so I prefer lib/mldev/ with the rte_mldev_pmd_ prefix.


> > Is it ok to have non-RTE code in lib/mldev. If yes, we can move to lib/mldev.
> >
> > Look at lib/ethdev/ethdev_driver.h, it should be similar.
> 
> Here scope is different. See above.

No, the scope is not different.
They are functions used by drivers, not by applications.




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2023-01-27  8:50                 ` Thomas Monjalon
@ 2023-01-27  9:02                   ` Jerin Jacob
  2023-01-27  9:26                     ` Thomas Monjalon
  0 siblings, 1 reply; 59+ messages in thread
From: Jerin Jacob @ 2023-01-27  9:02 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Shivah Shankar Shankar Narayan Rao, Srikanth Yalavarthi, dev,
	Jerin Jacob Kollanukkaran, Anup Prabhu, ferruh.yigit,
	bruce.richardson, david.marchand

On Fri, Jan 27, 2023 at 2:20 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> 27/01/2023 07:40, Jerin Jacob:
> > On Thu, Jan 26, 2023 at 4:27 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > 25/01/2023 15:59, Srikanth Yalavarthi:
> > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > 25/01/2023 14:25, Srikanth Yalavarthi:
> > > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > > 20/12/2022 18:52, Srikanth Yalavarthi:
> > > > > > > > This patch series implements the common ML code that can be used
> > > > > > > > by ML drivers. Common code include functions to convert ML IO type
> > > > > > > > to string, IO format type to string, function get size of ML IO
> > > > > > > > type, and functions for converting data types from higher
> > > > > > > > precision to lower precision and vice-versa.
> > > > > > >
> > > > > > > I'm not sure about the path of this code.
> > > > > > > In general we implement drivers helper in the same directory as the
> > > > > > > driver and mark it as internal.
> > > > > > > Would it work here?
> > > > > >
> > > > > > We are planning to implement two different ML drivers, ml/cnxk driver
> > > > > (submitted for review) and a software only driver (part of ML roadmap and
> > > > > currently WIP). Both the drivers would be using these common functions for
> > > > > quantization and dequantization. Hence, placed the files in common/ml
> > > > > directory.
> > > > > >
> > > > > > Moreover, these functions are used to convert data from higher to lower
> > > > > precision or vice-versa and  can also be used by future ML drivers for other
> > > > > platforms.
> > > > >
> > > > > I understand, and what you say does not contradict with having this code in
> > > > > lib/mldev/.
> > > > > So would you agree to move?
> > > >
> > > > These common functions do not have an rte_ml_dev_ prefix.
> > >
> > > As it is exported, it should have rte_ prefix.
> >
> > The exposed functions are similar to lib/ethdev/sff_* where multiple
> > driver can "use" it
> > but not by application directly.
> > If so, What is the recommendation
> > a) Keeping driver/common/ml without rte_prefix
> > b) Keeping in lib/mldev/ with rte_mldev_pmd_ prefix?
> >
> > I prefer (a) as it will not pollute lib/mldev. No strong opinion,
> > either. Let me know your view or any other suggestion?
>
> I don't see it as pollution, it comes with the library,
> so I prefer lib/mldev/ with rte_mldev_pmd_ prefix.
>
>
> > > Is it ok to have non-RTE code in lib/mldev. If yes, we can move to lib/mldev.
> > >
> > > Look at lib/ethdev/ethdev_driver.h, it should be similar.
> >
> > Here scope is different. See above.
>
> No the scope is not different.
> They are functions used by drivers not by application.

When you say lib/ethdev/ethdev_driver.h, you mean the "struct eth_dev_ops" scheme.
That is already there for the public mldev APIs. See
http://patches.dpdk.org/project/dpdk/patch/20221114120238.2143832-4-jerinj@marvell.com/

Here, these are meant to be functions which are not backed by any
function pointers, as they are generic utils for drivers (not driver
specific). If so, do you prefer rte_mldev_pmd_ in lib/mldev/?
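
For illustration, the contrast is roughly the following (a schematic
sketch; dev_ops_scheme and ml_io_type_size_get are illustrative names
only, not the actual mldev or ethdev declarations):

/* (1) ops-table scheme: the library dispatches into a driver through
 *     per-device function pointers that each driver fills in. This is
 *     what already exists for the public mldev APIs.
 */
struct dev_ops_scheme {
	int (*dev_configure)(void *dev);
	int (*dev_start)(void *dev);
};

/* (2) generic driver utils: one common body, not backed by any
 *     function pointer, called directly by whichever driver needs it.
 */
int ml_io_type_size_get(int type);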



>
>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2023-01-27  9:02                   ` Jerin Jacob
@ 2023-01-27  9:26                     ` Thomas Monjalon
  2023-01-27 10:28                       ` Jerin Jacob
  0 siblings, 1 reply; 59+ messages in thread
From: Thomas Monjalon @ 2023-01-27  9:26 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Shivah Shankar Shankar Narayan Rao, Srikanth Yalavarthi, dev,
	Jerin Jacob Kollanukkaran, Anup Prabhu, ferruh.yigit,
	bruce.richardson, david.marchand

27/01/2023 10:02, Jerin Jacob:
> On Fri, Jan 27, 2023 at 2:20 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > 27/01/2023 07:40, Jerin Jacob:
> > > On Thu, Jan 26, 2023 at 4:27 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > > 25/01/2023 15:59, Srikanth Yalavarthi:
> > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > 25/01/2023 14:25, Srikanth Yalavarthi:
> > > > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > > > 20/12/2022 18:52, Srikanth Yalavarthi:
> > > > > > > > > This patch series implements the common ML code that can be used
> > > > > > > > > by ML drivers. Common code include functions to convert ML IO type
> > > > > > > > > to string, IO format type to string, function get size of ML IO
> > > > > > > > > type, and functions for converting data types from higher
> > > > > > > > > precision to lower precision and vice-versa.
> > > > > > > >
> > > > > > > > I'm not sure about the path of this code.
> > > > > > > > In general we implement drivers helper in the same directory as the
> > > > > > > > driver and mark it as internal.
> > > > > > > > Would it work here?
> > > > > > >
> > > > > > > We are planning to implement two different ML drivers, ml/cnxk driver
> > > > > > (submitted for review) and a software only driver (part of ML roadmap and
> > > > > > currently WIP). Both the drivers would be using these common functions for
> > > > > > quantization and dequantization. Hence, placed the files in common/ml
> > > > > > directory.
> > > > > > >
> > > > > > > Moreover, these functions are used to convert data from higher to lower
> > > > > > precision or vice-versa and  can also be used by future ML drivers for other
> > > > > > platforms.
> > > > > >
> > > > > > I understand, and what you say does not contradict with having this code in
> > > > > > lib/mldev/.
> > > > > > So would you agree to move?
> > > > >
> > > > > These common functions do not have an rte_ml_dev_ prefix.
> > > >
> > > > As it is exported, it should have rte_ prefix.
> > >
> > > The exposed functions are similar to lib/ethdev/sff_* where multiple
> > > driver can "use" it
> > > but not by application directly.
> > > If so, What is the recommendation
> > > a) Keeping driver/common/ml without rte_prefix
> > > b) Keeping in lib/mldev/ with rte_mldev_pmd_ prefix?
> > >
> > > I prefer (a) as it will not pollute lib/mldev. No strong opinion,
> > > either. Let me know your view or any other suggestion?
> >
> > I don't see it as pollution, it comes with the library,
> > so I prefer lib/mldev/ with rte_mldev_pmd_ prefix.
> >
> >
> > > > Is it ok to have non-RTE code in lib/mldev. If yes, we can move to lib/mldev.
> > > >
> > > > Look at lib/ethdev/ethdev_driver.h, it should be similar.
> > >
> > > Here scope is different. See above.
> >
> > No the scope is not different.
> > They are functions used by drivers not by application.
> 
> When you say lib/ethdev/ethdev_driver.h. You mean "struct eth_dev_ops" scheme.

No, I don't mean that. Did you check the internal functions in this file?
I mean functions like rte_eth_dev_allocate() or rte_eth_dev_attach_secondary().
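
For illustration, the convention those functions follow is roughly this
(a sketch modelled on lib/ethdev/ethdev_driver.h and its version map,
not quoted verbatim): the function keeps its rte_ prefix but is tagged
__rte_internal and exported through the INTERNAL section of the version
map, so drivers can call it while applications cannot.

/* ethdev_driver.h (sketch) */
__rte_internal
struct rte_eth_dev *
rte_eth_dev_allocate(const char *name);

/* version.map (sketch) */
INTERNAL {
	global:

	rte_eth_dev_allocate;
};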




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2023-01-27  9:26                     ` Thomas Monjalon
@ 2023-01-27 10:28                       ` Jerin Jacob
  2023-01-31 13:44                         ` Srikanth Yalavarthi
  0 siblings, 1 reply; 59+ messages in thread
From: Jerin Jacob @ 2023-01-27 10:28 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Shivah Shankar Shankar Narayan Rao, Srikanth Yalavarthi, dev,
	Jerin Jacob Kollanukkaran, Anup Prabhu, ferruh.yigit,
	bruce.richardson, david.marchand

On Fri, Jan 27, 2023 at 2:56 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> 27/01/2023 10:02, Jerin Jacob:
> > On Fri, Jan 27, 2023 at 2:20 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > 27/01/2023 07:40, Jerin Jacob:
> > > > On Thu, Jan 26, 2023 at 4:27 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > > > 25/01/2023 15:59, Srikanth Yalavarthi:
> > > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > > 25/01/2023 14:25, Srikanth Yalavarthi:
> > > > > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > > > > 20/12/2022 18:52, Srikanth Yalavarthi:
> > > > > > > > > > This patch series implements the common ML code that can be used
> > > > > > > > > > by ML drivers. Common code include functions to convert ML IO type
> > > > > > > > > > to string, IO format type to string, function get size of ML IO
> > > > > > > > > > type, and functions for converting data types from higher
> > > > > > > > > > precision to lower precision and vice-versa.
> > > > > > > > >
> > > > > > > > > I'm not sure about the path of this code.
> > > > > > > > > In general we implement drivers helper in the same directory as the
> > > > > > > > > driver and mark it as internal.
> > > > > > > > > Would it work here?
> > > > > > > >
> > > > > > > > We are planning to implement two different ML drivers, ml/cnxk driver
> > > > > > > (submitted for review) and a software only driver (part of ML roadmap and
> > > > > > > currently WIP). Both the drivers would be using these common functions for
> > > > > > > quantization and dequantization. Hence, placed the files in common/ml
> > > > > > > directory.
> > > > > > > >
> > > > > > > > Moreover, these functions are used to convert data from higher to lower
> > > > > > > precision or vice-versa and  can also be used by future ML drivers for other
> > > > > > > platforms.
> > > > > > >
> > > > > > > I understand, and what you say does not contradict with having this code in
> > > > > > > lib/mldev/.
> > > > > > > So would you agree to move?
> > > > > >
> > > > > > These common functions do not have an rte_ml_dev_ prefix.
> > > > >
> > > > > As it is exported, it should have rte_ prefix.
> > > >
> > > > The exposed functions are similar to lib/ethdev/sff_* where multiple
> > > > driver can "use" it
> > > > but not by application directly.
> > > > If so, What is the recommendation
> > > > a) Keeping driver/common/ml without rte_prefix
> > > > b) Keeping in lib/mldev/ with rte_mldev_pmd_ prefix?
> > > >
> > > > I prefer (a) as it will not pollute lib/mldev. No strong opinion,
> > > > either. Let me know your view or any other suggestion?
> > >
> > > I don't see it as pollution, it comes with the library,
> > > so I prefer lib/mldev/ with rte_mldev_pmd_ prefix.
> > >
> > >
> > > > > Is it ok to have non-RTE code in lib/mldev. If yes, we can move to lib/mldev.
> > > > >
> > > > > Look at lib/ethdev/ethdev_driver.h, it should be similar.
> > > >
> > > > Here scope is different. See above.
> > >
> > > No the scope is not different.
> > > They are functions used by drivers not by application.
> >
> > When you say lib/ethdev/ethdev_driver.h. You mean "struct eth_dev_ops" scheme.
>
> No I don't mean that. Did you check the internal functions in this file?
> I mean functions like rte_eth_dev_allocate() or rte_eth_dev_attach_secondary().

Got it. Let's change to the rte_ml_pmd_ prefix and add it to lib/mldev then.

>
>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2023-01-27 10:28                       ` Jerin Jacob
@ 2023-01-31 13:44                         ` Srikanth Yalavarthi
  2023-02-01  9:15                           ` Srikanth Yalavarthi
  0 siblings, 1 reply; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-01-31 13:44 UTC (permalink / raw)
  To: Jerin Jacob, Thomas Monjalon
  Cc: Shivah Shankar Shankar Narayan Rao, dev,
	Jerin Jacob Kollanukkaran, Anup Prabhu, ferruh.yigit,
	bruce.richardson, david.marchand, Srikanth Yalavarthi

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: 27 January 2023 15:58
> To: Thomas Monjalon <thomas@monjalon.net>
> Cc: Shivah Shankar Shankar Narayan Rao <sshankarnara@marvell.com>;
> Srikanth Yalavarthi <syalavarthi@marvell.com>; dev@dpdk.org; Jerin Jacob
> Kollanukkaran <jerinj@marvell.com>; Anup Prabhu
> <aprabhu@marvell.com>; ferruh.yigit@amd.com;
> bruce.richardson@intel.com; david.marchand@redhat.com; Srikanth
> Yalavarthi <syalavarthi@marvell.com>
> Subject: Re: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
> 
> On Fri, Jan 27, 2023 at 2:56 PM Thomas Monjalon <thomas@monjalon.net>
> wrote:
> >
> > 27/01/2023 10:02, Jerin Jacob:
> > > On Fri, Jan 27, 2023 at 2:20 PM Thomas Monjalon
> <thomas@monjalon.net> wrote:
> > > > 27/01/2023 07:40, Jerin Jacob:
> > > > > On Thu, Jan 26, 2023 at 4:27 PM Thomas Monjalon
> <thomas@monjalon.net> wrote:
> > > > > > 25/01/2023 15:59, Srikanth Yalavarthi:
> > > > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > > > 25/01/2023 14:25, Srikanth Yalavarthi:
> > > > > > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > > > > > 20/12/2022 18:52, Srikanth Yalavarthi:
> > > > > > > > > > > This patch series implements the common ML code that
> > > > > > > > > > > can be used by ML drivers. Common code include
> > > > > > > > > > > functions to convert ML IO type to string, IO format
> > > > > > > > > > > type to string, function get size of ML IO type, and
> > > > > > > > > > > functions for converting data types from higher precision to
> lower precision and vice-versa.
> > > > > > > > > >
> > > > > > > > > > I'm not sure about the path of this code.
> > > > > > > > > > In general we implement drivers helper in the same
> > > > > > > > > > directory as the driver and mark it as internal.
> > > > > > > > > > Would it work here?
> > > > > > > > >
> > > > > > > > > We are planning to implement two different ML drivers,
> > > > > > > > > ml/cnxk driver
> > > > > > > > (submitted for review) and a software only driver (part of
> > > > > > > > ML roadmap and currently WIP). Both the drivers would be
> > > > > > > > using these common functions for quantization and
> > > > > > > > dequantization. Hence, placed the files in common/ml directory.
> > > > > > > > >
> > > > > > > > > Moreover, these functions are used to convert data from
> > > > > > > > > higher to lower
> > > > > > > > precision or vice-versa and  can also be used by future ML
> > > > > > > > drivers for other platforms.
> > > > > > > >
> > > > > > > > I understand, and what you say does not contradict with
> > > > > > > > having this code in lib/mldev/.
> > > > > > > > So would you agree to move?
> > > > > > >
> > > > > > > These common functions do not have an rte_ml_dev_ prefix.
> > > > > >
> > > > > > As it is exported, it should have rte_ prefix.
> > > > >
> > > > > The exposed functions are similar to lib/ethdev/sff_* where
> > > > > multiple driver can "use" it but not by application directly.
> > > > > If so, What is the recommendation
> > > > > a) Keeping driver/common/ml without rte_prefix
> > > > > b) Keeping in lib/mldev/ with rte_mldev_pmd_ prefix?
> > > > >
> > > > > I prefer (a) as it will not pollute lib/mldev. No strong
> > > > > opinion, either. Let me know your view or any other suggestion?
> > > >
> > > > I don't see it as pollution, it comes with the library, so I
> > > > prefer lib/mldev/ with rte_mldev_pmd_ prefix.
> > > >
> > > >
> > > > > > Is it ok to have non-RTE code in lib/mldev. If yes, we can move to
> lib/mldev.
> > > > > >
> > > > > > Look at lib/ethdev/ethdev_driver.h, it should be similar.
> > > > >
> > > > > Here scope is different. See above.
> > > >
> > > > No the scope is not different.
> > > > They are functions used by drivers not by application.
> > >
> > > When you say lib/ethdev/ethdev_driver.h. You mean "struct
> eth_dev_ops" scheme.
> >
> > No I don't mean that. Did you check the internal functions in this file?
> > I mean functions like rte_eth_dev_allocate() or
> rte_eth_dev_attach_secondary().
> 
> Got it. Let's change to rte_ml_pmd_ prefix and add to lib/mldev then.

Considering the scope of these functions, I think that instead of the rte_ml_pmd_ prefix, the rte_ml_io_ prefix is more suitable. It would also be in line with the internal functions defined in other libraries.

I can push a revised patch series accordingly.
> 
> >
> >
> >

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v4 0/4] Implementation of ML common code
  2022-12-08 19:35 [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
                   ` (4 preceding siblings ...)
  2022-12-12 17:21 ` [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
@ 2023-02-01  9:04 ` Srikanth Yalavarthi
  2023-02-01  9:04   ` [PATCH v4 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
                     ` (3 more replies)
  2023-02-01  9:12 ` [PATCH v5 0/4] Implementation of ML common code Srikanth Yalavarthi
  2023-02-07 16:00 ` [PATCH v6 0/4] Implementation of ML common code Srikanth Yalavarthi
  7 siblings, 4 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-01  9:04 UTC (permalink / raw)
  Cc: dev, sshankarnara, jerinj, aprabhu, Srikanth Yalavarthi

Machine Learning common code
----------------------------

This patch series implements the common ML code that can be used by
ML drivers. The common code includes functions to convert an ML IO
type to a string, an IO format type to a string, a function to get
the size of an ML IO type, and functions for converting data types
from higher precision to lower precision and vice-versa.

The data type conversion functions support float32, float16, bfloat16,
uint8, int8, uint16 and int16. Two versions of the conversion functions
are implemented in the series: a generic scalar version and a vector
version using Arm NEON intrinsics. When compiling DPDK for a platform
supporting Arm NEON, the vector NEON version of the routines is enabled.
Compilation falls back to the generic scalar versions on platforms such
as x86_64 / PowerPC that don't support Arm NEON. A minimal usage sketch
of the conversion routines follows.
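
For illustration, a driver-side usage sketch of these routines (assuming
the rte_ml_io_* signatures declared in patch 1/4; the helper name
quantize_dequantize is hypothetical and error handling is kept minimal):

#include <stdint.h>
#include <stdlib.h>

#include "mldev_utils.h"

/* Quantize float32 input to int8, then dequantize back to float32.
 * Quantization computes round(x * scale) with saturation; dequantization
 * multiplies each element by the given scale.
 */
static int
quantize_dequantize(float *in, uint64_t n, float scale)
{
	int8_t *q = malloc(n * sizeof(*q));
	float *dq = malloc(n * sizeof(*dq));
	int ret = -1;

	if (q != NULL && dq != NULL) {
		ret = rte_ml_io_float32_to_int8(scale, n, in, q);
		if (ret == 0)
			ret = rte_ml_io_int8_to_float32(1.0f / scale, n, q, dq);
	}

	free(q);
	free(dq);
	return ret;
}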


Srikanth Yalavarthi (4):
  mldev: add headers for internal ML functions
  mldev: implement ML IO type handling functions
  mldev: add scalar type conversion functions
  mldev: add Arm NEON type conversion routines

 lib/mldev/meson.build          |   7 +
 lib/mldev/mldev_utils.c        | 118 +++++
 lib/mldev/mldev_utils.h        | 345 +++++++++++++
 lib/mldev/mldev_utils_neon.c   | 873 +++++++++++++++++++++++++++++++++
 lib/mldev/mldev_utils_scalar.c | 720 +++++++++++++++++++++++++++
 lib/mldev/version.map          |  16 +
 6 files changed, 2079 insertions(+)
 create mode 100644 lib/mldev/mldev_utils.c
 create mode 100644 lib/mldev/mldev_utils.h
 create mode 100644 lib/mldev/mldev_utils_neon.c
 create mode 100644 lib/mldev/mldev_utils_scalar.c

--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v4 1/4] mldev: add headers for internal ML functions
  2023-02-01  9:04 ` [PATCH v4 0/4] Implementation " Srikanth Yalavarthi
@ 2023-02-01  9:04   ` Srikanth Yalavarthi
  2023-02-01 13:54     ` Anup Prabhu
  2023-02-01  9:04   ` [PATCH v4 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-01  9:04 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Added header files for internal ML utility routines to convert
IO type and format to string and IO type to size, and for routines
to convert data types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
 lib/mldev/meson.build   |   2 +
 lib/mldev/mldev_utils.c |   5 +
 lib/mldev/mldev_utils.h | 345 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 352 insertions(+)
 create mode 100644 lib/mldev/mldev_utils.c
 create mode 100644 lib/mldev/mldev_utils.h

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index 5c99532c1a..452b83a480 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -4,6 +4,7 @@
 sources = files(
         'rte_mldev_pmd.c',
         'rte_mldev.c',
+        'mldev_utils.c',
 )
 
 headers = files(
@@ -16,6 +17,7 @@ indirect_headers += files(
 
 driver_sdk_headers += files(
         'rte_mldev_pmd.h',
+        'mldev_utils.h',
 )
 
 deps += ['mempool']
diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
new file mode 100644
index 0000000000..9dbbf013a0
--- /dev/null
+++ b/lib/mldev/mldev_utils.c
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include "mldev_utils.h"
diff --git a/lib/mldev/mldev_utils.h b/lib/mldev/mldev_utils.h
new file mode 100644
index 0000000000..04cdaab567
--- /dev/null
+++ b/lib/mldev/mldev_utils.h
@@ -0,0 +1,345 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _RTE_MLDEV_UTILS_H_
+#define _RTE_MLDEV_UTILS_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file
+ *
+ * RTE ML Device PMD utility API
+ *
+ * These APIs are for use by ML drivers; user applications shouldn't use them.
+ *
+ */
+
+#include <rte_compat.h>
+#include <rte_mldev.h>
+
+/**
+ * @internal
+ *
+ * Get the size of an ML IO type in bytes.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ *
+ * @return
+ *	- > 0, Size of the data type in bytes.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO type.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ * @param[in] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO format.
+ *
+ * @param[in] format
+ *	Enumeration of ML IO format.
+ * @param[in] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed 8-bit
+ * integer format (INT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 8-bit integer format (INT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 8-bit integer format (UINT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 8-bit integer format (UINT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed
+ * 16-bit integer format (INT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 16-bit integer format (INT16) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 16-bit integer format (UINT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 16-bit integer format (UINT16) to single
+ * precision floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to half
+ * precision floating point format (FP16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in half precision floating format (FP16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to brain
+ * floating point format (bfloat16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store bfloat16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in brain floating point format (bfloat16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing bfloat16 numbers. Size of buffer is equal to (nb_elements * 2)
+ * bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MLDEV_UTILS_H_ */
-- 
2.17.1
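
To make the documented scale semantics concrete, a worked example (the
numbers are illustrative; quantization is round-with-saturation and
dequantization is multiply-by-scale, per the scalar implementations
later in this series):

/* float32 -> int8 with scale = 127.0f:
 *    0.5 -> round( 0.5 * 127.0) = round(63.5) =   64
 *   -1.0 -> round(-1.0 * 127.0)               = -127
 *    2.0 -> round( 2.0 * 127.0) = 254, saturated to INT8_MAX (127)
 *
 * int8 -> float32 with scale = 1.0f / 127.0f:
 *    64 -> 64 * (1.0f / 127.0f) ~= 0.5039
 */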


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v4 2/4] mldev: implement ML IO type handling functions
  2023-02-01  9:04 ` [PATCH v4 0/4] Implementation " Srikanth Yalavarthi
  2023-02-01  9:04   ` [PATCH v4 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
@ 2023-02-01  9:04   ` Srikanth Yalavarthi
  2023-02-01 13:53     ` Anup Prabhu
                       ` (3 more replies)
  2023-02-01  9:04   ` [PATCH v4 3/4] mldev: add scalar type conversion functions Srikanth Yalavarthi
  2023-02-01  9:04   ` [PATCH v4 4/4] mldev: add Arm NEON type conversion routines Srikanth Yalavarthi
  3 siblings, 4 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-01  9:04 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Implemented ML utility functions to convert an IO data type to a name
and an IO format to a name, and a routine to get the size of an IO
data type in bytes.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
 lib/mldev/mldev_utils.c | 113 ++++++++++++++++++++++++++++++++++++++++
 lib/mldev/version.map   |   4 ++
 2 files changed, 117 insertions(+)

diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
index 9dbbf013a0..d2442b123b 100644
--- a/lib/mldev/mldev_utils.c
+++ b/lib/mldev/mldev_utils.c
@@ -2,4 +2,117 @@
  * Copyright (c) 2022 Marvell.
  */
 
+#include <errno.h>
+#include <stdint.h>
+
+#include <rte_mldev.h>
+#include <rte_string_fns.h>
+
 #include "mldev_utils.h"
+
+/* Description:
+ * This file implements Machine Learning utility routines, except type conversion routines.
+ */
+
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		return -EINVAL;
+	case RTE_ML_IO_TYPE_INT8:
+		return sizeof(int8_t);
+	case RTE_ML_IO_TYPE_UINT8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_INT16:
+		return sizeof(int16_t);
+	case RTE_ML_IO_TYPE_UINT16:
+		return sizeof(uint16_t);
+	case RTE_ML_IO_TYPE_INT32:
+		return sizeof(int32_t);
+	case RTE_ML_IO_TYPE_UINT32:
+		return sizeof(uint32_t);
+	case RTE_ML_IO_TYPE_FP8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_FP16:
+		return sizeof(uint8_t) * 2;
+	case RTE_ML_IO_TYPE_FP32:
+		return sizeof(uint8_t) * 4;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		return sizeof(uint8_t) * 2;
+	default:
+		return -EINVAL;
+	}
+}
+
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		rte_strlcpy(str, "unknown", len);
+		break;
+	case RTE_ML_IO_TYPE_INT8:
+		rte_strlcpy(str, "int8", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT8:
+		rte_strlcpy(str, "uint8", len);
+		break;
+	case RTE_ML_IO_TYPE_INT16:
+		rte_strlcpy(str, "int16", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT16:
+		rte_strlcpy(str, "uint16", len);
+		break;
+	case RTE_ML_IO_TYPE_INT32:
+		rte_strlcpy(str, "int32", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT32:
+		rte_strlcpy(str, "uint32", len);
+		break;
+	case RTE_ML_IO_TYPE_FP8:
+		rte_strlcpy(str, "float8", len);
+		break;
+	case RTE_ML_IO_TYPE_FP16:
+		rte_strlcpy(str, "float16", len);
+		break;
+	case RTE_ML_IO_TYPE_FP32:
+		rte_strlcpy(str, "float32", len);
+		break;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		rte_strlcpy(str, "bfloat16", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
+
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
+{
+	switch (format) {
+	case RTE_ML_IO_FORMAT_NCHW:
+		rte_strlcpy(str, "NCHW", len);
+		break;
+	case RTE_ML_IO_FORMAT_NHWC:
+		rte_strlcpy(str, "NHWC", len);
+		break;
+	case RTE_ML_IO_FORMAT_CHWN:
+		rte_strlcpy(str, "CHWN", len);
+		break;
+	case RTE_ML_IO_FORMAT_3D:
+		rte_strlcpy(str, "3D", len);
+		break;
+	case RTE_ML_IO_FORMAT_2D:
+		rte_strlcpy(str, "Matrix", len);
+		break;
+	case RTE_ML_IO_FORMAT_1D:
+		rte_strlcpy(str, "Vector", len);
+		break;
+	case RTE_ML_IO_FORMAT_SCALAR:
+		rte_strlcpy(str, "Scalar", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
diff --git a/lib/mldev/version.map b/lib/mldev/version.map
index 61955ab701..c2ceedfbb4 100644
--- a/lib/mldev/version.map
+++ b/lib/mldev/version.map
@@ -46,4 +46,8 @@ INTERNAL {
 	rte_ml_dev_pmd_get_dev;
 	rte_ml_dev_pmd_get_named_dev;
 	rte_ml_dev_pmd_release;
+
+	rte_ml_io_type_size_get;
+	rte_ml_io_type_to_str;
+	rte_ml_io_format_to_str;
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v4 3/4] mldev: add scalar type conversion functions
  2023-02-01  9:04 ` [PATCH v4 0/4] Implementation " Srikanth Yalavarthi
  2023-02-01  9:04   ` [PATCH v4 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
  2023-02-01  9:04   ` [PATCH v4 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
@ 2023-02-01  9:04   ` Srikanth Yalavarthi
  2023-02-01  9:04   ` [PATCH v4 4/4] mldev: add Arm NEON type conversion routines Srikanth Yalavarthi
  3 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-01  9:04 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Added scalar implementations to support conversion of data types.
Support is enabled to handle int8, uint8, int16, uint16, float16,
float32 and bfloat16 types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
 lib/mldev/meson.build          |   1 +
 lib/mldev/mldev_utils_scalar.c | 720 +++++++++++++++++++++++++++++++++
 lib/mldev/version.map          |  12 +
 3 files changed, 733 insertions(+)
 create mode 100644 lib/mldev/mldev_utils_scalar.c

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index 452b83a480..fce9c0ebee 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -5,6 +5,7 @@ sources = files(
         'rte_mldev_pmd.c',
         'rte_mldev.c',
         'mldev_utils.c',
+        'mldev_utils_scalar.c',
 )
 
 headers = files(
diff --git a/lib/mldev/mldev_utils_scalar.c b/lib/mldev/mldev_utils_scalar.c
new file mode 100644
index 0000000000..40320ed3ef
--- /dev/null
+++ b/lib/mldev/mldev_utils_scalar.c
@@ -0,0 +1,720 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <math.h>
+#include <stdint.h>
+
+#include "mldev_utils.h"
+
+/* Description:
+ * This file implements scalar versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa.
+ */
+
+#ifndef BIT
+#define BIT(nr) (1UL << (nr))
+#endif
+
+#ifndef BITS_PER_LONG
+#define BITS_PER_LONG (__SIZEOF_LONG__ * 8)
+#endif
+
+#ifndef GENMASK_U32
+#define GENMASK_U32(h, l) (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+#endif
+
+/* float32: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP32_LSB_M 0
+#define FP32_MSB_M 22
+#define FP32_LSB_E 23
+#define FP32_MSB_E 30
+#define FP32_LSB_S 31
+#define FP32_MSB_S 31
+
+/* float32: bitmask for sign, exponent and mantissa */
+#define FP32_MASK_S GENMASK_U32(FP32_MSB_S, FP32_LSB_S)
+#define FP32_MASK_E GENMASK_U32(FP32_MSB_E, FP32_LSB_E)
+#define FP32_MASK_M GENMASK_U32(FP32_MSB_M, FP32_LSB_M)
+
+/* float16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP16_LSB_M 0
+#define FP16_MSB_M 9
+#define FP16_LSB_E 10
+#define FP16_MSB_E 14
+#define FP16_LSB_S 15
+#define FP16_MSB_S 15
+
+/* float16: bitmask for sign, exponent and mantissa */
+#define FP16_MASK_S GENMASK_U32(FP16_MSB_S, FP16_LSB_S)
+#define FP16_MASK_E GENMASK_U32(FP16_MSB_E, FP16_LSB_E)
+#define FP16_MASK_M GENMASK_U32(FP16_MSB_M, FP16_LSB_M)
+
+/* bfloat16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define BF16_LSB_M 0
+#define BF16_MSB_M 6
+#define BF16_LSB_E 7
+#define BF16_MSB_E 14
+#define BF16_LSB_S 15
+#define BF16_MSB_S 15
+
+/* bfloat16: bitmask for sign, exponent and mantissa */
+#define BF16_MASK_S GENMASK_U32(BF16_MSB_S, BF16_LSB_S)
+#define BF16_MASK_E GENMASK_U32(BF16_MSB_E, BF16_LSB_E)
+#define BF16_MASK_M GENMASK_U32(BF16_MSB_M, BF16_LSB_M)
+
+/* Exponent bias */
+#define FP32_BIAS_E 127
+#define FP16_BIAS_E 15
+#define BF16_BIAS_E 127
+
+#define FP32_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP32_LSB_S) | ((exponent) << FP32_LSB_E) | (mantissa))
+
+#define FP16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP16_LSB_S) | ((exponent) << FP16_LSB_E) | (mantissa))
+
+#define BF16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << BF16_LSB_S) | ((exponent) << BF16_LSB_E) | (mantissa))
+
+/* Represent float32 as float and uint32_t */
+union float32 {
+	float f;
+	uint32_t u;
+};
+
+__rte_weak int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t i;
+	int i32;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT8_MIN)
+			i32 = INT8_MIN;
+
+		if (i32 > INT8_MAX)
+			i32 = INT8_MAX;
+
+		*output_buffer = (int8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT8_MAX)
+			i32 = UINT8_MAX;
+
+		*output_buffer = (uint8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT16_MIN)
+			i32 = INT16_MIN;
+
+		if (i32 > INT16_MAX)
+			i32 = INT16_MAX;
+
+		*output_buffer = (int16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT16_MAX)
+			i32 = UINT16_MAX;
+
+		*output_buffer = (uint16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a half precision
+ * floating point number (float16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_float16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint32_t tmsb;	   /* MSB position of truncated bits */
+	uint32_t m_32;	   /* temporary float32 mantissa */
+	uint16_t m_16;	   /* temporary float16 mantissa */
+	uint16_t u16;	   /* float16 output */
+	int be_16;	   /* float16 biased exponent, signed */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	f16_s = f32_s;
+	f16_e = 0;
+	f16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		f16_e = 0;
+		if (f32_m == 0) /* zero */
+			f16_m = 0;
+		else /* subnormal number, convert to zero */
+			f16_m = 0;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		f16_e = FP16_MASK_E >> FP16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			f16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			f16_m = f32_m >> (FP32_MSB_M - FP16_MSB_M);
+			f16_m |= BIT(FP16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number */
+		/* compute biased exponent for float16 */
+		be_16 = (int)f32_e - FP32_BIAS_E + FP16_BIAS_E;
+
+		/* overflow, be_16 = [31-INF], set to infinity */
+		if (be_16 >= (int)(FP16_MASK_E >> FP16_LSB_E)) {
+			f16_e = FP16_MASK_E >> FP16_LSB_E;
+			f16_m = 0;
+		} else if ((be_16 >= 1) && (be_16 < (int)(FP16_MASK_E >> FP16_LSB_E))) {
+			/* normal float16, be_16 = [1:30] */
+			f16_e = be_16;
+			m_16 = f32_m >> (FP32_LSB_E - FP16_LSB_E);
+			tmsb = FP32_MSB_M - FP16_MSB_M - 1;
+			if ((f32_m & GENMASK_U32(tmsb, 0)) > BIT(tmsb)) {
+				/* round: non-zero truncated bits except MSB */
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tmsb, 0)) == BIT(tmsb)) {
+				/* round: MSB of truncated bits and LSB of m_16 is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if ((be_16 >= -(int)(FP16_MSB_M)) && (be_16 < 1)) {
+			/* underflow: zero / subnormal, be_16 = [-9:0] */
+			f16_e = 0;
+
+			/* add implicit leading zero */
+			m_32 = f32_m | BIT(FP32_LSB_E);
+			tbits = FP32_LSB_E - FP16_LSB_E - be_16 + 1;
+			m_16 = m_32 >> tbits;
+
+			/* if non-leading truncated bits are set */
+			if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+				/* if leading truncated bit is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if (be_16 == -(int)(FP16_MSB_M + 1)) {
+			/* underflow: zero, be_16 = [-10] */
+			f16_e = 0;
+			if (f32_m != 0)
+				f16_m = 1;
+			else
+				f16_m = 0;
+		} else {
+			/* underflow: zero, be_16 = [-INF:-11] */
+			f16_e = 0;
+			f16_m = 0;
+		}
+
+		break;
+	}
+
+	u16 = FP16_PACK(f16_s, f16_e, f16_m);
+
+	return u16;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_float16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a half precision floating point number (float16) into a single precision
+ * floating point number (float32).
+ */
+static float
+__float16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+	uint32_t clz;	   /* count of leading zeroes */
+	int e_16;	   /* float16 exponent unbiased */
+
+	f16_s = (f16 & FP16_MASK_S) >> FP16_LSB_S;
+	f16_e = (f16 & FP16_MASK_E) >> FP16_LSB_E;
+	f16_m = (f16 & FP16_MASK_M) >> FP16_LSB_M;
+
+	f32_s = f16_s;
+	switch (f16_e) {
+	case (FP16_MASK_E >> FP16_LSB_E): /* float16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (f16_m == 0x0) { /* infinity */
+			f32_m = f16_m;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = f16_m;
+			shift = FP32_MSB_M - FP16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* float16: zero or sub-normal */
+		f32_m = f16_m;
+		if (f16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			clz = __builtin_clz((uint32_t)f16_m) - sizeof(uint32_t) * 8 + FP16_LSB_E;
+			e_16 = (int)f16_e - clz;
+			f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+			shift = clz + (FP32_MSB_M - FP16_MSB_M) + 1;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+		}
+		break;
+	default: /* normal numbers */
+		f32_m = f16_m;
+		e_16 = (int)f16_e;
+		f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+		shift = (FP32_MSB_M - FP16_MSB_M);
+		f32_m = (f32_m << shift) & FP32_MASK_M;
+	}
+
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a
+ * brain float number (bfloat16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_bfloat16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t b16_s;	   /* bfloat16 sign */
+	uint16_t b16_e;	   /* bfloat16 exponent */
+	uint16_t b16_m;	   /* bfloat16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint16_t u16;	   /* bfloat16 output */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	b16_s = f32_s;
+	b16_e = 0;
+	b16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		b16_e = 0;
+		if (f32_m == 0) /* zero */
+			b16_m = 0;
+		else /* subnormal float32 number, normal bfloat16 */
+			goto bf16_normal;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		b16_e = BF16_MASK_E >> BF16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			b16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			b16_m = f32_m >> (FP32_MSB_M - BF16_MSB_M);
+			b16_m |= BIT(BF16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number, normal bfloat16 */
+		goto bf16_normal;
+	}
+
+	goto bf16_pack;
+
+bf16_normal:
+	b16_e = f32_e;
+	tbits = FP32_MSB_M - BF16_MSB_M;
+	b16_m = f32_m >> tbits;
+
+	/* if non-leading truncated bits are set */
+	if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+		b16_m++;
+
+		/* if overflow into exponent */
+		if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+			b16_e++;
+	} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+		/* if only leading truncated bit is set */
+		if ((b16_m & 0x1) == 0x1) {
+			b16_m++;
+
+			/* if overflow into exponent */
+			if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+				b16_e++;
+		}
+	}
+	b16_m = b16_m & BF16_MASK_M;
+
+bf16_pack:
+	u16 = BF16_PACK(b16_s, b16_e, b16_m);
+
+	return u16;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_bfloat16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a brain float number (bfloat16) into a
+ * single precision floating point number (float32).
+ */
+static float
+__bfloat16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t b16_s;	   /* bfloat16 sign */
+	uint16_t b16_e;	   /* bfloat16 exponent */
+	uint16_t b16_m;	   /* bfloat16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+
+	b16_s = (f16 & BF16_MASK_S) >> BF16_LSB_S;
+	b16_e = (f16 & BF16_MASK_E) >> BF16_LSB_E;
+	b16_m = (f16 & BF16_MASK_M) >> BF16_LSB_M;
+
+	f32_s = b16_s;
+	switch (b16_e) {
+	case (BF16_MASK_E >> BF16_LSB_E): /* bfloat16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (b16_m == 0x0) { /* infinity */
+			f32_m = 0;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = b16_m;
+			shift = FP32_MSB_M - BF16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* bfloat16: zero or subnormal */
+		f32_m = b16_m;
+		if (b16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			goto fp32_normal;
+		}
+		break;
+	default: /* bfloat16: normal number */
+		goto fp32_normal;
+	}
+
+	goto fp32_pack;
+
+fp32_normal:
+	f32_m = b16_m;
+	f32_e = FP32_BIAS_E + b16_e - BF16_BIAS_E;
+
+	shift = (FP32_MSB_M - BF16_MSB_M);
+	f32_m = (f32_m << shift) & FP32_MASK_M;
+
+fp32_pack:
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __bfloat16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
diff --git a/lib/mldev/version.map b/lib/mldev/version.map
index c2ceedfbb4..f11d5de1ef 100644
--- a/lib/mldev/version.map
+++ b/lib/mldev/version.map
@@ -50,4 +50,16 @@ INTERNAL {
 	rte_ml_io_type_size_get;
 	rte_ml_io_type_to_str;
 	rte_ml_io_format_to_str;
+	rte_ml_io_float32_to_int8;
+	rte_ml_io_int8_to_float32;
+	rte_ml_io_float32_to_uint8;
+	rte_ml_io_uint8_to_float32;
+	rte_ml_io_float32_to_int16;
+	rte_ml_io_int16_to_float32;
+	rte_ml_io_float32_to_uint16;
+	rte_ml_io_uint16_to_float32;
+	rte_ml_io_float32_to_float16;
+	rte_ml_io_float16_to_float32;
+	rte_ml_io_float32_to_bfloat16;
+	rte_ml_io_bfloat16_to_float32;
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v4 4/4] mldev: add Arm NEON type conversion routines
  2023-02-01  9:04 ` [PATCH v4 0/4] Implementation " Srikanth Yalavarthi
                     ` (2 preceding siblings ...)
  2023-02-01  9:04   ` [PATCH v4 3/4] mldev: add scalar type conversion functions Srikanth Yalavarthi
@ 2023-02-01  9:04   ` Srikanth Yalavarthi
  3 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-01  9:04 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Ruifeng Wang; +Cc: dev, sshankarnara, jerinj, aprabhu

Added Arm NEON intrinsic-based implementations to support conversion
of data types. Support is enabled to handle int8, uint8, int16, uint16,
float16, float32 and bfloat16 types.
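
A minimal usage sketch, assuming a hypothetical caller (the buffer
size and scale value below are illustrative only, not part of this
patch). The NEON routines keep the same signatures as the scalar
versions, so callers need no changes:

#include <stdint.h>
#include "mldev_utils.h"

/* Hypothetical example: quantize 1000 float32 values to int8 with a
 * scale of 16.0f. On Arm builds this resolves to the NEON routine;
 * other builds keep the __rte_weak scalar fallback.
 */
static int
quantize_example(float *src, int8_t *dst)
{
	return rte_ml_io_float32_to_int8(16.0f, 1000, src, dst);
}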

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
 lib/mldev/meson.build        |   4 +
 lib/mldev/mldev_utils_neon.c | 873 +++++++++++++++++++++++++++++++++++
 2 files changed, 877 insertions(+)
 create mode 100644 lib/mldev/mldev_utils_neon.c

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index fce9c0ebee..05694b0839 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -8,6 +8,10 @@ sources = files(
         'mldev_utils_scalar.c',
 )
 
+if arch_subdir == 'arm'
+    sources += files('mldev_utils_neon.c')
+endif
+
 headers = files(
         'rte_mldev.h',
 )
diff --git a/lib/mldev/mldev_utils_neon.c b/lib/mldev/mldev_utils_neon.c
new file mode 100644
index 0000000000..32b620db20
--- /dev/null
+++ b/lib/mldev/mldev_utils_neon.c
@@ -0,0 +1,873 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+#include "mldev_utils.h"
+
+#include <arm_neon.h>
+
+/* Description:
+ * This file implements vector versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa. Implementation is based on Arm
+ * Neon intrinsics.
+ */
+
+static inline void
+__float32_to_int8_neon_s8x8(float scale, float *input, int8_t *output)
+{
+	int16x4_t s16x4_l;
+	int16x4_t s16x4_h;
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_l = vqmovn_s32(s32x4);
+
+	/* load next 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_h = vqmovn_s32(s32x4);
+
+	/* combine lower and higher int16x4_t to int16x8_t */
+	s16x8 = vcombine_s16(s16x4_l, s16x4_h);
+
+	/* narrow to int8_t */
+	s8x8 = vqmovn_s16(s16x8);
+
+	/* store 8 elements */
+	vst1_s8(output, s8x8);
+}
+
+static inline void
+__float32_to_int8_neon_s8x1(float scale, float *input, int8_t *output)
+{
+	int32_t s32;
+	int16_t s16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	s16 = vqmovns_s32(s32);
+
+	/* convert to int8_t */
+	*output = vqmovnh_s16(s16);
+}
+
+int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int8_neon_s8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int8_neon_s8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__int8_to_float32_neon_f32x8(float scale, int8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 8 x int8_t elements */
+	s8x8 = vld1_s8(input);
+
+	/* widen int8_t to int16_t */
+	s16x8 = vmovl_s8(s8x8);
+
+	/* convert lower 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_low_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_high_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__int8_to_float32_neon_f32x1(float scale, int8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_uint8_neon_u8x8(float scale, float *input, uint8_t *output)
+{
+	uint16x4_t u16x4_l;
+	uint16x4_t u16x4_h;
+	float32x4_t f32x4;
+	uint32x4_t u32x4;
+	uint16x8_t u16x8;
+	uint8x8_t u8x8;
+
+	/* load 4 float elements, scale, convert, saturate narrow to uint16_t.
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_l = vqmovn_u32(u32x4);
+
+	/* load next 4 float elements, scale, convert, saturate narrow to uint16_t
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_h = vqmovn_u32(u32x4);
+
+	/* combine lower and higher uint16x4_t */
+	u16x8 = vcombine_u16(u16x4_l, u16x4_h);
+
+	/* narrow to uint8x8_t */
+	u8x8 = vqmovn_u16(u16x8);
+
+	/* store 8 elements */
+	vst1_u8(output, u8x8);
+}
+
+static inline void
+__float32_to_uint8_neon_u8x1(float scale, float *input, uint8_t *output)
+{
+	uint32_t u32;
+	uint16_t u16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	u16 = vqmovns_u32(u32);
+
+	/* convert to uint8_t */
+	*output = vqmovnh_u16(u16);
+}
+
+int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint8_neon_u8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint8_neon_u8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint8_to_float32_neon_f32x8(float scale, uint8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x8_t u16x8;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+	uint8x8_t u8x8;
+
+	/* load 8 x uint8_t elements */
+	u8x8 = vld1_u8(input);
+
+	/* widen uint8_t to uint16_t */
+	u16x8 = vmovl_u8(u8x8);
+
+	/* convert lower 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_low_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_high_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__uint8_to_float32_neon_f32x1(float scale, uint8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_int16_neon_s16x4(float scale, float *input, int16_t *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert to int32x4_t using round to nearest with ties away rounding mode */
+	s32x4 = vcvtaq_s32_f32(f32x4);
+
+	/* saturate narrow to int16x4_t */
+	s16x4 = vqmovn_s32(s32x4);
+
+	/* store 4 elements */
+	vst1_s16(output, s16x4);
+}
+
+static inline void
+__float32_to_int16_neon_s16x1(float scale, float *input, int16_t *output)
+{
+	int32_t s32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_s32(s32);
+}
+
+int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int16_neon_s16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int16_neon_s16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__int16_to_float32_neon_f32x4(float scale, int16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x int16_t elements */
+	s16x4 = vld1_s16(input);
+
+	/* widen int16_t to int32_t */
+	s32x4 = vmovl_s16(s16x4);
+
+	/* convert int32_t to float */
+	f32x4 = vcvtq_f32_s32(s32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__int16_to_float32_neon_f32x1(float scale, int16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_uint16_neon_u16x4(float scale, float *input, uint16_t *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert using round to nearest with ties away rounding mode */
+	u32x4 = vcvtaq_u32_f32(f32x4);
+
+	/* saturate narrow */
+	u16x4 = vqmovn_u32(u32x4);
+
+	/* store 4 elements */
+	vst1_u16(output, u16x4);
+}
+
+static inline void
+__float32_to_uint16_neon_u16x1(float scale, float *input, uint16_t *output)
+{
+	uint32_t u32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_u32(u32);
+}
+
+int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint16_neon_u16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint16_neon_u16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint16_to_float32_neon_f32x4(float scale, uint16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 x uint16_t elements */
+	u16x4 = vld1_u16(input);
+
+	/* widen uint16_t to uint32_t */
+	u32x4 = vmovl_u16(u16x4);
+
+	/* convert uint32_t to float */
+	f32x4 = vcvtq_f32_u32(u32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__uint16_to_float32_neon_f32x1(float scale, uint16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_float16_neon_f16x4(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert to float16x4_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store float16x4_t */
+	vst1_f16(output, f16x4);
+}
+
+static inline void
+__float32_to_float16_neon_f16x1(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to float16_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store 1 element from lane 0 */
+	vst1_lane_f16(output, f16x4, 0);
+}
+
+int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	float16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (float16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_float16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_float16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float16_to_float32_neon_f32x4(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x float16_t elements */
+	f16x4 = vld1_f16(input);
+
+	/* convert float16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__float16_to_float32_neon_f32x1(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	f16x4 = vld1_dup_f16(input);
+
+	/* convert float16_t to float32_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	float16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#ifdef __ARM_FEATURE_BF16
+
+static inline void
+__float32_to_bfloat16_neon_f16x4(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert float32x4_t to bfloat16x4_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store bfloat16x4_t */
+	vst1_bf16(output, bf16x4);
+}
+
+static inline void
+__float32_to_bfloat16_neon_f16x1(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to bfloat16_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store 1 element from lane 0 */
+	vst1_lane_bf16(output, bf16x4, 0);
+}
+
+int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	bfloat16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (bfloat16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_bfloat16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_bfloat16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x4(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x bfloat16_t elements */
+	bf16x4 = vld1_bf16(input);
+
+	/* convert bfloat16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x1(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	bf16x4 = vld1_dup_bf16(input);
+
+	/* convert bfloat16_t to float32_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store 1 element from lane 0 */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	bfloat16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (bfloat16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__bfloat16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__bfloat16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#endif /* __ARM_FEATURE_BF16 */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v5 0/4] Implementation of ML common code
  2022-12-08 19:35 [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
                   ` (5 preceding siblings ...)
  2023-02-01  9:04 ` [PATCH v4 0/4] Implementation " Srikanth Yalavarthi
@ 2023-02-01  9:12 ` Srikanth Yalavarthi
  2023-02-01  9:12   ` [PATCH v5 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
                     ` (3 more replies)
  2023-02-07 16:00 ` [PATCH v6 0/4] Implementation of ML common code Srikanth Yalavarthi
  7 siblings, 4 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-01  9:12 UTC (permalink / raw)
  Cc: dev, sshankarnara, jerinj, aprabhu, Srikanth Yalavarthi

Machine Learning common code
----------------------------

This patch series implements the common ML code that can be used by
ML drivers. The common code includes functions to convert an ML IO
type to string, an IO format type to string, a function to get the
size of an ML IO type, and functions for converting data types from
higher precision to lower precision and vice-versa.

Data type conversion functions support handling float32, float16,
bfloat16, uint8, int8, uint16 and int16. Two versions of the
conversion functions are implemented in the series: a generic scalar
version and a vector version using Arm NEON intrinsics. When DPDK is
compiled for a platform supporting Arm NEON, the vector NEON version
of the routines is enabled. Compilation falls back to the generic
scalar versions on platforms such as x86_64 / PowerPC that don't
support Arm NEON.
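
A worked example of the conversion semantics, as a sketch (the values
are chosen for illustration and are not part of the series): both
directions multiply by the caller-supplied scale, i.e. quantization
computes round(x * scale) with saturation and dequantization computes
q * scale, so a round trip uses reciprocal scales.

float in[4] = {0.5f, -1.0f, 2.0f, 300.0f};
int8_t q[4];
float out[4];

/* illustrative values, assuming a quantization scale of 16.0f */
rte_ml_io_float32_to_int8(16.0f, 4, in, q);
/* q = {8, -16, 32, 127}, the last element saturated to INT8_MAX */
rte_ml_io_int8_to_float32(1.0f / 16.0f, 4, q, out);
/* out = {0.5, -1.0, 2.0, 7.9375} */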

Srikanth Yalavarthi (4):
  mldev: add headers for internal ML functions
  mldev: implement ML IO type handling functions
  mldev: add scalar type conversion functions
  mldev: add Arm NEON type conversion routines

 lib/mldev/meson.build          |   7 +
 lib/mldev/mldev_utils.c        | 118 +++++
 lib/mldev/mldev_utils.h        | 345 +++++++++++++
 lib/mldev/mldev_utils_neon.c   | 873 +++++++++++++++++++++++++++++++++
 lib/mldev/mldev_utils_scalar.c | 720 +++++++++++++++++++++++++++
 lib/mldev/version.map          |  16 +
 6 files changed, 2079 insertions(+)
 create mode 100644 lib/mldev/mldev_utils.c
 create mode 100644 lib/mldev/mldev_utils.h
 create mode 100644 lib/mldev/mldev_utils_neon.c
 create mode 100644 lib/mldev/mldev_utils_scalar.c

--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v5 1/4] mldev: add headers for internal ML functions
  2023-02-01  9:12 ` [PATCH v5 0/4] Implementation of ML common code Srikanth Yalavarthi
@ 2023-02-01  9:12   ` Srikanth Yalavarthi
  2023-02-01  9:12   ` [PATCH v5 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-01  9:12 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Added header files for internal ML utility routines that convert
IO type and format to string, get IO type size, and convert data
types between precisions.
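
All conversion prototypes in this header share one shape: an optional
scale, an element count and untyped input/output buffers, returning 0
on success or a negative errno. A sketch of a call, assuming
hypothetical buffers:

uint64_t n = 8;
float f32[8];    /* nb_elements * 4 bytes */
uint16_t f16[8]; /* nb_elements * 2 bytes */
int ret;

/* hypothetical call; returns -EINVAL on bad arguments */
ret = rte_ml_io_float32_to_float16(n, f32, f16);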

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
Depends-on: series-26046 ("app/mldev: implement test framework for mldev")

v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v3:
* Skip installation of internal common/ml headers

v2:
* Moved implementation out of patch. Only headers are included.

 lib/mldev/meson.build   |   2 +
 lib/mldev/mldev_utils.c |   5 +
 lib/mldev/mldev_utils.h | 345 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 352 insertions(+)
 create mode 100644 lib/mldev/mldev_utils.c
 create mode 100644 lib/mldev/mldev_utils.h

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index 5c99532c1a..452b83a480 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -4,6 +4,7 @@
 sources = files(
         'rte_mldev_pmd.c',
         'rte_mldev.c',
+        'mldev_utils.c',
 )

 headers = files(
@@ -16,6 +17,7 @@ indirect_headers += files(

 driver_sdk_headers += files(
         'rte_mldev_pmd.h',
+        'mldev_utils.h',
 )

 deps += ['mempool']
diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
new file mode 100644
index 0000000000..9dbbf013a0
--- /dev/null
+++ b/lib/mldev/mldev_utils.c
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include "mldev_utils.h"
diff --git a/lib/mldev/mldev_utils.h b/lib/mldev/mldev_utils.h
new file mode 100644
index 0000000000..04cdaab567
--- /dev/null
+++ b/lib/mldev/mldev_utils.h
@@ -0,0 +1,345 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _RTE_MLDEV_UTILS_H_
+#define _RTE_MLDEV_UTILS_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file
+ *
+ * RTE ML Device PMD utility API
+ *
+ * These APIs are for use by ML drivers; user applications shouldn't use them.
+ *
+ */
+
+#include <rte_compat.h>
+#include <rte_mldev.h>
+
+/**
+ * @internal
+ *
+ * Get the size of an ML IO type in bytes.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ *
+ * @return
+ *	- > 0, Size of the data type in bytes.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO type.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ * @param[out] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO format.
+ *
+ * @param[in] format
+ *	Enumeration of ML IO format.
+ * @param[out] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed 8-bit
+ * integer format (INT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 8-bit integer format (INT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 8-bit integer format (UINT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 8-bit integer format (UINT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed
+ * 16-bit integer format (INT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 16-bit integer format (INT16) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 16-bit integer format (UINT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 16-bit integer format (UINT16) to single
+ * precision floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to half
+ * precision floating point format (FP16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in half precision floating format (FP16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to brain
+ * floating point format (bfloat16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store bfloat16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in brain floating point format (bfloat16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing bfloat16 numbers. Size of buffer is equal to (nb_elements * 2)
+ * bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MLDEV_UTILS_H_ */
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v5 2/4] mldev: implement ML IO type handling functions
  2023-02-01  9:12 ` [PATCH v5 0/4] Implementation of ML common code Srikanth Yalavarthi
  2023-02-01  9:12   ` [PATCH v5 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
@ 2023-02-01  9:12   ` Srikanth Yalavarthi
  2023-02-02  4:20     ` Anup Prabhu
  2023-02-01  9:12   ` [PATCH v5 3/4] mldev: add scalar type conversion functions Srikanth Yalavarthi
  2023-02-01  9:12   ` [PATCH v5 4/4] mldev: add Arm NEON type conversion routines Srikanth Yalavarthi
  3 siblings, 1 reply; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-01  9:12 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Implemented ML utility functions to convert an IO data type to a
name, to convert an IO format to a name, and to get the size of an
IO data type in bytes.
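
A short sketch of how a driver might combine these helpers (the
snippet is hypothetical usage, not part of this patch; the expected
results follow from the implementation below):

char name[16];
int size;

/* hypothetical example: query size and name of the float16 type */
size = rte_ml_io_type_size_get(RTE_ML_IO_TYPE_FP16);            /* 2 */
rte_ml_io_type_to_str(RTE_ML_IO_TYPE_FP16, name, sizeof(name)); /* "float16" */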

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v2:
* Implemented common utility functions as part of the patch
* Dropped use of driver routines for data conversion functions

 lib/mldev/mldev_utils.c | 113 ++++++++++++++++++++++++++++++++++++++++
 lib/mldev/version.map   |   4 ++
 2 files changed, 117 insertions(+)

diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
index 9dbbf013a0..d2442b123b 100644
--- a/lib/mldev/mldev_utils.c
+++ b/lib/mldev/mldev_utils.c
@@ -2,4 +2,117 @@
  * Copyright (c) 2022 Marvell.
  */

+#include <errno.h>
+#include <stdint.h>
+
+#include <rte_mldev.h>
+#include <rte_string_fns.h>
+
 #include "mldev_utils.h"
+
+/* Description:
+ * This file implements Machine Learning utility routines, except type conversion routines.
+ */
+
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		return -EINVAL;
+	case RTE_ML_IO_TYPE_INT8:
+		return sizeof(int8_t);
+	case RTE_ML_IO_TYPE_UINT8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_INT16:
+		return sizeof(int16_t);
+	case RTE_ML_IO_TYPE_UINT16:
+		return sizeof(uint16_t);
+	case RTE_ML_IO_TYPE_INT32:
+		return sizeof(int32_t);
+	case RTE_ML_IO_TYPE_UINT32:
+		return sizeof(uint32_t);
+	case RTE_ML_IO_TYPE_FP8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_FP16:
+		return sizeof(uint8_t) * 2;
+	case RTE_ML_IO_TYPE_FP32:
+		return sizeof(uint8_t) * 4;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		return sizeof(uint8_t) * 2;
+	default:
+		return -EINVAL;
+	}
+}
+
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		rte_strlcpy(str, "unknown", len);
+		break;
+	case RTE_ML_IO_TYPE_INT8:
+		rte_strlcpy(str, "int8", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT8:
+		rte_strlcpy(str, "uint8", len);
+		break;
+	case RTE_ML_IO_TYPE_INT16:
+		rte_strlcpy(str, "int16", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT16:
+		rte_strlcpy(str, "uint16", len);
+		break;
+	case RTE_ML_IO_TYPE_INT32:
+		rte_strlcpy(str, "int32", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT32:
+		rte_strlcpy(str, "uint32", len);
+		break;
+	case RTE_ML_IO_TYPE_FP8:
+		rte_strlcpy(str, "float8", len);
+		break;
+	case RTE_ML_IO_TYPE_FP16:
+		rte_strlcpy(str, "float16", len);
+		break;
+	case RTE_ML_IO_TYPE_FP32:
+		rte_strlcpy(str, "float32", len);
+		break;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		rte_strlcpy(str, "bfloat16", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
+
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
+{
+	switch (format) {
+	case RTE_ML_IO_FORMAT_NCHW:
+		rte_strlcpy(str, "NCHW", len);
+		break;
+	case RTE_ML_IO_FORMAT_NHWC:
+		rte_strlcpy(str, "NHWC", len);
+		break;
+	case RTE_ML_IO_FORMAT_CHWN:
+		rte_strlcpy(str, "CHWN", len);
+		break;
+	case RTE_ML_IO_FORMAT_3D:
+		rte_strlcpy(str, "3D", len);
+		break;
+	case RTE_ML_IO_FORMAT_2D:
+		rte_strlcpy(str, "Matrix", len);
+		break;
+	case RTE_ML_IO_FORMAT_1D:
+		rte_strlcpy(str, "Vector", len);
+		break;
+	case RTE_ML_IO_FORMAT_SCALAR:
+		rte_strlcpy(str, "Scalar", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
diff --git a/lib/mldev/version.map b/lib/mldev/version.map
index 61955ab701..c2ceedfbb4 100644
--- a/lib/mldev/version.map
+++ b/lib/mldev/version.map
@@ -46,4 +46,8 @@ INTERNAL {
 	rte_ml_dev_pmd_get_dev;
 	rte_ml_dev_pmd_get_named_dev;
 	rte_ml_dev_pmd_release;
+
+	rte_ml_io_type_size_get;
+	rte_ml_io_type_to_str;
+	rte_ml_io_format_to_str;
 };
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v5 3/4] mldev: add scalar type conversion functions
  2023-02-01  9:12 ` [PATCH v5 0/4] Implementation of ML common code Srikanth Yalavarthi
  2023-02-01  9:12   ` [PATCH v5 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
  2023-02-01  9:12   ` [PATCH v5 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
@ 2023-02-01  9:12   ` Srikanth Yalavarthi
  2023-02-01  9:12   ` [PATCH v5 4/4] mldev: add Arm NEON type conversion routines Srikanth Yalavarthi
  3 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-01  9:12 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu

Added scalar implementations to support conversion of data types.
Support is enabled to handle int8, uint8, int16, uint16, float16,
float32 and bfloat16 types.
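
As a worked example of the float32 to float16 path implemented below
(an illustrative trace, not part of this patch): 1.5f is 0x3FC00000 in
float32, i.e. sign 0, biased exponent 127, mantissa 0x400000.

/* worked example for __float32_to_float16_scalar_rtn(1.5f):          */
/* float16 biased exponent: 127 - FP32_BIAS_E + FP16_BIAS_E = 15      */
/* float16 mantissa: 0x400000 >> 13 = 0x200, truncated bits all zero, */
/* so no rounding is needed                                           */
/* FP16_PACK(0, 15, 0x200) = 0x3E00, which is 1.5 in float16          */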

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v2:
* Updated internal function names
* Updated function attributes to __rte_weak

 lib/mldev/meson.build          |   1 +
 lib/mldev/mldev_utils_scalar.c | 720 +++++++++++++++++++++++++++++++++
 lib/mldev/version.map          |  12 +
 3 files changed, 733 insertions(+)
 create mode 100644 lib/mldev/mldev_utils_scalar.c

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index 452b83a480..fce9c0ebee 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -5,6 +5,7 @@ sources = files(
         'rte_mldev_pmd.c',
         'rte_mldev.c',
         'mldev_utils.c',
+        'mldev_utils_scalar.c',
 )

 headers = files(
diff --git a/lib/mldev/mldev_utils_scalar.c b/lib/mldev/mldev_utils_scalar.c
new file mode 100644
index 0000000000..40320ed3ef
--- /dev/null
+++ b/lib/mldev/mldev_utils_scalar.c
@@ -0,0 +1,720 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <math.h>
+#include <stdint.h>
+
+#include "mldev_utils.h"
+
+/* Description:
+ * This file implements scalar versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa.
+ */
+
+#ifndef BIT
+#define BIT(nr) (1UL << (nr))
+#endif
+
+#ifndef BITS_PER_LONG
+#define BITS_PER_LONG (__SIZEOF_LONG__ * 8)
+#endif
+
+#ifndef GENMASK_U32
+#define GENMASK_U32(h, l) (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+#endif
+
+/* float32: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP32_LSB_M 0
+#define FP32_MSB_M 22
+#define FP32_LSB_E 23
+#define FP32_MSB_E 30
+#define FP32_LSB_S 31
+#define FP32_MSB_S 31
+
+/* float32: bitmask for sign, exponent and mantissa */
+#define FP32_MASK_S GENMASK_U32(FP32_MSB_S, FP32_LSB_S)
+#define FP32_MASK_E GENMASK_U32(FP32_MSB_E, FP32_LSB_E)
+#define FP32_MASK_M GENMASK_U32(FP32_MSB_M, FP32_LSB_M)
+
+/* float16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP16_LSB_M 0
+#define FP16_MSB_M 9
+#define FP16_LSB_E 10
+#define FP16_MSB_E 14
+#define FP16_LSB_S 15
+#define FP16_MSB_S 15
+
+/* float16: bitmask for sign, exponent and mantissa */
+#define FP16_MASK_S GENMASK_U32(FP16_MSB_S, FP16_LSB_S)
+#define FP16_MASK_E GENMASK_U32(FP16_MSB_E, FP16_LSB_E)
+#define FP16_MASK_M GENMASK_U32(FP16_MSB_M, FP16_LSB_M)
+
+/* bfloat16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define BF16_LSB_M 0
+#define BF16_MSB_M 6
+#define BF16_LSB_E 7
+#define BF16_MSB_E 14
+#define BF16_LSB_S 15
+#define BF16_MSB_S 15
+
+/* bfloat16: bitmask for sign, exponent and mantissa */
+#define BF16_MASK_S GENMASK_U32(BF16_MSB_S, BF16_LSB_S)
+#define BF16_MASK_E GENMASK_U32(BF16_MSB_E, BF16_LSB_E)
+#define BF16_MASK_M GENMASK_U32(BF16_MSB_M, BF16_LSB_M)
+
+/* Exponent bias */
+#define FP32_BIAS_E 127
+#define FP16_BIAS_E 15
+#define BF16_BIAS_E 127
+
+#define FP32_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP32_LSB_S) | ((exponent) << FP32_LSB_E) | (mantissa))
+
+#define FP16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP16_LSB_S) | ((exponent) << FP16_LSB_E) | (mantissa))
+
+#define BF16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << BF16_LSB_S) | ((exponent) << BF16_LSB_E) | (mantissa))
+
+/* Represent float32 as float and uint32_t */
+union float32 {
+	float f;
+	uint32_t u;
+};
+
+__rte_weak int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t i;
+	int i32;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT8_MIN)
+			i32 = INT8_MIN;
+
+		if (i32 > INT8_MAX)
+			i32 = INT8_MAX;
+
+		*output_buffer = (int8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT8_MAX)
+			i32 = UINT8_MAX;
+
+		*output_buffer = (uint8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT16_MIN)
+			i32 = INT16_MIN;
+
+		if (i32 > INT16_MAX)
+			i32 = INT16_MAX;
+
+		*output_buffer = (int16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT16_MAX)
+			i32 = UINT16_MAX;
+
+		*output_buffer = (uint16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a half precision
+ * floating point number (float16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_float16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint32_t tmsb;	   /* MSB position of truncated bits */
+	uint32_t m_32;	   /* temporary float32 mantissa */
+	uint16_t m_16;	   /* temporary float16 mantissa */
+	uint16_t u16;	   /* float16 output */
+	int be_16;	   /* float16 biased exponent, signed */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	f16_s = f32_s;
+	f16_e = 0;
+	f16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		f16_e = 0;
+		if (f32_m == 0) /* zero */
+			f16_m = 0;
+		else /* subnormal number, convert to zero */
+			f16_m = 0;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		f16_e = FP16_MASK_E >> FP16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			f16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			f16_m = f32_m >> (FP32_MSB_M - FP16_MSB_M);
+			f16_m |= BIT(FP16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number */
+		/* compute biased exponent for float16 */
+		be_16 = (int)f32_e - FP32_BIAS_E + FP16_BIAS_E;
+
+		/* overflow, be_16 = [31-INF], set to infinity */
+		if (be_16 >= (int)(FP16_MASK_E >> FP16_LSB_E)) {
+			f16_e = FP16_MASK_E >> FP16_LSB_E;
+			f16_m = 0;
+		} else if ((be_16 >= 1) && (be_16 < (int)(FP16_MASK_E >> FP16_LSB_E))) {
+			/* normal float16, be_16 = [1:30] */
+			f16_e = be_16;
+			m_16 = f32_m >> (FP32_LSB_E - FP16_LSB_E);
+			tmsb = FP32_MSB_M - FP16_MSB_M - 1;
+			if ((f32_m & GENMASK_U32(tmsb, 0)) > BIT(tmsb)) {
+				/* round: non-zero truncated bits except MSB */
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tmsb, 0)) == BIT(tmsb)) {
+				/* round: MSB of truncated bits and LSB of m_16 is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if ((be_16 >= -(int)(FP16_MSB_M)) && (be_16 < 1)) {
+			/* underflow: zero / subnormal, be_16 = [-9:0] */
+			f16_e = 0;
+
+			/* add implicit leading zero */
+			m_32 = f32_m | BIT(FP32_LSB_E);
+			tbits = FP32_LSB_E - FP16_LSB_E - be_16 + 1;
+			m_16 = m_32 >> tbits;
+
+			/* if non-leading truncated bits are set */
+			if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+				/* if leading truncated bit is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if (be_16 == -(int)(FP16_MSB_M + 1)) {
+			/* underflow: zero, be_16 = [-10] */
+			f16_e = 0;
+			if (f32_m != 0)
+				f16_m = 1;
+			else
+				f16_m = 0;
+		} else {
+			/* underflow: zero, be_16 = [-INF:-11] */
+			f16_e = 0;
+			f16_m = 0;
+		}
+
+		break;
+	}
+
+	u16 = FP16_PACK(f16_s, f16_e, f16_m);
+
+	return u16;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_float16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a half precision floating point number (float16) into a single precision
+ * floating point number (float32).
+ */
+static float
+__float16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+	uint32_t clz;	   /* count of leading zeroes */
+	int e_16;	   /* float16 exponent unbiased */
+
+	f16_s = (f16 & FP16_MASK_S) >> FP16_LSB_S;
+	f16_e = (f16 & FP16_MASK_E) >> FP16_LSB_E;
+	f16_m = (f16 & FP16_MASK_M) >> FP16_LSB_M;
+
+	f32_s = f16_s;
+	switch (f16_e) {
+	case (FP16_MASK_E >> FP16_LSB_E): /* float16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (f16_m == 0x0) { /* infinity */
+			f32_m = f16_m;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = f16_m;
+			shift = FP32_MSB_M - FP16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* float16: zero or sub-normal */
+		f32_m = f16_m;
+		if (f16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			clz = __builtin_clz((uint32_t)f16_m) - sizeof(uint32_t) * 8 + FP16_LSB_E;
+			e_16 = (int)f16_e - clz;
+			f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+			shift = clz + (FP32_MSB_M - FP16_MSB_M) + 1;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+		}
+		break;
+	default: /* normal numbers */
+		f32_m = f16_m;
+		e_16 = (int)f16_e;
+		f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+		shift = (FP32_MSB_M - FP16_MSB_M);
+		f32_m = (f32_m << shift) & FP32_MASK_M;
+	}
+
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a
+ * brain float number (bfloat16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_bfloat16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t b16_s;	   /* float16 sign */
+	uint16_t b16_e;	   /* float16 exponent */
+	uint16_t b16_m;	   /* float16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint16_t u16;	   /* float16 output */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	b16_s = f32_s;
+	b16_e = 0;
+	b16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		b16_e = 0;
+		if (f32_m == 0) /* zero */
+			b16_m = 0;
+		else /* subnormal float32 number, normal bfloat16 */
+			goto bf16_normal;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		b16_e = BF16_MASK_E >> BF16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			b16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			b16_m = f32_m >> (FP32_MSB_M - BF16_MSB_M);
+			b16_m |= BIT(BF16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number, normal bfloat16 */
+		goto bf16_normal;
+	}
+
+	goto bf16_pack;
+
+bf16_normal:
+	b16_e = f32_e;
+	tbits = FP32_MSB_M - BF16_MSB_M;
+	b16_m = f32_m >> tbits;
+
+	/* if non-leading truncated bits are set */
+	if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+		b16_m++;
+
+		/* if overflow into exponent */
+		if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+			b16_e++;
+	} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+		/* if only leading truncated bit is set */
+		if ((b16_m & 0x1) == 0x1) {
+			b16_m++;
+
+			/* if overflow into exponent */
+			if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+				b16_e++;
+		}
+	}
+	b16_m = b16_m & BF16_MASK_M;
+
+bf16_pack:
+	u16 = BF16_PACK(b16_s, b16_e, b16_m);
+
+	return u16;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_bfloat16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a brain float number (bfloat16) into a
+ * single precision floating point number (float32).
+ */
+static float
+__bfloat16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t b16_s;	   /* float16 sign */
+	uint16_t b16_e;	   /* float16 exponent */
+	uint16_t b16_m;	   /* float16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+
+	b16_s = (f16 & BF16_MASK_S) >> BF16_LSB_S;
+	b16_e = (f16 & BF16_MASK_E) >> BF16_LSB_E;
+	b16_m = (f16 & BF16_MASK_M) >> BF16_LSB_M;
+
+	f32_s = b16_s;
+	switch (b16_e) {
+	case (BF16_MASK_E >> BF16_LSB_E): /* bfloat16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (b16_m == 0x0) { /* infinity */
+			f32_m = 0;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = b16_m;
+			shift = FP32_MSB_M - BF16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* bfloat16: zero or subnormal */
+		f32_m = b16_m;
+		if (b16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			goto fp32_normal;
+		}
+		break;
+	default: /* bfloat16: normal number */
+		goto fp32_normal;
+	}
+
+	goto fp32_pack;
+
+fp32_normal:
+	f32_m = b16_m;
+	f32_e = FP32_BIAS_E + b16_e - BF16_BIAS_E;
+
+	shift = (FP32_MSB_M - BF16_MSB_M);
+	f32_m = (f32_m << shift) & FP32_MASK_M;
+
+fp32_pack:
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __bfloat16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
diff --git a/lib/mldev/version.map b/lib/mldev/version.map
index c2ceedfbb4..f11d5de1ef 100644
--- a/lib/mldev/version.map
+++ b/lib/mldev/version.map
@@ -50,4 +50,16 @@ INTERNAL {
 	rte_ml_io_type_size_get;
 	rte_ml_io_type_to_str;
 	rte_ml_io_format_to_str;
+	rte_ml_io_float32_to_int8;
+	rte_ml_io_int8_to_float32;
+	rte_ml_io_float32_to_uint8;
+	rte_ml_io_uint8_to_float32;
+	rte_ml_io_float32_to_int16;
+	rte_ml_io_int16_to_float32;
+	rte_ml_io_float32_to_uint16;
+	rte_ml_io_uint16_to_float32;
+	rte_ml_io_float32_to_float16;
+	rte_ml_io_float16_to_float32;
+	rte_ml_io_float32_to_bfloat16;
+	rte_ml_io_bfloat16_to_float32;
 };
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v5 4/4] mldev: add Arm NEON type conversion routines
  2023-02-01  9:12 ` [PATCH v5 0/4] Implementation of ML common code Srikanth Yalavarthi
                     ` (2 preceding siblings ...)
  2023-02-01  9:12   ` [PATCH v5 3/4] mldev: add scalar type conversion functions Srikanth Yalavarthi
@ 2023-02-01  9:12   ` Srikanth Yalavarthi
  3 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-01  9:12 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Ruifeng Wang; +Cc: dev, sshankarnara, jerinj, aprabhu

Added Arm NEON intrinsic-based implementations to support conversion
of data types. Support is enabled to handle int8, uint8, int16, uint16,
float16, float32 and bfloat16 types.
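
A note on the rounding mode: the vcvtaq_* intrinsics below convert using
round to nearest with ties away from zero, the same mode as the round()
calls in the scalar versions, so the two paths agree on ties. A small
standalone check of that behaviour (illustrative only; link with -lm):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	/* round() rounds halfway cases away from zero */
	printf("%.0f %.0f %.0f\n", round(0.5), round(1.5), round(-0.5));
	/* expected output: 1 2 -1 */
	return 0;
}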

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v2:
* Dropped use of driver routines to call NEON functions
* Optimized NEON functions to reduce the number of intrinsic calls.

 lib/mldev/meson.build        |   4 +
 lib/mldev/mldev_utils_neon.c | 873 +++++++++++++++++++++++++++++++++++
 2 files changed, 877 insertions(+)
 create mode 100644 lib/mldev/mldev_utils_neon.c

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index fce9c0ebee..05694b0839 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -8,6 +8,10 @@ sources = files(
         'mldev_utils_scalar.c',
 )

+if arch_subdir == 'arm'
+    sources += files('mldev_utils_neon.c')
+endif
+
 headers = files(
         'rte_mldev.h',
 )
diff --git a/lib/mldev/mldev_utils_neon.c b/lib/mldev/mldev_utils_neon.c
new file mode 100644
index 0000000000..32b620db20
--- /dev/null
+++ b/lib/mldev/mldev_utils_neon.c
@@ -0,0 +1,873 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+#include "mldev_utils.h"
+
+#include <arm_neon.h>
+
+/* Description:
+ * This file implements vector versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa. Implementation is based on Arm
+ * Neon intrinsics.
+ */
+
+static inline void
+__float32_to_int8_neon_s8x8(float scale, float *input, int8_t *output)
+{
+	int16x4_t s16x4_l;
+	int16x4_t s16x4_h;
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_l = vqmovn_s32(s32x4);
+
+	/* load next 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_h = vqmovn_s32(s32x4);
+
+	/* combine lower and higher int16x4_t to int16x8_t */
+	s16x8 = vcombine_s16(s16x4_l, s16x4_h);
+
+	/* narrow to int8_t */
+	s8x8 = vqmovn_s16(s16x8);
+
+	/* store 8 elements */
+	vst1_s8(output, s8x8);
+}
+
+static inline void
+__float32_to_int8_neon_s8x1(float scale, float *input, int8_t *output)
+{
+	int32_t s32;
+	int16_t s16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	s16 = vqmovns_s32(s32);
+
+	/* convert to int8_t */
+	*output = vqmovnh_s16(s16);
+}
+
+int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int8_neon_s8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int8_neon_s8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__int8_to_float32_neon_f32x8(float scale, int8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 8 x int8_t elements */
+	s8x8 = vld1_s8(input);
+
+	/* widen int8_t to int16_t */
+	s16x8 = vmovl_s8(s8x8);
+
+	/* convert lower 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_low_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_high_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__int8_to_float32_neon_f32x1(float scale, int8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_uint8_neon_u8x8(float scale, float *input, uint8_t *output)
+{
+	uint16x4_t u16x4_l;
+	uint16x4_t u16x4_h;
+	float32x4_t f32x4;
+	uint32x4_t u32x4;
+	uint16x8_t u16x8;
+	uint8x8_t u8x8;
+
+	/* load 4 float elements, scale, convert, saturate narrow to uint16_t.
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_l = vqmovn_u32(u32x4);
+
+	/* load next 4 float elements, scale, convert, saturate narrow to uint16_t
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_h = vqmovn_u32(u32x4);
+
+	/* combine lower and higher uint16x4_t */
+	u16x8 = vcombine_u16(u16x4_l, u16x4_h);
+
+	/* narrow to uint8x8_t */
+	u8x8 = vqmovn_u16(u16x8);
+
+	/* store 8 elements */
+	vst1_u8(output, u8x8);
+}
+
+static inline void
+__float32_to_uint8_neon_u8x1(float scale, float *input, uint8_t *output)
+{
+	uint32_t u32;
+	uint16_t u16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	u16 = vqmovns_u32(u32);
+
+	/* convert to uint8_t */
+	*output = vqmovnh_u16(u16);
+}
+
+int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint8_neon_u8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint8_neon_u8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint8_to_float32_neon_f32x8(float scale, uint8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x8_t u16x8;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+	uint8x8_t u8x8;
+
+	/* load 8 x uint8_t elements */
+	u8x8 = vld1_u8(input);
+
+	/* widen uint8_t to uint16_t */
+	u16x8 = vmovl_u8(u8x8);
+
+	/* convert lower 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_low_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_high_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__uint8_to_float32_neon_f32x1(float scale, uint8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint64_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_int16_neon_s16x4(float scale, float *input, int16_t *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert to int32x4_t using round to nearest with ties away rounding mode */
+	s32x4 = vcvtaq_s32_f32(f32x4);
+
+	/* saturate narrow to int16x4_t */
+	s16x4 = vqmovn_s32(s32x4);
+
+	/* store 4 elements */
+	vst1_s16(output, s16x4);
+}
+
+static inline void
+__float32_to_int16_neon_s16x1(float scale, float *input, int16_t *output)
+{
+	int32_t s32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_s32(s32);
+}
+
+int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int16_neon_s16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int16_neon_s16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__int16_to_float32_neon_f32x4(float scale, int16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x int16_t elements */
+	s16x4 = vld1_s16(input);
+
+	/* widen int16_t to int32_t */
+	s32x4 = vmovl_s16(s16x4);
+
+	/* convert int32_t to float */
+	f32x4 = vcvtq_f32_s32(s32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__int16_to_float32_neon_f32x1(float scale, int16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_uint16_neon_u16x4(float scale, float *input, uint16_t *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert using round to nearest with ties away rounding mode */
+	u32x4 = vcvtaq_u32_f32(f32x4);
+
+	/* saturate narrow */
+	u16x4 = vqmovn_u32(u32x4);
+
+	/* store 4 elements */
+	vst1_u16(output, u16x4);
+}
+
+static inline void
+__float32_to_uint16_neon_u16x1(float scale, float *input, uint16_t *output)
+{
+	uint32_t u32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_u32(u32);
+}
+
+int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint64_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint16_neon_u16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint16_neon_u16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint16_to_float32_neon_f32x4(float scale, uint16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 x uint16_t elements */
+	u16x4 = vld1_u16(input);
+
+	/* widen uint16_t to uint32_t */
+	u32x4 = vmovl_u16(u16x4);
+
+	/* convert uint32_t to float */
+	f32x4 = vcvtq_f32_u32(u32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__uint16_to_float32_neon_f32x1(float scale, uint16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_float16_neon_f16x4(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert to float16x4_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store float16x4_t */
+	vst1_f16(output, f16x4);
+}
+
+static inline void
+__float32_to_float16_neon_f16x1(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to float16_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store lane 0 (1 element) */
+	vst1_lane_f16(output, f16x4, 0);
+}
+
+int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	float16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (float16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_float16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_float16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float16_to_float32_neon_f32x4(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x float16_t elements */
+	f16x4 = vld1_f16(input);
+
+	/* convert float16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__float16_to_float32_neon_f32x1(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	f16x4 = vld1_dup_f16(input);
+
+	/* convert float16_t to float32_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	float16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#ifdef __ARM_FEATURE_BF16
+
+static inline void
+__float32_to_bfloat16_neon_f16x4(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert float32x4_t to bfloat16x4_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store bfloat16x4_t */
+	vst1_bf16(output, bf16x4);
+}
+
+static inline void
+__float32_to_bfloat16_neon_f16x1(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to bfloat16_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store lane 0 (1 element) */
+	vst1_lane_bf16(output, bf16x4, 0);
+}
+
+int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	bfloat16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (bfloat16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_bfloat16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_bfloat16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x4(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x bfloat16_t elements */
+	bf16x4 = vld1_bf16(input);
+
+	/* convert bfloat16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x1(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	bf16x4 = vld1_dup_bf16(input);
+
+	/* convert bfloat16_t to float32_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store lane 0 (1 element) */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	bfloat16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (bfloat16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__bfloat16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__bfloat16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#endif /* __ARM_FEATURE_BF16 */
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
  2023-01-31 13:44                         ` Srikanth Yalavarthi
@ 2023-02-01  9:15                           ` Srikanth Yalavarthi
  0 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-01  9:15 UTC (permalink / raw)
  To: Jerin Jacob, Thomas Monjalon
  Cc: Shivah Shankar Shankar Narayan Rao, dev,
	Jerin Jacob Kollanukkaran, Anup Prabhu, ferruh.yigit,
	bruce.richardson, david.marchand, Srikanth Yalavarthi




> -----Original Message-----
> From: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Sent: 31 January 2023 19:15
> To: Jerin Jacob <jerinjacobk@gmail.com>; Thomas Monjalon
> <thomas@monjalon.net>
> Cc: Shivah Shankar Shankar Narayan Rao <sshankarnara@marvell.com>;
> dev@dpdk.org; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Anup
> Prabhu <aprabhu@marvell.com>; ferruh.yigit@amd.com;
> bruce.richardson@intel.com; david.marchand@redhat.com; Srikanth
> Yalavarthi <syalavarthi@marvell.com>; Srikanth Yalavarthi
> <syalavarthi@marvell.com>
> Subject: RE: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
> 
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: 27 January 2023 15:58
> > To: Thomas Monjalon <thomas@monjalon.net>
> > Cc: Shivah Shankar Shankar Narayan Rao <sshankarnara@marvell.com>;
> > Srikanth Yalavarthi <syalavarthi@marvell.com>; dev@dpdk.org; Jerin
> > Jacob Kollanukkaran <jerinj@marvell.com>; Anup Prabhu
> > <aprabhu@marvell.com>; ferruh.yigit@amd.com;
> > bruce.richardson@intel.com; david.marchand@redhat.com; Srikanth
> > Yalavarthi <syalavarthi@marvell.com>
> > Subject: Re: [EXT] Re: [PATCH v3 0/4] implementation of ML common code
> >
> > On Fri, Jan 27, 2023 at 2:56 PM Thomas Monjalon <thomas@monjalon.net>
> > wrote:
> > >
> > > 27/01/2023 10:02, Jerin Jacob:
> > > > On Fri, Jan 27, 2023 at 2:20 PM Thomas Monjalon
> > <thomas@monjalon.net> wrote:
> > > > > 27/01/2023 07:40, Jerin Jacob:
> > > > > > On Thu, Jan 26, 2023 at 4:27 PM Thomas Monjalon
> > <thomas@monjalon.net> wrote:
> > > > > > > 25/01/2023 15:59, Srikanth Yalavarthi:
> > > > > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > > > > 25/01/2023 14:25, Srikanth Yalavarthi:
> > > > > > > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > > > > > > 20/12/2022 18:52, Srikanth Yalavarthi:
> > > > > > > > > > > > This patch series implements the common ML code
> > > > > > > > > > > > that can be used by ML drivers. Common code
> > > > > > > > > > > > include functions to convert ML IO type to string,
> > > > > > > > > > > > IO format type to string, function get size of ML
> > > > > > > > > > > > IO type, and functions for converting data types
> > > > > > > > > > > > from higher precision to
> > lower precision and vice-versa.
> > > > > > > > > > >
> > > > > > > > > > > I'm not sure about the path of this code.
> > > > > > > > > > > In general we implement drivers helper in the same
> > > > > > > > > > > directory as the driver and mark it as internal.
> > > > > > > > > > > Would it work here?
> > > > > > > > > >
> > > > > > > > > > We are planning to implement two different ML drivers,
> > > > > > > > > > ml/cnxk driver
> > > > > > > > > (submitted for review) and a software only driver (part
> > > > > > > > > of ML roadmap and currently WIP). Both the drivers would
> > > > > > > > > be using these common functions for quantization and
> > > > > > > > > dequantization. Hence, placed the files in common/ml
> directory.
> > > > > > > > > >
> > > > > > > > > > Moreover, these functions are used to convert data
> > > > > > > > > > from higher to lower
> > > > > > > > > precision or vice-versa and  can also be used by future
> > > > > > > > > ML drivers for other platforms.
> > > > > > > > >
> > > > > > > > > I understand, and what you say does not contradict with
> > > > > > > > > having this code in lib/mldev/.
> > > > > > > > > So would you agree to move?
> > > > > > > >
> > > > > > > > These common functions do not have an rte_ml_dev_ prefix.
> > > > > > >
> > > > > > > As it is exported, it should have rte_ prefix.
> > > > > >
> > > > > > The exposed functions are similar to lib/ethdev/sff_* where
> > > > > > multiple driver can "use" it but not by application directly.
> > > > > > If so, What is the recommendation
> > > > > > a) Keeping driver/common/ml without rte_prefix
> > > > > > b) Keeping in lib/mldev/ with rte_mldev_pmd_ prefix?
> > > > > >
> > > > > > I prefer (a) as it will not pollute lib/mldev. No strong
> > > > > > opinion, either. Let me know your view or any other suggestion?
> > > > >
> > > > > I don't see it as pollution, it comes with the library, so I
> > > > > prefer lib/mldev/ with rte_mldev_pmd_ prefix.
> > > > >
> > > > >
> > > > > > > Is it ok to have non-RTE code in lib/mldev. If yes, we can
> > > > > > > move to
> > lib/mldev.
> > > > > > >
> > > > > > > Look at lib/ethdev/ethdev_driver.h, it should be similar.
> > > > > >
> > > > > > Here scope is different. See above.
> > > > >
> > > > > No the scope is not different.
> > > > > They are functions used by drivers not by application.
> > > >
> > > > When you say lib/ethdev/ethdev_driver.h. You mean "struct
> > eth_dev_ops" scheme.
> > >
> > > No I don't mean that. Did you check the internal functions in this file?
> > > I mean functions like rte_eth_dev_allocate() or
> > rte_eth_dev_attach_secondary().
> >
> > Got it. Let's change to rte_ml_pmd_ prefix and add to lib/mldev then.
> 
> Considering the scope of these functions, I think the rte_ml_io_ prefix is
> more suitable than rte_ml_pmd_. It would also be in line with internal
> functions defined in other libraries.
> 
> I can push a revised patch series accordingly.
Updated the series. Moved the code to lib/mldev and added rte_ml_io_ prefix to the functions.
http://patches.dpdk.org/project/dpdk/list/?series=26731


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v4 2/4] mldev: implement ML IO type handling functions
  2023-02-01  9:04   ` [PATCH v4 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
@ 2023-02-01 13:53     ` Anup Prabhu
  2023-02-01 14:01     ` Anup Prabhu
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Anup Prabhu @ 2023-02-01 13:53 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Srikanth Yalavarthi
  Cc: dev, Shivah Shankar Shankar Narayan Rao,
	Jerin Jacob Kollanukkaran, Prince Takkar, Parijat Shukla




-----Original Message-----
From: Srikanth Yalavarthi <syalavarthi@marvell.com> 
Sent: Wednesday, February 1, 2023 2:35 PM
To: Srikanth Yalavarthi <syalavarthi@marvell.com>
Cc: dev@dpdk.org; Shivah Shankar Shankar Narayan Rao <sshankarnara@marvell.com>; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Anup Prabhu <aprabhu@marvell.com>
Subject: [PATCH v4 2/4] mldev: implement ML IO type handling functions

Implemented ML utility routines to convert an IO data type or format to its name and to get the size of an IO data type in bytes.
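
As a usage illustration (not part of the patch), a sketch of how a driver
might call these helpers; the 16-byte name buffer is an arbitrary choice:

#include <stdio.h>

#include <rte_mldev.h>

#include "mldev_utils.h"

static void
dump_type(enum rte_ml_io_type type)
{
	char name[16];
	int size;

	size = rte_ml_io_type_size_get(type);
	rte_ml_io_type_to_str(type, name, sizeof(name));
	printf("%s: %d bytes\n", name, size);
}

/* dump_type(RTE_ML_IO_TYPE_FP16) would print "float16: 2 bytes" */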

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
 lib/mldev/mldev_utils.c | 113 ++++++++++++++++++++++++++++++++++++++++
 lib/mldev/version.map   |   4 ++
 2 files changed, 117 insertions(+)

Acked-by: Anup Prabhu <aprabhu@marvell.com>

diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
index 9dbbf013a0..d2442b123b 100644
--- a/lib/mldev/mldev_utils.c
+++ b/lib/mldev/mldev_utils.c
@@ -2,4 +2,117 @@
  * Copyright (c) 2022 Marvell.
  */
 
+#include <errno.h>
+#include <stdint.h>
+
+#include <rte_mldev.h>
+#include <rte_string_fns.h>
+
 #include "mldev_utils.h"
+
+/* Description:
+ * This file implements Machine Learning utility routines, except type conversion routines.
+ */
+
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		return -EINVAL;
+	case RTE_ML_IO_TYPE_INT8:
+		return sizeof(int8_t);
+	case RTE_ML_IO_TYPE_UINT8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_INT16:
+		return sizeof(int16_t);
+	case RTE_ML_IO_TYPE_UINT16:
+		return sizeof(uint16_t);
+	case RTE_ML_IO_TYPE_INT32:
+		return sizeof(int32_t);
+	case RTE_ML_IO_TYPE_UINT32:
+		return sizeof(uint32_t);
+	case RTE_ML_IO_TYPE_FP8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_FP16:
+		return sizeof(uint8_t) * 2;
+	case RTE_ML_IO_TYPE_FP32:
+		return sizeof(uint8_t) * 4;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		return sizeof(uint8_t) * 2;
+	default:
+		return -EINVAL;
+	}
+}
+
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		rte_strlcpy(str, "unknown", len);
+		break;
+	case RTE_ML_IO_TYPE_INT8:
+		rte_strlcpy(str, "int8", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT8:
+		rte_strlcpy(str, "uint8", len);
+		break;
+	case RTE_ML_IO_TYPE_INT16:
+		rte_strlcpy(str, "int16", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT16:
+		rte_strlcpy(str, "uint16", len);
+		break;
+	case RTE_ML_IO_TYPE_INT32:
+		rte_strlcpy(str, "int32", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT32:
+		rte_strlcpy(str, "uint32", len);
+		break;
+	case RTE_ML_IO_TYPE_FP8:
+		rte_strlcpy(str, "float8", len);
+		break;
+	case RTE_ML_IO_TYPE_FP16:
+		rte_strlcpy(str, "float16", len);
+		break;
+	case RTE_ML_IO_TYPE_FP32:
+		rte_strlcpy(str, "float32", len);
+		break;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		rte_strlcpy(str, "bfloat16", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
+
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
+{
+	switch (format) {
+	case RTE_ML_IO_FORMAT_NCHW:
+		rte_strlcpy(str, "NCHW", len);
+		break;
+	case RTE_ML_IO_FORMAT_NHWC:
+		rte_strlcpy(str, "NHWC", len);
+		break;
+	case RTE_ML_IO_FORMAT_CHWN:
+		rte_strlcpy(str, "CHWN", len);
+		break;
+	case RTE_ML_IO_FORMAT_3D:
+		rte_strlcpy(str, "3D", len);
+		break;
+	case RTE_ML_IO_FORMAT_2D:
+		rte_strlcpy(str, "Matrix", len);
+		break;
+	case RTE_ML_IO_FORMAT_1D:
+		rte_strlcpy(str, "Vector", len);
+		break;
+	case RTE_ML_IO_FORMAT_SCALAR:
+		rte_strlcpy(str, "Scalar", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
diff --git a/lib/mldev/version.map b/lib/mldev/version.map
index 61955ab701..c2ceedfbb4 100644
--- a/lib/mldev/version.map
+++ b/lib/mldev/version.map
@@ -46,4 +46,8 @@ INTERNAL {
 	rte_ml_dev_pmd_get_dev;
 	rte_ml_dev_pmd_get_named_dev;
 	rte_ml_dev_pmd_release;
+
+	rte_ml_io_type_size_get;
+	rte_ml_io_type_to_str;
+	rte_ml_io_format_to_str;
 };
--
2.17.1



^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v4 1/4] mldev: add headers for internal ML functions
  2023-02-01  9:04   ` [PATCH v4 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
@ 2023-02-01 13:54     ` Anup Prabhu
  2023-02-01 15:28       ` Thomas Monjalon
  0 siblings, 1 reply; 59+ messages in thread
From: Anup Prabhu @ 2023-02-01 13:54 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Srikanth Yalavarthi
  Cc: dev, Shivah Shankar Shankar Narayan Rao,
	Jerin Jacob Kollanukkaran, Prince Takkar, Parijat Shukla




-----Original Message-----
From: Srikanth Yalavarthi <syalavarthi@marvell.com> 
Sent: Wednesday, February 1, 2023 2:35 PM
To: Srikanth Yalavarthi <syalavarthi@marvell.com>
Cc: dev@dpdk.org; Shivah Shankar Shankar Narayan Rao <sshankarnara@marvell.com>; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Anup Prabhu <aprabhu@marvell.com>
Subject: [PATCH v4 1/4] mldev: add headers for internal ML functions

Added header files for internal ML utility routines that convert IO type and format to string, get the size of an IO type, and convert data between types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
 lib/mldev/meson.build   |   2 +
 lib/mldev/mldev_utils.c |   5 +
 lib/mldev/mldev_utils.h | 345 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 352 insertions(+)
 create mode 100644 lib/mldev/mldev_utils.c
 create mode 100644 lib/mldev/mldev_utils.h

Acked-by: Anup Prabhu <aprabhu@marvell.com>

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index 5c99532c1a..452b83a480 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -4,6 +4,7 @@
 sources = files(
         'rte_mldev_pmd.c',
         'rte_mldev.c',
+        'mldev_utils.c',
 )
 
 headers = files(
@@ -16,6 +17,7 @@ indirect_headers += files(
 
 driver_sdk_headers += files(
         'rte_mldev_pmd.h',
+        'mldev_utils.h',
 )
 
 deps += ['mempool']
diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
new file mode 100644
index 0000000000..9dbbf013a0
--- /dev/null
+++ b/lib/mldev/mldev_utils.c
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include "mldev_utils.h"
diff --git a/lib/mldev/mldev_utils.h b/lib/mldev/mldev_utils.h
new file mode 100644
index 0000000000..04cdaab567
--- /dev/null
+++ b/lib/mldev/mldev_utils.h
@@ -0,0 +1,345 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _RTE_MLDEV_UTILS_H_
+#define _RTE_MLDEV_UTILS_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file
+ *
+ * RTE ML Device PMD utility API
+ *
+ * These APIs are for use by ML drivers; user applications shouldn't use them.
+ *
+ */
+
+#include <rte_compat.h>
+#include <rte_mldev.h>
+
+/**
+ * @internal
+ *
+ * Get the size of an ML IO type in bytes.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ *
+ * @return
+ *	- > 0, Size of the data type in bytes.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO type.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ * @param[in] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO format.
+ *
+ * @param[in] format
+ *	Enumeration of ML IO format.
+ * @param[in] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed 8-bit
+ * integer format (INT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 8-bit integer format (INT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 8-bit integer format (UINT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 8-bit integer format (UINT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed
+ * 16-bit integer format (INT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 16-bit integer format (INT16) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 16-bit integer format (UINT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 16-bit integer format (UINT16) to single
+ * precision floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to half
+ * precision floating point format (FP16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in half precision floating format (FP16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to brain
+ * floating point format (bfloat16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store bfloat16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in brain floating point format (bfloat16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing bfloat16 numbers. Size of buffer is equal to (nb_elements * 2)
+ * bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MLDEV_UTILS_H_ */
--
2.17.1



^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v4 2/4] mldev: implement ML IO type handling functions
  2023-02-01  9:04   ` [PATCH v4 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
  2023-02-01 13:53     ` Anup Prabhu
@ 2023-02-01 14:01     ` Anup Prabhu
  2023-02-01 14:15     ` Anup Prabhu
  2023-02-01 14:26     ` Anup Prabhu
  3 siblings, 0 replies; 59+ messages in thread
From: Anup Prabhu @ 2023-02-01 14:01 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Srikanth Yalavarthi
  Cc: dev, Shivah Shankar Shankar Narayan Rao,
	Jerin Jacob Kollanukkaran, Prince Takkar, Parijat Shukla




> -----Original Message-----
> From: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Sent: Wednesday, February 1, 2023 2:35 PM
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Cc: dev@dpdk.org; Shivah Shankar Shankar Narayan Rao
> <sshankarnara@marvell.com>; Jerin Jacob Kollanukkaran
> <jerinj@marvell.com>; Anup Prabhu <aprabhu@marvell.com>
> Subject: [PATCH v4 2/4] mldev: implement ML IO type handling functions
> 
> Implemented ML utility functions to convert IO data type to name, IO format
> to name and routine to get the size of an IO data type in bytes.
> 
> Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
> ---
>  lib/mldev/mldev_utils.c | 113 ++++++++++++++++++++++++++++++++++++++++
>  lib/mldev/version.map   |   4 ++
>  2 files changed, 117 insertions(+)
> 

Acked-by: Anup Prabhu <aprabhu@marvell.com>

> diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
> index 9dbbf013a0..d2442b123b 100644
> --- a/lib/mldev/mldev_utils.c
> +++ b/lib/mldev/mldev_utils.c
> @@ -2,4 +2,117 @@
>   * Copyright (c) 2022 Marvell.
>   */
> 
> +#include <errno.h>
> +#include <stdint.h>
> +
> +#include <rte_mldev.h>
> +#include <rte_string_fns.h>
> +
>  #include "mldev_utils.h"
> +
> +/* Description:
> + * This file implements Machine Learning utility routines, except type conversion routines.
> + */
> +
> +int
> +rte_ml_io_type_size_get(enum rte_ml_io_type type)
> +{
> +	switch (type) {
> +	case RTE_ML_IO_TYPE_UNKNOWN:
> +		return -EINVAL;
> +	case RTE_ML_IO_TYPE_INT8:
> +		return sizeof(int8_t);
> +	case RTE_ML_IO_TYPE_UINT8:
> +		return sizeof(uint8_t);
> +	case RTE_ML_IO_TYPE_INT16:
> +		return sizeof(int16_t);
> +	case RTE_ML_IO_TYPE_UINT16:
> +		return sizeof(uint16_t);
> +	case RTE_ML_IO_TYPE_INT32:
> +		return sizeof(int32_t);
> +	case RTE_ML_IO_TYPE_UINT32:
> +		return sizeof(uint32_t);
> +	case RTE_ML_IO_TYPE_FP8:
> +		return sizeof(uint8_t);
> +	case RTE_ML_IO_TYPE_FP16:
> +		return sizeof(uint8_t) * 2;
> +	case RTE_ML_IO_TYPE_FP32:
> +		return sizeof(uint8_t) * 4;
> +	case RTE_ML_IO_TYPE_BFLOAT16:
> +		return sizeof(uint8_t) * 2;
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +void
> +rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len)
> +{
> +	switch (type) {
> +	case RTE_ML_IO_TYPE_UNKNOWN:
> +		rte_strlcpy(str, "unknown", len);
> +		break;
> +	case RTE_ML_IO_TYPE_INT8:
> +		rte_strlcpy(str, "int8", len);
> +		break;
> +	case RTE_ML_IO_TYPE_UINT8:
> +		rte_strlcpy(str, "uint8", len);
> +		break;
> +	case RTE_ML_IO_TYPE_INT16:
> +		rte_strlcpy(str, "int16", len);
> +		break;
> +	case RTE_ML_IO_TYPE_UINT16:
> +		rte_strlcpy(str, "uint16", len);
> +		break;
> +	case RTE_ML_IO_TYPE_INT32:
> +		rte_strlcpy(str, "int32", len);
> +		break;
> +	case RTE_ML_IO_TYPE_UINT32:
> +		rte_strlcpy(str, "uint32", len);
> +		break;
> +	case RTE_ML_IO_TYPE_FP8:
> +		rte_strlcpy(str, "float8", len);
> +		break;
> +	case RTE_ML_IO_TYPE_FP16:
> +		rte_strlcpy(str, "float16", len);
> +		break;
> +	case RTE_ML_IO_TYPE_FP32:
> +		rte_strlcpy(str, "float32", len);
> +		break;
> +	case RTE_ML_IO_TYPE_BFLOAT16:
> +		rte_strlcpy(str, "bfloat16", len);
> +		break;
> +	default:
> +		rte_strlcpy(str, "invalid", len);
> +	}
> +}
> +
> +void
> +rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
> +{
> +	switch (format) {
> +	case RTE_ML_IO_FORMAT_NCHW:
> +		rte_strlcpy(str, "NCHW", len);
> +		break;
> +	case RTE_ML_IO_FORMAT_NHWC:
> +		rte_strlcpy(str, "NHWC", len);
> +		break;
> +	case RTE_ML_IO_FORMAT_CHWN:
> +		rte_strlcpy(str, "CHWN", len);
> +		break;
> +	case RTE_ML_IO_FORMAT_3D:
> +		rte_strlcpy(str, "3D", len);
> +		break;
> +	case RTE_ML_IO_FORMAT_2D:
> +		rte_strlcpy(str, "Matrix", len);
> +		break;
> +	case RTE_ML_IO_FORMAT_1D:
> +		rte_strlcpy(str, "Vector", len);
> +		break;
> +	case RTE_ML_IO_FORMAT_SCALAR:
> +		rte_strlcpy(str, "Scalar", len);
> +		break;
> +	default:
> +		rte_strlcpy(str, "invalid", len);
> +	}
> +}
> diff --git a/lib/mldev/version.map b/lib/mldev/version.map
> index 61955ab701..c2ceedfbb4 100644
> --- a/lib/mldev/version.map
> +++ b/lib/mldev/version.map
> @@ -46,4 +46,8 @@ INTERNAL {
>  	rte_ml_dev_pmd_get_dev;
>  	rte_ml_dev_pmd_get_named_dev;
>  	rte_ml_dev_pmd_release;
> +
> +	rte_ml_io_type_size_get;
> +	rte_ml_io_type_to_str;
> +	rte_ml_io_format_to_str;
>  };
> --
> 2.17.1



^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v4 2/4] mldev: implement ML IO type handling functions
  2023-02-01  9:04   ` [PATCH v4 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
                       ` (2 preceding siblings ...)
  2023-02-01 14:15     ` Anup Prabhu
@ 2023-02-01 14:26     ` Anup Prabhu
  3 siblings, 0 replies; 59+ messages in thread
From: Anup Prabhu @ 2023-02-01 14:26 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Srikanth Yalavarthi
  Cc: dev, Shivah Shankar Shankar Narayan Rao,
	Jerin Jacob Kollanukkaran, Prince Takkar, Parijat Shukla




> -----Original Message-----
> From: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Sent: Wednesday, February 1, 2023 2:35 PM
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Cc: dev@dpdk.org; Shivah Shankar Shankar Narayan Rao
> <sshankarnara@marvell.com>; Jerin Jacob Kollanukkaran
> <jerinj@marvell.com>; Anup Prabhu <aprabhu@marvell.com>
> Subject: [PATCH v4 2/4] mldev: implement ML IO type handling functions
> 
> Implemented ML utility functions to convert IO data type to name, IO format
> to name and routine to get the size of an IO data type in bytes.
> 
> Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
> ---
>  lib/mldev/mldev_utils.c | 113 ++++++++++++++++++++++++++++++++++++++++
>  lib/mldev/version.map   |   4 ++
>  2 files changed, 117 insertions(+)
> 

Acked-by: Anup Prabhu <aprabhu@marvell.com>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 1/4] mldev: add headers for internal ML functions
  2023-02-01 13:54     ` Anup Prabhu
@ 2023-02-01 15:28       ` Thomas Monjalon
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Monjalon @ 2023-02-01 15:28 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Shivah Shankar Shankar Narayan Rao, Anup Prabhu
  Cc: dev, Jerin Jacob Kollanukkaran, Prince Takkar, Parijat Shukla

Hi Anup,

Your ack is lost in patch lines.
Please make sure your email client is quoting the original email lines,
so we can distinguish your answer from the original email.

Also you can drop useless lines when replying (see how I removed patch content here).


01/02/2023 14:54, Anup Prabhu:
> 
> -----Original Message-----
> From: Srikanth Yalavarthi <syalavarthi@marvell.com> 
> Sent: Wednesday, February 1, 2023 2:35 PM
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Cc: dev@dpdk.org; Shivah Shankar Shankar Narayan Rao <sshankarnara@marvell.com>; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Anup Prabhu <aprabhu@marvell.com>
> Subject: [PATCH v4 1/4] mldev: add headers for internal ML functions
> 
> Added header files for internal ML utility routines to convert IO type and format to string, IO type to size and routines to convert data types.
> 
> Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
> ---
>  lib/mldev/meson.build   |   2 +
>  lib/mldev/mldev_utils.c |   5 +
>  lib/mldev/mldev_utils.h | 345 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 352 insertions(+)
>  create mode 100644 lib/mldev/mldev_utils.c  create mode 100644 lib/mldev/mldev_utils.h
> 
> Acked-by: Anup Prabhu <aprabhu@marvell.com>




^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v5 2/4] mldev: implement ML IO type handling functions
  2023-02-01  9:12   ` [PATCH v5 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
@ 2023-02-02  4:20     ` Anup Prabhu
  0 siblings, 0 replies; 59+ messages in thread
From: Anup Prabhu @ 2023-02-02  4:20 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Srikanth Yalavarthi
  Cc: dev, Shivah Shankar Shankar Narayan Rao,
	Jerin Jacob Kollanukkaran, Parijat Shukla, Prince Takkar




> -----Original Message-----
> From: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Sent: Wednesday, February 1, 2023 2:43 PM
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Cc: dev@dpdk.org; Shivah Shankar Shankar Narayan Rao
> <sshankarnara@marvell.com>; Jerin Jacob Kollanukkaran
> <jerinj@marvell.com>; Anup Prabhu <aprabhu@marvell.com>
> Subject: [PATCH v5 2/4] mldev: implement ML IO type handling functions
> 
> Implemented ML utility functions to convert IO data type to name, IO format
> to name and routine to get the size of an IO data type in bytes.
> 
> Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>

Acked-by: Anup Prabhu <aprabhu@marvell.com>



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v6 0/4] Implementation of ML common code
  2022-12-08 19:35 [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
                   ` (6 preceding siblings ...)
  2023-02-01  9:12 ` [PATCH v5 0/4] Implementation of ML common code Srikanth Yalavarthi
@ 2023-02-07 16:00 ` Srikanth Yalavarthi
  2023-02-07 16:00   ` [PATCH v6 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
                     ` (4 more replies)
  7 siblings, 5 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-07 16:00 UTC (permalink / raw)
  Cc: dev, sshankarnara, jerinj, aprabhu, ptakkar, pshukla,
	Srikanth Yalavarthi

Machine Learning common code
----------------------------

This patch series implements the common ML code that can be used by
ML drivers. The common code includes functions to convert an ML IO
type to a string and an IO format type to a string, a function to
get the size of an ML IO type, and functions for converting data
types from higher precision to lower precision and vice-versa.
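
As a usage sketch (illustrative only; these are internal, driver-facing
calls, and the wrapper program below is hypothetical), quantizing a
float32 buffer to int8 and dequantizing it back with the scale
convention used by the series:

#include <stdint.h>
#include <stdio.h>
#include "mldev_utils.h"	/* internal header added by this series */

int
main(void)
{
	float in[4] = {0.1f, -0.5f, 1.0f, -2.0f};
	int8_t q[4];
	float out[4];
	float scale = 64.0f;	/* one int8 step == 1/scale */

	/* quantize: q[i] = clamp(round(in[i] * scale), INT8_MIN, INT8_MAX) */
	rte_ml_io_float32_to_int8(scale, 4, in, q);
	/* dequantize: out[i] = q[i] * (1 / scale) */
	rte_ml_io_int8_to_float32(1.0f / scale, 4, q, out);

	printf("%f -> %d -> %f\n", in[0], q[0], out[0]);
	return 0;
}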

The data type conversion functions support float32, float16,
bfloat16, uint8, int8, uint16 and int16. Two versions of the
conversion functions are implemented in the series: a generic
scalar version and a vector version using Arm NEON intrinsics.
When DPDK is compiled for a platform that supports Arm NEON, the
NEON version of the routines is enabled. Compilation falls back to
the generic scalar versions on platforms such as x86_64 or PowerPC
that don't support Arm NEON.
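
The selection relies on weak linkage rather than runtime dispatch: the
scalar definitions carry __rte_weak (see patch 3/4), and the NEON file,
when built, provides strong definitions of the same symbols that the
linker prefers. A minimal sketch of that pattern, with a made-up
function name:

/* mldev_utils_scalar.c: generic default, overridable at link time */
#include <stdint.h>
#include <rte_common.h>	/* __rte_weak */

__rte_weak int
toy_sum_int8(uint64_t n, const int8_t *in, int32_t *out)
{
	uint64_t i;
	int32_t s = 0;

	for (i = 0; i < n; i++)
		s += in[i];
	*out = s;
	return 0;
}

/* mldev_utils_neon.c: compiled only when NEON is available; a strong
 * (non-weak) definition of toy_sum_int8 there overrides the weak
 * scalar one at link time.
 */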

Srikanth Yalavarthi (4):
  mldev: add headers for internal ML functions
  mldev: implement ML IO type handling functions
  mldev: add scalar type conversion functions
  mldev: add Arm NEON type conversion routines

 doc/guides/rel_notes/release_23_03.rst |   5 +
 lib/mldev/meson.build                  |   7 +
 lib/mldev/mldev_utils.c                | 118 ++++
 lib/mldev/mldev_utils.h                | 345 ++++++++++
 lib/mldev/mldev_utils_neon.c           | 873 +++++++++++++++++++++++++
 lib/mldev/mldev_utils_scalar.c         | 720 ++++++++++++++++++++
 lib/mldev/version.map                  |  16 +
 7 files changed, 2084 insertions(+)
 create mode 100644 lib/mldev/mldev_utils.c
 create mode 100644 lib/mldev/mldev_utils.h
 create mode 100644 lib/mldev/mldev_utils_neon.c
 create mode 100644 lib/mldev/mldev_utils_scalar.c

--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v6 1/4] mldev: add headers for internal ML functions
  2023-02-07 16:00 ` [PATCH v6 0/4] Implementation of ML common code Srikanth Yalavarthi
@ 2023-02-07 16:00   ` Srikanth Yalavarthi
  2023-03-09 20:44     ` Thomas Monjalon
  2023-02-07 16:00   ` [PATCH v6 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-07 16:00 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu, ptakkar, pshukla

Added header files declaring internal ML utility routines to convert
IO type and format to string, to get the size of an IO type, and to
convert data types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
Depends-on: series-26858 ("Implementation of mldev test application")

v6:
* Updated release notes and series dependencies

v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v3:
* Skip installation of internal common/ml headers

v2:
* Moved implementation out of patch. Only headers are included.

 doc/guides/rel_notes/release_23_03.rst |   5 +
 lib/mldev/meson.build                  |   2 +
 lib/mldev/mldev_utils.c                |   5 +
 lib/mldev/mldev_utils.h                | 345 +++++++++++++++++++++++++
 4 files changed, 357 insertions(+)
 create mode 100644 lib/mldev/mldev_utils.c
 create mode 100644 lib/mldev/mldev_utils.h

diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index cd1ac98abe..425323241e 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -95,6 +95,11 @@ New Features
   * Test case for inferences from multiple models in ordered mode.
  * Test case for inferences from multiple models in interleaving mode.

+* **Added common driver functions for machine learning device library.**
+
+  * Added functions to translate IO type and format to string.
+  * Added functions to quantize and dequantize inference IO data.
+

 Removed Items
 -------------
diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index 5c99532c1a..452b83a480 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -4,6 +4,7 @@
 sources = files(
         'rte_mldev_pmd.c',
         'rte_mldev.c',
+        'mldev_utils.c',
 )

 headers = files(
@@ -16,6 +17,7 @@ indirect_headers += files(

 driver_sdk_headers += files(
         'rte_mldev_pmd.h',
+        'mldev_utils.h',
 )

 deps += ['mempool']
diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
new file mode 100644
index 0000000000..9dbbf013a0
--- /dev/null
+++ b/lib/mldev/mldev_utils.c
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include "mldev_utils.h"
diff --git a/lib/mldev/mldev_utils.h b/lib/mldev/mldev_utils.h
new file mode 100644
index 0000000000..04cdaab567
--- /dev/null
+++ b/lib/mldev/mldev_utils.h
@@ -0,0 +1,345 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _RTE_MLDEV_UTILS_H_
+#define _RTE_MLDEV_UTILS_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file
+ *
+ * RTE ML Device PMD utility API
+ *
+ * These APIs are for use by ML drivers; user applications shouldn't use them.
+ *
+ */
+
+#include <rte_compat.h>
+#include <rte_mldev.h>
+
+/**
+ * @internal
+ *
+ * Get the size of an ML IO type in bytes.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ *
+ * @return
+ *	- > 0, Size of the data type in bytes.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO type.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ * @param[out] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO format.
+ *
+ * @param[in] format
+ *	Enumeration of ML IO format.
+ * @param[out] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed 8-bit
+ * integer format (INT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 8-bit integer format (INT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 8-bit integer format (UINT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 8-bit integer format (UINT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed
+ * 16-bit integer format (INT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 16-bit integer format (INT16) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 16-bit integer format (UINT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 16-bit integer format (UINT16) to single
+ * precision floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to half
+ * precision floating point format (FP16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in half precision floating format (FP16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to brain
+ * floating point format (bfloat16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store bfloat16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in brain floating point format (bfloat16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing bfloat16 numbers. Size of buffer is equal to (nb_elements * 2)
+ * bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MLDEV_UTILS_H_ */
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v6 2/4] mldev: implement ML IO type handling functions
  2023-02-07 16:00 ` [PATCH v6 0/4] Implementation of ML common code Srikanth Yalavarthi
  2023-02-07 16:00   ` [PATCH v6 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
@ 2023-02-07 16:00   ` Srikanth Yalavarthi
  2023-02-07 16:00   ` [PATCH v6 3/4] mldev: add scalar type conversion functions Srikanth Yalavarthi
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-07 16:00 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu, ptakkar, pshukla

Implemented ML utility functions to convert an IO data type to a
name and an IO format to a name, and a routine to get the size of
an IO data type in bytes.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v2:
* Implemented common utility functions as part of the patch
* Dropped use of driver routines for data conversion functions

 lib/mldev/mldev_utils.c | 113 ++++++++++++++++++++++++++++++++++++++++
 lib/mldev/version.map   |   4 ++
 2 files changed, 117 insertions(+)
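
For context, a sketch of the intended call pattern (illustrative code,
not taken from the patch itself):

#include <stdio.h>
#include "mldev_utils.h"

static void
dump_io_type(enum rte_ml_io_type type)
{
	char name[16];

	rte_ml_io_type_to_str(type, name, sizeof(name));
	/* the size routine returns the width in bytes, or -EINVAL */
	printf("%s: %d bytes\n", name, rte_ml_io_type_size_get(type));
}

For example, dump_io_type(RTE_ML_IO_TYPE_BFLOAT16) prints "bfloat16: 2 bytes".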

diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
index 9dbbf013a0..d2442b123b 100644
--- a/lib/mldev/mldev_utils.c
+++ b/lib/mldev/mldev_utils.c
@@ -2,4 +2,117 @@
  * Copyright (c) 2022 Marvell.
  */

+#include <errno.h>
+#include <stdint.h>
+
+#include <rte_mldev.h>
+#include <rte_string_fns.h>
+
 #include "mldev_utils.h"
+
+/* Description:
+ * This file implements Machine Learning utility routines, except type conversion routines.
+ */
+
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		return -EINVAL;
+	case RTE_ML_IO_TYPE_INT8:
+		return sizeof(int8_t);
+	case RTE_ML_IO_TYPE_UINT8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_INT16:
+		return sizeof(int16_t);
+	case RTE_ML_IO_TYPE_UINT16:
+		return sizeof(uint16_t);
+	case RTE_ML_IO_TYPE_INT32:
+		return sizeof(int32_t);
+	case RTE_ML_IO_TYPE_UINT32:
+		return sizeof(uint32_t);
+	case RTE_ML_IO_TYPE_FP8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_FP16:
+		return sizeof(uint8_t) * 2;
+	case RTE_ML_IO_TYPE_FP32:
+		return sizeof(uint8_t) * 4;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		return sizeof(uint8_t) * 2;
+	default:
+		return -EINVAL;
+	}
+}
+
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		rte_strlcpy(str, "unknown", len);
+		break;
+	case RTE_ML_IO_TYPE_INT8:
+		rte_strlcpy(str, "int8", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT8:
+		rte_strlcpy(str, "uint8", len);
+		break;
+	case RTE_ML_IO_TYPE_INT16:
+		rte_strlcpy(str, "int16", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT16:
+		rte_strlcpy(str, "uint16", len);
+		break;
+	case RTE_ML_IO_TYPE_INT32:
+		rte_strlcpy(str, "int32", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT32:
+		rte_strlcpy(str, "uint32", len);
+		break;
+	case RTE_ML_IO_TYPE_FP8:
+		rte_strlcpy(str, "float8", len);
+		break;
+	case RTE_ML_IO_TYPE_FP16:
+		rte_strlcpy(str, "float16", len);
+		break;
+	case RTE_ML_IO_TYPE_FP32:
+		rte_strlcpy(str, "float32", len);
+		break;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		rte_strlcpy(str, "bfloat16", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
+
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
+{
+	switch (format) {
+	case RTE_ML_IO_FORMAT_NCHW:
+		rte_strlcpy(str, "NCHW", len);
+		break;
+	case RTE_ML_IO_FORMAT_NHWC:
+		rte_strlcpy(str, "NHWC", len);
+		break;
+	case RTE_ML_IO_FORMAT_CHWN:
+		rte_strlcpy(str, "CHWN", len);
+		break;
+	case RTE_ML_IO_FORMAT_3D:
+		rte_strlcpy(str, "3D", len);
+		break;
+	case RTE_ML_IO_FORMAT_2D:
+		rte_strlcpy(str, "Matrix", len);
+		break;
+	case RTE_ML_IO_FORMAT_1D:
+		rte_strlcpy(str, "Vector", len);
+		break;
+	case RTE_ML_IO_FORMAT_SCALAR:
+		rte_strlcpy(str, "Scalar", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
diff --git a/lib/mldev/version.map b/lib/mldev/version.map
index d2b30a991a..9d06659493 100644
--- a/lib/mldev/version.map
+++ b/lib/mldev/version.map
@@ -48,4 +48,8 @@ INTERNAL {
 	rte_ml_dev_pmd_get_dev;
 	rte_ml_dev_pmd_get_named_dev;
 	rte_ml_dev_pmd_release;
+
+	rte_ml_io_type_size_get;
+	rte_ml_io_type_to_str;
+	rte_ml_io_format_to_str;
 };
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v6 3/4] mldev: add scalar type conversion functions
  2023-02-07 16:00 ` [PATCH v6 0/4] Implementation of ML common code Srikanth Yalavarthi
  2023-02-07 16:00   ` [PATCH v6 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
  2023-02-07 16:00   ` [PATCH v6 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
@ 2023-02-07 16:00   ` Srikanth Yalavarthi
  2023-02-07 16:00   ` [PATCH v6 4/4] mldev: add Arm NEON type conversion routines Srikanth Yalavarthi
  2023-03-09 21:37   ` [PATCH v6 0/4] Implementation of ML common code Thomas Monjalon
  4 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-07 16:00 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu, ptakkar, pshukla

Added scalar implementations of the data type conversion routines,
with support for the int8, uint8, int16, uint16, float16, float32
and bfloat16 types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v2:
* Updated internal function names
* Updated function attributes to __rte_weak

 lib/mldev/meson.build          |   1 +
 lib/mldev/mldev_utils_scalar.c | 720 +++++++++++++++++++++++++++++++++
 lib/mldev/version.map          |  12 +
 3 files changed, 733 insertions(+)
 create mode 100644 lib/mldev/mldev_utils_scalar.c
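
The rounding and saturation behaviour of the quantizers below can be
pinned down with a small self-check (a sketch, assuming the
declarations from mldev_utils.h; not part of the patch):

#include <assert.h>
#include <stdint.h>
#include "mldev_utils.h"

static void
check_float32_to_int8(void)
{
	float in[3] = {0.4999f, 200.0f, -200.0f};
	int8_t out[3];

	/* scale == 1: each element is round()ed, then clamped to int8 */
	assert(rte_ml_io_float32_to_int8(1.0f, 3, in, out) == 0);
	assert(out[0] == 0);		/* round(0.4999) == 0 */
	assert(out[1] == INT8_MAX);	/* 200 saturates to 127 */
	assert(out[2] == INT8_MIN);	/* -200 saturates to -128 */
}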

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index 452b83a480..fce9c0ebee 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -5,6 +5,7 @@ sources = files(
         'rte_mldev_pmd.c',
         'rte_mldev.c',
         'mldev_utils.c',
+        'mldev_utils_scalar.c',
 )

 headers = files(
diff --git a/lib/mldev/mldev_utils_scalar.c b/lib/mldev/mldev_utils_scalar.c
new file mode 100644
index 0000000000..40320ed3ef
--- /dev/null
+++ b/lib/mldev/mldev_utils_scalar.c
@@ -0,0 +1,720 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <math.h>
+#include <stdint.h>
+
+#include "mldev_utils.h"
+
+/* Description:
+ * This file implements scalar versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa.
+ */
+
+#ifndef BIT
+#define BIT(nr) (1UL << (nr))
+#endif
+
+#ifndef BITS_PER_LONG
+#define BITS_PER_LONG (__SIZEOF_LONG__ * 8)
+#endif
+
+#ifndef GENMASK_U32
+#define GENMASK_U32(h, l) (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+#endif
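+/* e.g. GENMASK_U32(30, 23) == 0x7f800000, i.e. the float32 exponent field */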
+
+/* float32: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP32_LSB_M 0
+#define FP32_MSB_M 22
+#define FP32_LSB_E 23
+#define FP32_MSB_E 30
+#define FP32_LSB_S 31
+#define FP32_MSB_S 31
+
+/* float32: bitmask for sign, exponent and mantissa */
+#define FP32_MASK_S GENMASK_U32(FP32_MSB_S, FP32_LSB_S)
+#define FP32_MASK_E GENMASK_U32(FP32_MSB_E, FP32_LSB_E)
+#define FP32_MASK_M GENMASK_U32(FP32_MSB_M, FP32_LSB_M)
+
+/* float16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP16_LSB_M 0
+#define FP16_MSB_M 9
+#define FP16_LSB_E 10
+#define FP16_MSB_E 14
+#define FP16_LSB_S 15
+#define FP16_MSB_S 15
+
+/* float16: bitmask for sign, exponent and mantissa */
+#define FP16_MASK_S GENMASK_U32(FP16_MSB_S, FP16_LSB_S)
+#define FP16_MASK_E GENMASK_U32(FP16_MSB_E, FP16_LSB_E)
+#define FP16_MASK_M GENMASK_U32(FP16_MSB_M, FP16_LSB_M)
+
+/* bfloat16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define BF16_LSB_M 0
+#define BF16_MSB_M 6
+#define BF16_LSB_E 7
+#define BF16_MSB_E 14
+#define BF16_LSB_S 15
+#define BF16_MSB_S 15
+
+/* bfloat16: bitmask for sign, exponent and mantissa */
+#define BF16_MASK_S GENMASK_U32(BF16_MSB_S, BF16_LSB_S)
+#define BF16_MASK_E GENMASK_U32(BF16_MSB_E, BF16_LSB_E)
+#define BF16_MASK_M GENMASK_U32(BF16_MSB_M, BF16_LSB_M)
+
+/* Exponent bias */
+#define FP32_BIAS_E 127
+#define FP16_BIAS_E 15
+#define BF16_BIAS_E 127
+
+#define FP32_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP32_LSB_S) | ((exponent) << FP32_LSB_E) | (mantissa))
+
+#define FP16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP16_LSB_S) | ((exponent) << FP16_LSB_E) | (mantissa))
+
+#define BF16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << BF16_LSB_S) | ((exponent) << BF16_LSB_E) | (mantissa))
+
+/* Represent float32 as float and uint32_t */
+union float32 {
+	float f;
+	uint32_t u;
+};
+
+__rte_weak int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t i;
+	int32_t i32;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT8_MIN)
+			i32 = INT8_MIN;
+
+		if (i32 > INT8_MAX)
+			i32 = INT8_MAX;
+
+		*output_buffer = (int8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT8_MAX)
+			i32 = UINT8_MAX;
+
+		*output_buffer = (uint8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT16_MIN)
+			i32 = INT16_MIN;
+
+		if (i32 > INT16_MAX)
+			i32 = INT16_MAX;
+
+		*output_buffer = (int16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT16_MAX)
+			i32 = UINT16_MAX;
+
+		*output_buffer = (uint16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a half precision
+ * floating point number (float16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_float16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint32_t tmsb;	   /* MSB position of truncated bits */
+	uint32_t m_32;	   /* temporary float32 mantissa */
+	uint16_t m_16;	   /* temporary float16 mantissa */
+	uint16_t u16;	   /* float16 output */
+	int be_16;	   /* float16 biased exponent, signed */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	f16_s = f32_s;
+	f16_e = 0;
+	f16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		f16_e = 0;
+		if (f32_m == 0) /* zero */
+			f16_m = 0;
+		else /* subnormal number, convert to zero */
+			f16_m = 0;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		f16_e = FP16_MASK_E >> FP16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			f16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			f16_m = f32_m >> (FP32_MSB_M - FP16_MSB_M);
+			f16_m |= BIT(FP16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number */
+		/* compute biased exponent for float16 */
+		be_16 = (int)f32_e - FP32_BIAS_E + FP16_BIAS_E;
+
+		/* overflow, be_16 = [31-INF], set to infinity */
+		if (be_16 >= (int)(FP16_MASK_E >> FP16_LSB_E)) {
+			f16_e = FP16_MASK_E >> FP16_LSB_E;
+			f16_m = 0;
+		} else if ((be_16 >= 1) && (be_16 < (int)(FP16_MASK_E >> FP16_LSB_E))) {
+			/* normal float16, be_16 = [1:30]*/
+			f16_e = be_16;
+			m_16 = f32_m >> (FP32_LSB_E - FP16_LSB_E);
+			tmsb = FP32_MSB_M - FP16_MSB_M - 1;
+			if ((f32_m & GENMASK_U32(tmsb, 0)) > BIT(tmsb)) {
+				/* round: non-zero truncated bits except MSB */
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tmsb, 0)) == BIT(tmsb)) {
+				/* round: MSB of truncated bits and LSB of m_16 is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if ((be_16 >= -(int)(FP16_MSB_M)) && (be_16 < 1)) {
+			/* underflow: zero / subnormal, be_16 = [-9:0] */
+			f16_e = 0;
+
+			/* add implicit leading zero */
+			m_32 = f32_m | BIT(FP32_LSB_E);
+			tbits = FP32_LSB_E - FP16_LSB_E - be_16 + 1;
+			m_16 = m_32 >> tbits;
+
+			/* if non-leading truncated bits are set */
+			if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+				/* if leading truncated bit is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if (be_16 == -(int)(FP16_MSB_M + 1)) {
+			/* underflow: zero, be_16 = [-10] */
+			f16_e = 0;
+			if (f32_m != 0)
+				f16_m = 1;
+			else
+				f16_m = 0;
+		} else {
+			/* underflow: zero, be_16 = [-INF:-11] */
+			f16_e = 0;
+			f16_m = 0;
+		}
+
+		break;
+	}
+
+	u16 = FP16_PACK(f16_s, f16_e, f16_m);
+
+	return u16;
+}
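+
+/* Illustrative example of the rounding above, assuming IEEE-754
+ * encodings: for x = 1.0f + 0x1p-11f the float32 mantissa is 0x001000,
+ * which equals BIT(tmsb) with tmsb = 12; m_16 ends in 0, so the tie
+ * rounds to even and x maps to 0x3c00 (1.0). For
+ * x = 1.0f + 0x1p-11f + 0x1p-12f the truncated bits exceed BIT(tmsb),
+ * m_16 is incremented and x maps to 0x3c01 (1.0 + 2^-10).
+ */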
+
+__rte_weak int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_float16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a half precision floating point number (float16) into a single precision
+ * floating point number (float32).
+ */
+static float
+__float16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+	uint32_t clz;	   /* count of leading zeroes */
+	int e_16;	   /* float16 exponent unbiased */
+
+	f16_s = (f16 & FP16_MASK_S) >> FP16_LSB_S;
+	f16_e = (f16 & FP16_MASK_E) >> FP16_LSB_E;
+	f16_m = (f16 & FP16_MASK_M) >> FP16_LSB_M;
+
+	f32_s = f16_s;
+	switch (f16_e) {
+	case (FP16_MASK_E >> FP16_LSB_E): /* float16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (f16_m == 0x0) { /* infinity */
+			f32_m = f16_m;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = f16_m;
+			shift = FP32_MSB_M - FP16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* float16: zero or sub-normal */
+		f32_m = f16_m;
+		if (f16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			clz = __builtin_clz((uint32_t)f16_m) - sizeof(uint32_t) * 8 + FP16_LSB_E;
+			e_16 = (int)f16_e - clz;
+			f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+			shift = clz + (FP32_MSB_M - FP16_MSB_M) + 1;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+		}
+		break;
+	default: /* normal numbers */
+		f32_m = f16_m;
+		e_16 = (int)f16_e;
+		f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+		shift = (FP32_MSB_M - FP16_MSB_M);
+		f32_m = (f32_m << shift) & FP32_MASK_M;
+	}
+
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
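
As a sanity check on the subnormal branch above: for input 0x0001, clz = 31 - 32 + 10 = 9, e_16 = -9, f32_e = 127 - 9 - 15 = 103, which encodes 2^(103 - 127) = 2^-24, the smallest float16 subnormal. A sketch of that check, again assuming the prototype is visible via mldev_utils.h:

    #include <assert.h>
    #include <math.h>
    #include <stdint.h>

    int rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output);

    int main(void)
    {
        uint16_t in[2] = {0x3c00, 0x0001}; /* 1.0 and the smallest subnormal */
        float out[2];

        rte_ml_io_float16_to_float32(2, in, out);
        assert(out[0] == 1.0f);
        assert(out[1] == ldexpf(1.0f, -24)); /* ~5.9604645e-8 */
        return 0;
    }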
+
+/* Convert a single precision floating point number (float32) into a
+ * brain float number (bfloat16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_bfloat16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t b16_s;	   /* bfloat16 sign */
+	uint16_t b16_e;	   /* bfloat16 exponent */
+	uint16_t b16_m;	   /* bfloat16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint16_t u16;	   /* float16 output */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	b16_s = f32_s;
+	b16_e = 0;
+	b16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		b16_e = 0;
+		if (f32_m == 0) /* zero */
+			b16_m = 0;
+		else /* subnormal float32 number, normal bfloat16 */
+			goto bf16_normal;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		b16_e = BF16_MASK_E >> BF16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			b16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			b16_m = f32_m >> (FP32_MSB_M - BF16_MSB_M);
+			b16_m |= BIT(BF16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number, normal bfloat16 */
+		goto bf16_normal;
+	}
+
+	goto bf16_pack;
+
+bf16_normal:
+	b16_e = f32_e;
+	tbits = FP32_MSB_M - BF16_MSB_M;
+	b16_m = f32_m >> tbits;
+
+	/* truncated bits are above the halfway point: round up */
+	if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+		b16_m++;
+
+		/* if overflow into exponent */
+		if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+			b16_e++;
+	} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+		/* only the leading truncated bit is set (tie): round to even */
+		if ((b16_m & 0x1) == 0x1) {
+			b16_m++;
+
+			/* if overflow into exponent */
+			if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+				b16_e++;
+		}
+	}
+	b16_m = b16_m & BF16_MASK_M;
+
+bf16_pack:
+	u16 = BF16_PACK(b16_s, b16_e, b16_m);
+
+	return u16;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_bfloat16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
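
For normal, non-NaN inputs, the branch-based rounding above is equivalent to the well-known bias trick on the raw float32 bits. A minimal sketch for comparison (illustration only, not part of the patch; NaNs still need the separate quieting branch shown above):

    #include <stdint.h>
    #include <string.h>

    /* float32 -> bfloat16, round to nearest even; valid for non-NaN inputs */
    static uint16_t f32_to_bf16_rne(float x)
    {
        uint32_t u;

        memcpy(&u, &x, sizeof(u));     /* reinterpret the bits */
        u += 0x7fff + ((u >> 16) & 1); /* rounding bias, ties to even */
        return (uint16_t)(u >> 16);    /* keep the high half */
    }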
+
+/* Convert a brain float number (bfloat16) into a
+ * single precision floating point number (float32).
+ */
+static float
+__bfloat16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t b16_s;	   /* bfloat16 sign */
+	uint16_t b16_e;	   /* bfloat16 exponent */
+	uint16_t b16_m;	   /* bfloat16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint8_t shift;	   /* number of bits to be shifted */
+
+	b16_s = (f16 & BF16_MASK_S) >> BF16_LSB_S;
+	b16_e = (f16 & BF16_MASK_E) >> BF16_LSB_E;
+	b16_m = (f16 & BF16_MASK_M) >> BF16_LSB_M;
+
+	f32_s = b16_s;
+	switch (b16_e) {
+	case (BF16_MASK_E >> BF16_LSB_E): /* bfloat16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (b16_m == 0x0) { /* infinity */
+			f32_m = 0;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = b16_m;
+			shift = FP32_MSB_M - BF16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* bfloat16: zero or subnormal */
+		f32_m = b16_m;
+		if (b16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			goto fp32_normal;
+		}
+		break;
+	default: /* bfloat16: normal number */
+		goto fp32_normal;
+	}
+
+	goto fp32_pack;
+
+fp32_normal:
+	f32_m = b16_m;
+	f32_e = FP32_BIAS_E + b16_e - BF16_BIAS_E;
+
+	shift = (FP32_MSB_M - BF16_MSB_M);
+	f32_m = (f32_m << shift) & FP32_MASK_M;
+
+fp32_pack:
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __bfloat16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
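
Since the bfloat16 fields line up exactly with the top half of the float32 encoding, the widening direction is lossless for every class of value (the subnormal branch above yields the same bits as a plain shift). Apart from the NaN quieting, the function reduces to the sketch below, shown only to make the field alignment explicit:

    #include <stdint.h>
    #include <string.h>

    /* bfloat16 -> float32 is exact: place the 16 bits in the high half */
    static float bf16_to_f32(uint16_t b)
    {
        uint32_t u = (uint32_t)b << 16;
        float f;

        memcpy(&f, &u, sizeof(f));
        return f;
    }
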
diff --git a/lib/mldev/version.map b/lib/mldev/version.map
index 9d06659493..0706b565be 100644
--- a/lib/mldev/version.map
+++ b/lib/mldev/version.map
@@ -52,4 +52,16 @@ INTERNAL {
 	rte_ml_io_type_size_get;
 	rte_ml_io_type_to_str;
 	rte_ml_io_format_to_str;
+	rte_ml_io_float32_to_int8;
+	rte_ml_io_int8_to_float32;
+	rte_ml_io_float32_to_uint8;
+	rte_ml_io_uint8_to_float32;
+	rte_ml_io_float32_to_int16;
+	rte_ml_io_int16_to_float32;
+	rte_ml_io_float32_to_uint16;
+	rte_ml_io_uint16_to_float32;
+	rte_ml_io_float32_to_float16;
+	rte_ml_io_float16_to_float32;
+	rte_ml_io_float32_to_bfloat16;
+	rte_ml_io_bfloat16_to_float32;
 };
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v6 4/4] mldev: add Arm NEON type conversion routines
  2023-02-07 16:00 ` [PATCH v6 0/4] Implementation of ML common code Srikanth Yalavarthi
                     ` (2 preceding siblings ...)
  2023-02-07 16:00   ` [PATCH v6 3/4] mldev: add scalar type conversion functions Srikanth Yalavarthi
@ 2023-02-07 16:00   ` Srikanth Yalavarthi
  2023-03-09 21:37   ` [PATCH v6 0/4] Implementation of ML common code Thomas Monjalon
  4 siblings, 0 replies; 59+ messages in thread
From: Srikanth Yalavarthi @ 2023-02-07 16:00 UTC (permalink / raw)
  To: Srikanth Yalavarthi, Ruifeng Wang
  Cc: dev, sshankarnara, jerinj, aprabhu, ptakkar, pshukla

Added Arm NEON intrinsic-based implementations to support conversion
of data types. Support is enabled to handle int8, uint8, int16, uint16,
float16, float32 and bfloat16 types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v2:
* Dropped use of driver routines to call neon functions
* Optimization of neon functions. Reduce the number of intrinsic calls.

 lib/mldev/meson.build        |   4 +
 lib/mldev/mldev_utils_neon.c | 873 +++++++++++++++++++++++++++++++++++
 2 files changed, 877 insertions(+)
 create mode 100644 lib/mldev/mldev_utils_neon.c

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index fce9c0ebee..05694b0839 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -8,6 +8,10 @@ sources = files(
         'mldev_utils_scalar.c',
 )

+if arch_subdir == 'arm'
+    sources += files('mldev_utils_neon.c')
+endif
+
 headers = files(
         'rte_mldev.h',
 )
diff --git a/lib/mldev/mldev_utils_neon.c b/lib/mldev/mldev_utils_neon.c
new file mode 100644
index 0000000000..32b620db20
--- /dev/null
+++ b/lib/mldev/mldev_utils_neon.c
@@ -0,0 +1,873 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+#include "mldev_utils.h"
+
+#include <arm_neon.h>
+
+/* Description:
+ * This file implements vector versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa. Implementation is based on Arm
+ * NEON intrinsics.
+ */
+
+static inline void
+__float32_to_int8_neon_s8x8(float scale, float *input, int8_t *output)
+{
+	int16x4_t s16x4_l;
+	int16x4_t s16x4_h;
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_l = vqmovn_s32(s32x4);
+
+	/* load next 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_h = vqmovn_s32(s32x4);
+
+	/* combine lower and higher int16x4_t to int16x8_t */
+	s16x8 = vcombine_s16(s16x4_l, s16x4_h);
+
+	/* narrow to int8_t */
+	s8x8 = vqmovn_s16(s16x8);
+
+	/* store 8 elements */
+	vst1_s8(output, s8x8);
+}
+
+static inline void
+__float32_to_int8_neon_s8x1(float scale, float *input, int8_t *output)
+{
+	int32_t s32;
+	int16_t s16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	s16 = vqmovns_s32(s32);
+
+	/* convert to int8_t */
+	*output = vqmovnh_s16(s16);
+}
+
+int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int8_neon_s8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int8_neon_s8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
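+
+A note on the kernels above: the double saturating narrow (int32 -> int16 -> int8) clamps once and stays clamped, so out-of-range inputs land on INT8_MAX / INT8_MIN rather than wrapping. A small check sketch, assuming the prototype is visible via mldev_utils.h and a NEON build:

    #include <assert.h>
    #include <stdint.h>

    int rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output);

    int main(void)
    {
        float in[2] = {300.25f, -1e9f};
        int8_t out[2];

        rte_ml_io_float32_to_int8(1.0f, 2, in, out);
        assert(out[0] == 127);  /* 300 saturates at INT8_MAX */
        assert(out[1] == -128); /* clamps at INT8_MIN */
        return 0;
    }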
+
+static inline void
+__int8_to_float32_neon_f32x8(float scale, int8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 8 x int8_t elements */
+	s8x8 = vld1_s8(input);
+
+	/* widen int8_t to int16_t */
+	s16x8 = vmovl_s8(s8x8);
+
+	/* convert lower 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_low_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_high_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__int8_to_float32_neon_f32x1(float scale, int8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
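
The scale argument is how a caller maps real-valued tensors onto the integer range; a hypothetical helper (not part of this series) deriving symmetric quantize/dequantize scales from a tensor's maximum magnitude could look like:

    #include <math.h>
    #include <stdint.h>

    /* hypothetical helper; assumes the tensor is not all zeros */
    static void pick_int8_scales(const float *data, uint64_t n,
                                 float *q_scale, float *dq_scale)
    {
        float max_abs = 0.0f;
        uint64_t i;

        for (i = 0; i < n; i++)
            max_abs = fmaxf(max_abs, fabsf(data[i]));

        *q_scale = 127.0f / max_abs;  /* for rte_ml_io_float32_to_int8() */
        *dq_scale = max_abs / 127.0f; /* for rte_ml_io_int8_to_float32() */
    }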
+
+static inline void
+__float32_to_uint8_neon_u8x8(float scale, float *input, uint8_t *output)
+{
+	uint16x4_t u16x4_l;
+	uint16x4_t u16x4_h;
+	float32x4_t f32x4;
+	uint32x4_t u32x4;
+	uint16x8_t u16x8;
+	uint8x8_t u8x8;
+
+	/* load 4 float elements, scale, convert, saturate narrow to uint16_t.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_l = vqmovn_u32(u32x4);
+
+	/* load next 4 float elements, scale, convert, saturate narrow to uint16_t.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_h = vqmovn_u32(u32x4);
+
+	/* combine lower and higher uint16x4_t */
+	u16x8 = vcombine_u16(u16x4_l, u16x4_h);
+
+	/* narrow to uint8x8_t */
+	u8x8 = vqmovn_u16(u16x8);
+
+	/* store 8 elements */
+	vst1_u8(output, u8x8);
+}
+
+static inline void
+__float32_to_uint8_neon_u8x1(float scale, float *input, uint8_t *output)
+{
+	uint32_t u32;
+	uint16_t u16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	u16 = vqmovns_u32(u32);
+
+	/* convert to uint8_t */
+	*output = vqmovnh_u16(u16);
+}
+
+int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint8_neon_u8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint8_neon_u8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
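
The unsigned path relies on FCVTAU saturating at both ends, so negative inputs clamp to zero instead of converting modulo 2^32. A quick check under the same assumptions as the int8 sketch above:

    #include <assert.h>
    #include <stdint.h>

    int rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output);

    int main(void)
    {
        float in[2] = {-3.5f, 300.0f};
        uint8_t out[2];

        rte_ml_io_float32_to_uint8(1.0f, 2, in, out);
        assert(out[0] == 0);   /* negative input clamps to 0 */
        assert(out[1] == 255); /* saturates at UINT8_MAX */
        return 0;
    }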
+
+static inline void
+__uint8_to_float32_neon_f32x8(float scale, uint8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x8_t u16x8;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+	uint8x8_t u8x8;
+
+	/* load 8 x uint8_t elements */
+	u8x8 = vld1_u8(input);
+
+	/* widen uint8_t to uint16_t */
+	u16x8 = vmovl_u8(u8x8);
+
+	/* convert lower 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_low_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_high_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__uint8_to_float32_neon_f32x1(float scale, uint8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint64_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_int16_neon_s16x4(float scale, float *input, int16_t *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert to int32x4_t using round to nearest with ties away rounding mode */
+	s32x4 = vcvtaq_s32_f32(f32x4);
+
+	/* saturate narrow to int16x4_t */
+	s16x4 = vqmovn_s32(s32x4);
+
+	/* store 4 elements */
+	vst1_s16(output, s16x4);
+}
+
+static inline void
+__float32_to_int16_neon_s16x1(float scale, float *input, int16_t *output)
+{
+	int32_t s32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_s32(s32);
+}
+
+int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int16_neon_s16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int16_neon_s16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__int16_to_float32_neon_f32x4(float scale, int16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x int16_t elements */
+	s16x4 = vld1_s16(input);
+
+	/* widen int16_t to int32_t */
+	s32x4 = vmovl_s16(s16x4);
+
+	/* convert int32_t to float */
+	f32x4 = vcvtq_f32_s32(s32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__int16_to_float32_neon_f32x1(float scale, int16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_uint16_neon_u16x4(float scale, float *input, uint16_t *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert using round to nearest with ties away rounding mode */
+	u32x4 = vcvtaq_u32_f32(f32x4);
+
+	/* saturate narrow */
+	u16x4 = vqmovn_u32(u32x4);
+
+	/* store 4 elements */
+	vst1_u16(output, u16x4);
+}
+
+static inline void
+__float32_to_uint16_neon_u16x1(float scale, float *input, uint16_t *output)
+{
+	uint32_t u32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_u32(u32);
+}
+
+int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint64_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint16_neon_u16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint16_neon_u16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint16_to_float32_neon_f32x4(float scale, uint16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 x uint16_t elements */
+	u16x4 = vld1_u16(input);
+
+	/* widen uint16_t to uint32_t */
+	u32x4 = vmovl_u16(u16x4);
+
+	/* convert uint32_t to float */
+	f32x4 = vcvtq_f32_u32(u32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__uint16_to_float32_neon_f32x1(float scale, uint16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_float16_neon_f16x4(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert to float16x4_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store float16x4_t */
+	vst1_f16(output, f16x4);
+}
+
+static inline void
+__float32_to_float16_neon_f16x1(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to float16_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store lane 0 (one element) */
+	vst1_lane_f16(output, f16x4, 0);
+}
+
+int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	float16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (float16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_float16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_float16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float16_to_float32_neon_f32x4(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x float16_t elements */
+	f16x4 = vld1_f16(input);
+
+	/* convert float16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__float16_to_float32_neon_f32x1(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	f16x4 = vld1_dup_f16(input);
+
+	/* convert float16_t to float32_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	float16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
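
Because every binary16 value is exactly representable in binary32, widening with this routine and then narrowing again must reproduce the original bits (NaN payloads aside). A round-trip sketch under that assumption:

    #include <assert.h>
    #include <stdint.h>

    int rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output);
    int rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output);

    int main(void)
    {
        uint16_t in = 0x3555; /* ~0.3333 in binary16 */
        uint16_t back;
        float mid;

        rte_ml_io_float16_to_float32(1, &in, &mid);
        rte_ml_io_float32_to_float16(1, &mid, &back);
        assert(back == in);
        return 0;
    }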
+
+#ifdef __ARM_FEATURE_BF16
+
+static inline void
+__float32_to_bfloat16_neon_f16x4(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert float32x4_t to bfloat16x4_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store bfloat16x4_t */
+	vst1_bf16(output, bf16x4);
+}
+
+static inline void
+__float32_to_bfloat16_neon_f16x1(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to bfloat16_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store lane 0 (one element) */
+	vst1_lane_bf16(output, bf16x4, 0);
+}
+
+int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	bfloat16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (bfloat16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_bfloat16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_bfloat16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x4(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x bfloat16_t elements */
+	bf16x4 = vld1_bf16(input);
+
+	/* convert bfloat16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x1(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	bf16x4 = vld1_dup_bf16(input);
+
+	/* convert bfloat16_t to float32_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store lane 0 (one element) */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	bfloat16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (bfloat16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__bfloat16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__bfloat16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#endif /* __ARM_FEATURE_BF16 */
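
These definitions carry no __rte_weak attribute, so on an Arm build the linker picks them over the weak scalar fallbacks shown earlier in the thread; when __ARM_FEATURE_BF16 is not defined, only the bfloat16 pair falls back to the scalar code. A minimal illustration of the override mechanism, assuming GCC/Clang weak-symbol semantics:

    /* scalar.c: generic fallback, can be overridden */
    __attribute__((weak)) int convert(void) { return 0; }

    /* neon.c: compiled only on capable targets; the strong
     * definition replaces the weak one at link time */
    int convert(void) { return 1; }
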
--
2.17.1


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 1/4] mldev: add headers for internal ML functions
  2023-02-07 16:00   ` [PATCH v6 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
@ 2023-03-09 20:44     ` Thomas Monjalon
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Monjalon @ 2023-03-09 20:44 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu, ptakkar, pshukla

07/02/2023 17:00, Srikanth Yalavarthi:
> Added header files for internal ML utility routines to convert
> IO type and format to string, IO type to size and routines to
> convert data types.
> 
> Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
> ---
> Depends-on: series-26858 ("Implementation of mldev test application")

I'm not sure it really depends on the test application.

>  doc/guides/rel_notes/release_23_03.rst |   5 +
>  lib/mldev/meson.build                  |   2 +
>  lib/mldev/mldev_utils.c                |   5 +
>  lib/mldev/mldev_utils.h                | 345 +++++++++++++++++++++++++

Instead of just adding the header file,
it would make more sense to add the scalar implementation at the same time,
and to split IO type handling and data conversion into different patches.

> --- a/doc/guides/rel_notes/release_23_03.rst
> +++ b/doc/guides/rel_notes/release_23_03.rst
> +* **Added common driver functions for machine learning device library.**
> +
> +  * Added functions to translate IO type and format to string.
> +  * Added functions to quantize and dequantize inference IO data.

I don't think this deserves to be in the release notes.




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 0/4] Implementation of ML common code
  2023-02-07 16:00 ` [PATCH v6 0/4] Implementation of ML common code Srikanth Yalavarthi
                     ` (3 preceding siblings ...)
  2023-02-07 16:00   ` [PATCH v6 4/4] mldev: add Arm NEON type conversion routines Srikanth Yalavarthi
@ 2023-03-09 21:37   ` Thomas Monjalon
  4 siblings, 0 replies; 59+ messages in thread
From: Thomas Monjalon @ 2023-03-09 21:37 UTC (permalink / raw)
  To: Srikanth Yalavarthi; +Cc: dev, sshankarnara, jerinj, aprabhu, ptakkar, pshukla

07/02/2023 17:00, Srikanth Yalavarthi:
> Machine Learning common code
> ----------------------------
> 
> This patch series implements the common ML code that can be used by
> ML drivers. Common code include functions to convert ML IO type to
> string, IO format type to string, function get size of ML IO type,
> and functions for converting data types from higher precision to
> lower precision and vice-versa.
> 
> Data type conversion functions support handling float32, float16,
> bfloat16, uint8, int8, uint16 and int16. Two versions of conversion
> functions are implemented in the series, generic scalar version and
> vector version using Arm NEON intrinsics. When compiling DPDK for
> platform supporting Arm NEON, vector NEON version of the routines would
> be enabled. Compilation would fall back to generic scalar versions on
> platform like x86_64 / PowerPC etc., that don't support Arm NEON.
> 
> Srikanth Yalavarthi (4):
>   mldev: add headers for internal ML functions
>   mldev: implement ML IO type handling functions
>   mldev: add scalar type conversion functions
>   mldev: add Arm NEON type conversion routines

Applied with a bit of split & squashing, thanks.



^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2023-03-09 21:37 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-08 19:35 [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
2022-12-08 19:35 ` [PATCH v1 1/4] common/ml: add initial files for " Srikanth Yalavarthi
2022-12-08 19:35 ` [PATCH v1 2/4] common/ml: add data type conversion routines Srikanth Yalavarthi
2022-12-08 19:35 ` [PATCH v1 3/4] common/ml: add generic type conversion functions Srikanth Yalavarthi
2022-12-08 19:35 ` [PATCH v1 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
2022-12-12  7:16   ` Ruifeng Wang
2022-12-12 17:25     ` Srikanth Yalavarthi
2022-12-12 17:21 ` [PATCH v1 0/4] implementation of ML common code Srikanth Yalavarthi
2022-12-12 17:21   ` [PATCH v2 1/4] common/ml: add initial files for " Srikanth Yalavarthi
2022-12-12 17:21   ` [PATCH v2 2/4] common/ml: add common utility functions Srikanth Yalavarthi
2022-12-12 17:21   ` [PATCH v2 3/4] common/ml: add scalar type conversion functions Srikanth Yalavarthi
2022-12-12 17:21   ` [PATCH v2 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
2022-12-13  9:04     ` Ruifeng Wang
2022-12-20 17:52   ` [PATCH v3 0/4] implementation of ML common code Srikanth Yalavarthi
2022-12-20 17:52     ` [PATCH v3 1/4] common/ml: add initial files for " Srikanth Yalavarthi
2022-12-20 19:04       ` Stephen Hemminger
2022-12-20 19:19         ` [EXT] " Srikanth Yalavarthi
2022-12-20 17:52     ` [PATCH v3 2/4] common/ml: add common utility functions Srikanth Yalavarthi
2022-12-20 17:52     ` [PATCH v3 3/4] common/ml: add scalar type conversion functions Srikanth Yalavarthi
2022-12-20 17:52     ` [PATCH v3 4/4] common/ml: add Arm NEON type conversion routines Srikanth Yalavarthi
2022-12-21  3:08       ` Ruifeng Wang
2022-12-20 19:06     ` [PATCH v3 0/4] implementation of ML common code Stephen Hemminger
2022-12-20 19:17       ` [EXT] " Srikanth Yalavarthi
2023-01-25 13:18     ` Thomas Monjalon
2023-01-25 13:25       ` [EXT] " Srikanth Yalavarthi
2023-01-25 13:55         ` Thomas Monjalon
2023-01-25 14:59           ` Srikanth Yalavarthi
2023-01-26 10:57             ` Thomas Monjalon
2023-01-27  6:40               ` Jerin Jacob
2023-01-27  8:50                 ` Thomas Monjalon
2023-01-27  9:02                   ` Jerin Jacob
2023-01-27  9:26                     ` Thomas Monjalon
2023-01-27 10:28                       ` Jerin Jacob
2023-01-31 13:44                         ` Srikanth Yalavarthi
2023-02-01  9:15                           ` Srikanth Yalavarthi
2023-02-01  9:04 ` [PATCH v4 0/4] Implementation " Srikanth Yalavarthi
2023-02-01  9:04   ` [PATCH v4 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
2023-02-01 13:54     ` Anup Prabhu
2023-02-01 15:28       ` Thomas Monjalon
2023-02-01  9:04   ` [PATCH v4 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
2023-02-01 13:53     ` Anup Prabhu
2023-02-01 14:01     ` Anup Prabhu
2023-02-01 14:15     ` Anup Prabhu
2023-02-01 14:26     ` Anup Prabhu
2023-02-01  9:04   ` [PATCH v4 3/4] mldev: add scalar type conversion functions Srikanth Yalavarthi
2023-02-01  9:04   ` [PATCH v4 4/4] mldev: add Arm NEON type conversion routines Srikanth Yalavarthi
2023-02-01  9:12 ` [PATCH v5 0/4] Implementation of ML common code Srikanth Yalavarthi
2023-02-01  9:12   ` [PATCH v5 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
2023-02-01  9:12   ` [PATCH v5 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
2023-02-02  4:20     ` Anup Prabhu
2023-02-01  9:12   ` [PATCH v5 3/4] mldev: add scalar type conversion functions Srikanth Yalavarthi
2023-02-01  9:12   ` [PATCH v5 4/4] mldev: add Arm NEON type conversion routines Srikanth Yalavarthi
2023-02-07 16:00 ` [PATCH v6 0/4] Implementation of ML common code Srikanth Yalavarthi
2023-02-07 16:00   ` [PATCH v6 1/4] mldev: add headers for internal ML functions Srikanth Yalavarthi
2023-03-09 20:44     ` Thomas Monjalon
2023-02-07 16:00   ` [PATCH v6 2/4] mldev: implement ML IO type handling functions Srikanth Yalavarthi
2023-02-07 16:00   ` [PATCH v6 3/4] mldev: add scalar type conversion functions Srikanth Yalavarthi
2023-02-07 16:00   ` [PATCH v6 4/4] mldev: add Arm NEON type conversion routines Srikanth Yalavarthi
2023-03-09 21:37   ` [PATCH v6 0/4] Implementation of ML common code Thomas Monjalon
