From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id D06C043C88; Mon, 11 Mar 2024 15:47:42 +0100 (CET) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 108E540A67; Mon, 11 Mar 2024 15:47:33 +0100 (CET) Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by mails.dpdk.org (Postfix) with ESMTP id F3197402DC for ; Mon, 11 Mar 2024 15:47:29 +0100 (CET) Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 37AC71007; Mon, 11 Mar 2024 07:48:06 -0700 (PDT) Received: from ampere-altra-2-1.usa.Arm.com (ampere-altra-2-1.usa.arm.com [10.118.91.158]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id 27A9F3F844; Mon, 11 Mar 2024 07:47:29 -0700 (PDT) From: Paul Szczepanek To: dev@dpdk.org Cc: bruce.richardson@intel.com, Paul Szczepanek , Honnappa Nagarahalli , Kamalakshitha Aligeri , Nathan Brown Subject: [PATCH v9 2/5] ptr_compress: add pointer compression library Date: Mon, 11 Mar 2024 14:47:03 +0000 Message-Id: <20240311144706.204831-3-paul.szczepanek@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20240311144706.204831-1-paul.szczepanek@arm.com> References: <20230927150854.3670391-2-paul.szczepanek@arm.com> <20240311144706.204831-1-paul.szczepanek@arm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Add a new utility header for compressing pointers. The provided functions can store pointers in 32-bit or 16-bit offsets. The compression takes advantage of the fact that pointers are usually located in a limited memory region (like a mempool). We can compress them by converting them to offsets from a base memory address. Offsets can be stored in fewer bytes (dictated by the memory region size and alignment of the pointer). For example: an 8 byte aligned pointer which is part of a 32GB memory pool can be stored in 4 bytes. Suggested-by: Honnappa Nagarahalli Signed-off-by: Paul Szczepanek Signed-off-by: Kamalakshitha Aligeri Reviewed-by: Honnappa Nagarahalli Reviewed-by: Nathan Brown --- MAINTAINERS | 4 + doc/api/doxy-api-index.md | 1 + doc/api/doxy-api.conf.in | 1 + doc/guides/rel_notes/release_24_03.rst | 6 + lib/meson.build | 1 + lib/ptr_compress/meson.build | 4 + lib/ptr_compress/rte_ptr_compress.h | 266 +++++++++++++++++++++++++ 7 files changed, 283 insertions(+) create mode 100644 lib/ptr_compress/meson.build create mode 100644 lib/ptr_compress/rte_ptr_compress.h diff --git a/MAINTAINERS b/MAINTAINERS index 4755a68274..6f703b1b13 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1685,6 +1685,10 @@ M: Chenbo Xia M: Gaetan Rivet F: lib/pci/ +Pointer Compression +M: Paul Szczepanek +F: lib/ptr_compress/ + Power management M: Anatoly Burakov M: David Hunt diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md index 8c1eb8fafa..f9283154f8 100644 --- a/doc/api/doxy-api-index.md +++ b/doc/api/doxy-api-index.md @@ -222,6 +222,7 @@ The public API headers are grouped by topics: [config file](@ref rte_cfgfile.h), [key/value args](@ref rte_kvargs.h), [argument parsing](@ref rte_argparse.h), + [ptr_compress](@ref rte_ptr_compress.h), [string](@ref rte_string_fns.h), [thread](@ref rte_thread.h) diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in index 27afec8b3b..a8823c046f 100644 --- a/doc/api/doxy-api.conf.in +++ b/doc/api/doxy-api.conf.in @@ -71,6 +71,7 @@ INPUT = @TOPDIR@/doc/api/doxy-api-index.md \ @TOPDIR@/lib/pipeline \ @TOPDIR@/lib/port \ @TOPDIR@/lib/power \ + @TOPDIR@/lib/ptr_compress \ @TOPDIR@/lib/rawdev \ @TOPDIR@/lib/rcu \ @TOPDIR@/lib/regexdev \ diff --git a/doc/guides/rel_notes/release_24_03.rst b/doc/guides/rel_notes/release_24_03.rst index 932688ca4d..b82b8c5c0b 100644 --- a/doc/guides/rel_notes/release_24_03.rst +++ b/doc/guides/rel_notes/release_24_03.rst @@ -176,6 +176,12 @@ New Features * Added power-saving during polling within the ``rte_event_dequeue_burst()`` API. * Added support for DMA adapter. +* **Introduced pointer compression library.** + + Library provides functions to compress and decompress arrays of pointers + which can improve application performance under certain conditions. + Performance test was added to help users evaluate performance on their setup. + Removed Items ------------- diff --git a/lib/meson.build b/lib/meson.build index e4e31f7ecf..fe43d137d7 100644 --- a/lib/meson.build +++ b/lib/meson.build @@ -14,6 +14,7 @@ libraries = [ 'argparse', 'telemetry', # basic info querying 'eal', # everything depends on eal + 'ptr_compress', 'ring', 'rcu', # rcu depends on ring 'mempool', diff --git a/lib/ptr_compress/meson.build b/lib/ptr_compress/meson.build new file mode 100644 index 0000000000..e92706a45f --- /dev/null +++ b/lib/ptr_compress/meson.build @@ -0,0 +1,4 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(c) 2024 Arm Limited + +headers = files('rte_ptr_compress.h') diff --git a/lib/ptr_compress/rte_ptr_compress.h b/lib/ptr_compress/rte_ptr_compress.h new file mode 100644 index 0000000000..97c084003d --- /dev/null +++ b/lib/ptr_compress/rte_ptr_compress.h @@ -0,0 +1,266 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2024 Arm Limited + */ + +#ifndef RTE_PTR_COMPRESS_H +#define RTE_PTR_COMPRESS_H + +/** + * @file + * Pointer compression and decompression functions. + * + * When passing arrays full of pointers between threads, memory containing + * the pointers is copied multiple times which is especially costly between + * cores. These functions allow us to compress the pointers. + * + * Compression takes advantage of the fact that pointers are usually located in + * a limited memory region (like a mempool). We compress them by converting them + * to offsets from a base memory address. Offsets can be stored in fewer bytes. + * + * The compression functions come in two varieties: 32-bit and 16-bit. + * + * To determine how many bits are needed to compress the pointer calculate + * the biggest offset possible (highest value pointer - base pointer) + * and shift the value right according to alignment (shift by exponent of the + * power of 2 of alignment: aligned by 4 - shift by 2, aligned by 8 - shift by + * 3, etc.). The resulting value must fit in either 32 or 16 bits. + * + * For usage example and further explanation please see "Pointer Compression" in + * doc/guides/prog_guide/ptr_compress_lib.rst + */ + +#include +#include + +#include +#include +#include +#include + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * Compress pointers into 32-bit offsets from base pointer. + * + * @note It is programmer's responsibility to ensure the resulting offsets fit + * into 32 bits. Alignment of the structures pointed to by the pointers allows + * us to drop bits from the offsets. This is controlled by the bit_shift + * parameter. This means that if structures are aligned by 8 bytes they must be + * within 32GB of the base pointer. If there is no such alignment guarantee they + * must be within 4GB. + * + * @param ptr_base + * A pointer used to calculate offsets of pointers in src_table. + * @param src_table + * A pointer to an array of pointers. + * @param dest_table + * A pointer to an array of compressed pointers returned by this function. + * @param n + * The number of objects to compress, must be strictly positive. + * @param bit_shift + * Byte alignment of memory pointed to by the pointers allows for + * bits to be dropped from the offset and hence widen the memory region that + * can be covered. This controls how many bits are right shifted. + **/ +static __rte_always_inline void +rte_ptr_compress_32(void *ptr_base, void **src_table, + uint32_t *dest_table, size_t n, uint8_t bit_shift) +{ + unsigned int i = 0; +#if defined RTE_HAS_SVE_ACLE && !defined RTE_ARCH_ARMv8_AARCH32 + svuint64_t v_ptr_table; + svbool_t pg = svwhilelt_b64(i, n); + do { + v_ptr_table = svld1_u64(pg, (uint64_t *)src_table + i); + v_ptr_table = svsub_x(pg, v_ptr_table, (uint64_t)ptr_base); + v_ptr_table = svlsr_x(pg, v_ptr_table, bit_shift); + svst1w(pg, &dest_table[i], v_ptr_table); + i += svcntd(); + pg = svwhilelt_b64(i, n); + } while (i < n); +#elif defined __ARM_NEON && !defined RTE_ARCH_ARMv8_AARCH32 + uint64_t ptr_diff; + uint64x2_t v_ptr_table; + /* right shift is done by left shifting by negative int */ + int64x2_t v_shift = vdupq_n_s64(-bit_shift); + uint64x2_t v_ptr_base = vdupq_n_u64((uint64_t)ptr_base); + for (; i < (n & ~0x1); i += 2) { + v_ptr_table = vld1q_u64((const uint64_t *)src_table + i); + v_ptr_table = vsubq_u64(v_ptr_table, v_ptr_base); + v_ptr_table = vshlq_u64(v_ptr_table, v_shift); + vst1_u32(dest_table + i, vqmovn_u64(v_ptr_table)); + } + /* process leftover single item in case of odd number of n */ + if (unlikely(n & 0x1)) { + ptr_diff = RTE_PTR_DIFF(src_table[i], ptr_base); + dest_table[i] = (uint32_t) (ptr_diff >> bit_shift); + } +#else + uintptr_t ptr_diff; + for (; i < n; i++) { + ptr_diff = RTE_PTR_DIFF(src_table[i], ptr_base); + ptr_diff = ptr_diff >> bit_shift; + RTE_ASSERT(ptr_diff <= UINT32_MAX); + dest_table[i] = (uint32_t) ptr_diff; + } +#endif +} + +/** + * Decompress pointers from 32-bit offsets from base pointer. + * + * @param ptr_base + * A pointer which was used to calculate offsets in src_table. + * @param src_table + * A pointer to an array to compressed pointers. + * @param dest_table + * A pointer to an array of decompressed pointers returned by this function. + * @param n + * The number of objects to decompress, must be strictly positive. + * @param bit_shift + * Byte alignment of memory pointed to by the pointers allows for + * bits to be dropped from the offset and hence widen the memory region that + * can be covered. This controls how many bits are left shifted when pointers + * are recovered from the offsets. + **/ +static __rte_always_inline void +rte_ptr_decompress_32(void *ptr_base, uint32_t *src_table, + void **dest_table, size_t n, uint8_t bit_shift) +{ + unsigned int i = 0; +#if defined RTE_HAS_SVE_ACLE && !defined RTE_ARCH_ARMv8_AARCH32 + svuint64_t v_ptr_table; + svbool_t pg = svwhilelt_b64(i, n); + do { + v_ptr_table = svld1uw_u64(pg, &src_table[i]); + v_ptr_table = svlsl_x(pg, v_ptr_table, bit_shift); + v_ptr_table = svadd_x(pg, v_ptr_table, (uint64_t)ptr_base); + svst1(pg, (uint64_t *)dest_table + i, v_ptr_table); + i += svcntd(); + pg = svwhilelt_b64(i, n); + } while (i < n); +#elif defined __ARM_NEON && !defined RTE_ARCH_ARMv8_AARCH32 + uint64_t ptr_diff; + uint64x2_t v_ptr_table; + int64x2_t v_shift = vdupq_n_s64(bit_shift); + uint64x2_t v_ptr_base = vdupq_n_u64((uint64_t)ptr_base); + for (; i < (n & ~0x1); i += 2) { + v_ptr_table = vmovl_u32(vld1_u32(src_table + i)); + v_ptr_table = vshlq_u64(v_ptr_table, v_shift); + v_ptr_table = vaddq_u64(v_ptr_table, v_ptr_base); + vst1q_u64((uint64_t *)dest_table + i, v_ptr_table); + } + /* process leftover single item in case of odd number of n */ + if (unlikely(n & 0x1)) { + ptr_diff = ((uint64_t) src_table[i]) << bit_shift; + dest_table[i] = RTE_PTR_ADD(ptr_base, ptr_diff); + } +#else + uintptr_t ptr_diff; + for (; i < n; i++) { + ptr_diff = ((uintptr_t) src_table[i]) << bit_shift; + dest_table[i] = RTE_PTR_ADD(ptr_base, ptr_diff); + } +#endif +} + +/** + * Compress pointers into 16-bit offsets from base pointer. + * + * @note It is programmer's responsibility to ensure the resulting offsets fit + * into 16 bits. Alignment of the structures pointed to by the pointers allows + * us to drop bits from the offsets. This is controlled by the bit_shift + * parameter. This means that if structures are aligned by 8 bytes they must be + * within 256KB of the base pointer. If there is no such alignment guarantee + * they must be within 64KB. + * + * @param ptr_base + * A pointer used to calculate offsets of pointers in src_table. + * @param src_table + * A pointer to an array of pointers. + * @param dest_table + * A pointer to an array of compressed pointers returned by this function. + * @param n + * The number of objects to compress, must be strictly positive. + * @param bit_shift + * Byte alignment of memory pointed to by the pointers allows for + * bits to be dropped from the offset and hence widen the memory region that + * can be covered. This controls how many bits are right shifted. + **/ +static __rte_always_inline void +rte_ptr_compress_16(void *ptr_base, void **src_table, + uint16_t *dest_table, size_t n, uint8_t bit_shift) +{ + + unsigned int i = 0; +#if defined RTE_HAS_SVE_ACLE && !defined RTE_ARCH_ARMv8_AARCH32 + svuint64_t v_ptr_table; + svbool_t pg = svwhilelt_b64(i, n); + do { + v_ptr_table = svld1_u64(pg, (uint64_t *)src_table + i); + v_ptr_table = svsub_x(pg, v_ptr_table, (uint64_t)ptr_base); + v_ptr_table = svlsr_x(pg, v_ptr_table, bit_shift); + svst1h(pg, &dest_table[i], v_ptr_table); + i += svcntd(); + pg = svwhilelt_b64(i, n); + } while (i < n); +#else + uintptr_t ptr_diff; + for (; i < n; i++) { + ptr_diff = RTE_PTR_DIFF(src_table[i], ptr_base); + ptr_diff = ptr_diff >> bit_shift; + RTE_ASSERT(ptr_diff <= UINT16_MAX); + dest_table[i] = (uint16_t) ptr_diff; + } +#endif +} + +/** + * Decompress pointers from 16-bit offsets from base pointer. + * + * @param ptr_base + * A pointer which was used to calculate offsets in src_table. + * @param src_table + * A pointer to an array to compressed pointers. + * @param dest_table + * A pointer to an array of decompressed pointers returned by this function. + * @param n + * The number of objects to decompress, must be strictly positive. + * @param bit_shift + * Byte alignment of memory pointed to by the pointers allows for + * bits to be dropped from the offset and hence widen the memory region that + * can be covered. This controls how many bits are left shifted when pointers + * are recovered from the offsets. + **/ +static __rte_always_inline void +rte_ptr_decompress_16(void *ptr_base, uint16_t *src_table, + void **dest_table, size_t n, uint8_t bit_shift) +{ + unsigned int i = 0; +#if defined RTE_HAS_SVE_ACLE && !defined RTE_ARCH_ARMv8_AARCH32 + svuint64_t v_ptr_table; + svbool_t pg = svwhilelt_b64(i, n); + do { + v_ptr_table = svld1uh_u64(pg, &src_table[i]); + v_ptr_table = svlsl_x(pg, v_ptr_table, bit_shift); + v_ptr_table = svadd_x(pg, v_ptr_table, (uint64_t)ptr_base); + svst1(pg, (uint64_t *)dest_table + i, v_ptr_table); + i += svcntd(); + pg = svwhilelt_b64(i, n); + } while (i < n); +#else + uintptr_t ptr_diff; + for (; i < n; i++) { + ptr_diff = ((uintptr_t) src_table[i]) << bit_shift; + dest_table[i] = RTE_PTR_ADD(ptr_base, ptr_diff); + } +#endif +} + +#ifdef __cplusplus +} +#endif + +#endif /* RTE_PTR_COMPRESS_H */ -- 2.25.1