From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from we2-f167.wedos.net (w-smtp-out-7.wedos.net [46.28.106.5]) by dpdk.org (Postfix) with ESMTP id 73C2F8E5D for ; Sun, 11 Oct 2015 23:17:28 +0200 (CEST) Received: from ([109.81.211.153]) by we2-f167.wedos.net (WEDOS Mail Server mail2) with ASMTP (SSL) id ZLD00023; Sun, 11 Oct 2015 23:17:23 +0200 Date: Sun, 11 Oct 2015 23:17:13 +0200 From: Jan Viktorin To: David Hunt Message-ID: <20151011231713.7bd22aac@jvn> In-Reply-To: <1443822602-8648-2-git-send-email-david.hunt@intel.com> References: <1443822602-8648-2-git-send-email-david.hunt@intel.com> Organization: RehiveTech X-Mailer: Claws Mail 3.12.0 (GTK+ 2.24.28; x86_64-unknown-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: dev@dpdk.org, Vlastimil Kosar Subject: Re: [dpdk-dev] lib: added support for armv7 architecture X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 11 Oct 2015 21:17:28 -0000 Hello David, all, I am reviewing your patch for things to be included in the final ARMv7 support... It is quite long, you know. We should probably break the potential discussion in shorter e-mails, a thread per topic. See my comments below. Regards Jan On Fri, 2 Oct 2015 22:50:02 +0100 David Hunt wrote: > From: Amruta Zende > > Signed-off-by: Amruta Zende > Signed-off-by: David Hunt > > --- > MAINTAINERS | 5 + > config/defconfig_arm-native-linuxapp-gcc | 56 ++++ > .../common/include/arch/arm/rte_atomic.h | 269 ++++++++++++++++++++ > .../common/include/arch/arm/rte_byteorder.h | 146 +++++++++++ > .../common/include/arch/arm/rte_cpuflags.h | 138 ++++++++++ > .../common/include/arch/arm/rte_cycles.h | 77 ++++++ > .../common/include/arch/arm/rte_memcpy.h | 101 ++++++++ > .../common/include/arch/arm/rte_prefetch.h | 64 +++++ > .../common/include/arch/arm/rte_rwlock.h | 70 +++++ > .../common/include/arch/arm/rte_spinlock.h | 116 +++++++++ I do not know yet, how different is the ARMv8 support, how it integrates with the ARMv7 part... Will we have a single directory "arm"? 
> lib/librte_eal/common/include/arch/arm/rte_vect.h | 37 +++ > lib/librte_eal/linuxapp/Makefile | 3 + > lib/librte_eal/linuxapp/arm_pmu/Makefile | 52 ++++ > lib/librte_eal/linuxapp/arm_pmu/rte_enable_pmu.c | 83 ++++++ > mk/arch/arm/rte.vars.mk | 58 +++++ > mk/machine/armv7-a/rte.vars.mk | 63 +++++ > mk/toolchain/gcc/rte.vars.mk | 8 +- > 17 files changed, 1343 insertions(+), 3 deletions(-) > create mode 100644 config/defconfig_arm-native-linuxapp-gcc > create mode 100644 lib/librte_eal/common/include/arch/arm/rte_atomic.h > create mode 100644 lib/librte_eal/common/include/arch/arm/rte_byteorder.h > create mode 100644 lib/librte_eal/common/include/arch/arm/rte_cpuflags.h > create mode 100644 lib/librte_eal/common/include/arch/arm/rte_cycles.h > create mode 100644 lib/librte_eal/common/include/arch/arm/rte_memcpy.h > create mode 100644 lib/librte_eal/common/include/arch/arm/rte_prefetch.h > create mode 100644 lib/librte_eal/common/include/arch/arm/rte_rwlock.h > create mode 100644 lib/librte_eal/common/include/arch/arm/rte_spinlock.h > create mode 100644 lib/librte_eal/common/include/arch/arm/rte_vect.h > create mode 100755 lib/librte_eal/linuxapp/arm_pmu/Makefile > create mode 100644 lib/librte_eal/linuxapp/arm_pmu/rte_enable_pmu.c > create mode 100644 mk/arch/arm/rte.vars.mk > create mode 100644 mk/machine/armv7-a/rte.vars.mk > > diff --git a/MAINTAINERS b/MAINTAINERS > index 080a8e8..9d99d53 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -124,6 +124,11 @@ IBM POWER > M: Chao Zhu > F: lib/librte_eal/common/include/arch/ppc_64/ > > +Arm V7 > +M: Amrute Zende > +M: David Hunt > +F: lib/librte_eal/common/include/arch/arm/ We were discussing the ARM maintainer with Thomas Monjalon. As it seems Dave is not going to do the maintenance long-term, I can take care of the ARM maintenance as well. However, I will not have a proper ARMv8 machine available for some time. > + > Intel x86 > M: Bruce Richardson > M: Konstantin Ananyev > diff --git a/config/defconfig_arm-native-linuxapp-gcc b/config/defconfig_arm-native-linuxapp-gcc This is interesting. Should we have a configuration for "arm" or better for "armv7" or "armv7-a" or even more specifically for "cortex-a9/15/..."? > new file mode 100644 > index 0000000..159aa36 > --- /dev/null > +++ b/config/defconfig_arm-native-linuxapp-gcc > @@ -0,0 +1,56 @@ > +# BSD LICENSE > +# > +# Copyright(c) 2015 Intel Corporation. All rights reserved. > +# > [...] > +# > + > +#include "common_linuxapp" > + > +CONFIG_RTE_MACHINE="armv7-a" > + > +CONFIG_RTE_ARCH="arm" > +CONFIG_RTE_ARCH_ARM32=y > +CONFIG_RTE_ARCH_32=y > + > +CONFIG_RTE_TOOLCHAIN="gcc" > +CONFIG_RTE_TOOLCHAIN_GCC=y > + > +CONFIG_RTE_FORCE_INTRINSICS=y > +CONFIG_RTE_LIBRTE_VHOST=n > +CONFIG_RTE_LIBRTE_KNI=n > +CONFIG_RTE_KNI_KMOD=n I could see a note somewhere that KNI is for 64-bit systems only. Am I right? Would it be possible to have KNI for ARM? > +CONFIG_RTE_LIBRTE_LPM=n > +CONFIG_RTE_LIBRTE_ACL=n > +CONFIG_RTE_LIBRTE_SCHED=n > +CONFIG_RTE_LIBRTE_PORT=n > +CONFIG_RTE_LIBRTE_PIPELINE=n > +CONFIG_RTE_LIBRTE_TABLE=n The libraries compile (except for LPM and ACL, due to SSE issues), so some of them will be enabled by default.
> +CONFIG_RTE_IXGBE_INC_VECTOR=n > +CONFIG_RTE_LIBRTE_VIRTIO_PMD=n > +CONFIG_RTE_LIBRTE_CXGBE_PMD=n > + > diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic.h b/lib/librte_eal/common/include/arch/arm/rte_atomic.h > new file mode 100644 > index 0000000..2a7339c > --- /dev/null > +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic.h > @@ -0,0 +1,269 @@ > +/*- > + * BSD LICENSE > + * > + * Copyright(c) 2015 Intel Corporation. All rights reserved. > + * > [...] > + */ > + > +#ifndef _RTE_ATOMIC_ARM32_H_ > +#define _RTE_ATOMIC_ARM32_H_ > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +#include "generic/rte_atomic.h" > + > +/** > + * @file > + * Atomic Operations > + * > + * This file defines a API for atomic operations. > + */ > + > +/** > + * General memory barrier. > + * > + * Guarantees that the LOAD and STORE operations generated before the > + * barrier occur before the LOAD and STORE operations generated after. > + */ > +#define rte_mb() { __sync_synchronize(); } > + > +/** > + * Write memory barrier. > + * > + * Guarantees that the STORE operations generated before the barrier > + * occur before the STORE operations generated after. > + */ > +#define rte_wmb() {asm volatile("dsb st" : : : "memory"); } > + > +/** > + * Read memory barrier. > + * > + * Guarantees that the LOAD operations generated before the barrier > + * occur before the LOAD operations generated after. > + */ > +#define rte_rmb() {asm volatile("dsb " : : : "memory"); } > + This is probably too strong barrier. Our patch series also doesn't do it properly. It seems, we better should have rte_wmb: dmb st rte_rmb: dmb sy (= __sync_synchronize) > + > + > +/*------------------------- 16 bit atomic operations -------------------------*/ > + > +#ifndef RTE_FORCE_INTRINSICS > +static inline int > +rte_atomic16_cmpset(volatile uint16_t *dst, uint16_t exp, uint16_t src) > +{ > + return __sync_bool_compare_and_swap(dst, exp, src); Should we go with __sync_* or __atomic_* constructs? The __atomic_* instrinsics are coming with GCC 4.7. Our patch set uses __atomic_* ones. 
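To make this concrete, a rough sketch of the direction I have in mind (assuming ARMv7-A, where dmb is available, and GCC >= 4.7 for the __atomic_* builtins; untested, for illustration only):

#include <stdint.h>

#define rte_mb()  do { asm volatile ("dmb sy" : : : "memory"); } while (0)
#define rte_wmb() do { asm volatile ("dmb st" : : : "memory"); } while (0)
#define rte_rmb() do { asm volatile ("dmb sy" : : : "memory"); } while (0)

static inline int
rte_atomic16_cmpset(volatile uint16_t *dst, uint16_t exp, uint16_t src)
{
	/* __atomic_compare_exchange_n writes the observed value back into
	 * exp on failure; exp is a local copy here, so that is harmless. */
	return __atomic_compare_exchange_n(dst, &exp, src, 0,
			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

Whether the inner-shareable variants (dmb ish / dmb ishst) would be sufficient for our purposes is a separate question to check.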
> +} > + > +static inline void > +rte_atomic16_inc(rte_atomic16_t *v) > +{ > + rte_atomic16_add(v, 1); > +} > + > +static inline void > +rte_atomic16_dec(rte_atomic16_t *v) > +{ > + rte_atomic16_sub(v, 1); > +} > + > +static inline int rte_atomic16_inc_and_test(rte_atomic16_t *v) > +{ > + return (__sync_add_and_fetch(&v->cnt, 1) == 0); > +} > + > +static inline int rte_atomic16_dec_and_test(rte_atomic16_t *v) > +{ > + return (__sync_sub_and_fetch(&v->cnt, 1) == 0); > +} > + > +static inline int rte_atomic16_test_and_set(rte_atomic16_t *v) > +{ > + return rte_atomic16_cmpset((volatile uint16_t *)&v->cnt, 0, 1); > +} > + > + > +/*------------------------- 32 bit atomic operations -------------------------*/ > + > +static inline int > +rte_atomic32_cmpset(volatile uint32_t *dst, uint32_t exp, uint32_t src) > +{ > + return __sync_bool_compare_and_swap(dst, exp, src); > +} > + > +static inline void > +rte_atomic32_inc(rte_atomic32_t *v) > +{ > + rte_atomic32_add(v, 1); > +} > + > +static inline void > +rte_atomic32_dec(rte_atomic32_t *v) > +{ > + rte_atomic32_sub(v, 1); > +} > + > +static inline int rte_atomic32_inc_and_test(rte_atomic32_t *v) > +{ > + return (__sync_add_and_fetch(&v->cnt, 1) == 0); > +} > + > +static inline int rte_atomic32_dec_and_test(rte_atomic32_t *v) > +{ > + return (__sync_sub_and_fetch(&v->cnt, 1) == 0); > +} > + > +static inline int rte_atomic32_test_and_set(rte_atomic32_t *v) > +{ > + return rte_atomic32_cmpset((volatile uint32_t *)&v->cnt, 0, 1); > +} > + > +/*------------------------- 64 bit atomic operations -------------------------*/ > + > +static inline int > +rte_atomic64_cmpset(volatile uint64_t *dst, uint64_t exp, uint64_t src) > +{ > + return __sync_bool_compare_and_swap(dst, exp, src); > +} > + > +static inline void > +rte_atomic64_init(rte_atomic64_t *v) > +{ > +#ifdef __LP64__ > + v->cnt = 0; I assume, this is for ARMv8... 
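For the 32-bit (#else) branches below: if we settle on GCC >= 4.7 and the __atomic_* builtins, the cmpset loops could probably go away, since the compiler should expand 8-byte atomics to ldrexd/strexd on ARMv7-A (worth verifying on the exact GCC versions we want to support). A rough, untested sketch:

static inline int64_t
rte_atomic64_read(rte_atomic64_t *v)
{
	return __atomic_load_n(&v->cnt, __ATOMIC_SEQ_CST);
}

static inline void
rte_atomic64_set(rte_atomic64_t *v, int64_t new_value)
{
	__atomic_store_n(&v->cnt, new_value, __ATOMIC_SEQ_CST);
}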
> +#else > + int success = 0; > + uint64_t tmp; > + > + while (success == 0) { > + tmp = v->cnt; > + success = rte_atomic64_cmpset((volatile uint64_t *)&v->cnt, > + tmp, 0); > + } > +#endif > +} > + > +static inline int64_t > +rte_atomic64_read(rte_atomic64_t *v) > +{ > +#ifdef __LP64__ > + return v->cnt; > +#else > + int success = 0; > + uint64_t tmp; > + > + while (success == 0) { > + tmp = v->cnt; > + /* replace the value by itself */ > + success = rte_atomic64_cmpset((volatile uint64_t *)&v->cnt, > + tmp, tmp); > + } > + return tmp; > +#endif > +} > + > +static inline void > +rte_atomic64_set(rte_atomic64_t *v, int64_t new_value) > +{ > +#ifdef __LP64__ > + v->cnt = new_value; > +#else > + int success = 0; > + uint64_t tmp; > + > + while (success == 0) { > + tmp = v->cnt; > + success = rte_atomic64_cmpset((volatile uint64_t *)&v->cnt, > + tmp, new_value); > + } > +#endif > +} > + > +static inline void > +rte_atomic64_add(rte_atomic64_t *v, int64_t inc) > +{ > + __sync_fetch_and_add(&v->cnt, inc); > +} > + > +static inline void > +rte_atomic64_sub(rte_atomic64_t *v, int64_t dec) > +{ > + __sync_fetch_and_sub(&v->cnt, dec); > +} > + > +static inline void > +rte_atomic64_inc(rte_atomic64_t *v) > +{ > + rte_atomic64_add(v, 1); > +} > + > +static inline void > +rte_atomic64_dec(rte_atomic64_t *v) > +{ > + rte_atomic64_sub(v, 1); > +} > + > +static inline int64_t > +rte_atomic64_add_return(rte_atomic64_t *v, int64_t inc) > +{ > + return __sync_add_and_fetch(&v->cnt, inc); > +} > + > +static inline int64_t > +rte_atomic64_sub_return(rte_atomic64_t *v, int64_t dec) > +{ > + return __sync_sub_and_fetch(&v->cnt, dec); > +} > + > +static inline int rte_atomic64_inc_and_test(rte_atomic64_t *v) > +{ > + return rte_atomic64_add_return(v, 1) == 0; > +} > + > +static inline int rte_atomic64_dec_and_test(rte_atomic64_t *v) > +{ > + return rte_atomic64_sub_return(v, 1) == 0; > +} > + > +static inline int rte_atomic64_test_and_set(rte_atomic64_t *v) > +{ > + return rte_atomic64_cmpset((volatile uint64_t *)&v->cnt, 0, 1); > +} > + > +static inline void rte_atomic64_clear(rte_atomic64_t *v) > +{ > + rte_atomic64_set(v, 0); > +} > +#endif > + > +#ifdef __cplusplus > +} > +#endif > + > +#endif /* _RTE_ATOMIC_ARM32_H_ */ > diff --git a/lib/librte_eal/common/include/arch/arm/rte_byteorder.h b/lib/librte_eal/common/include/arch/arm/rte_byteorder.h > new file mode 100644 > index 0000000..effbd62 > --- /dev/null > +++ b/lib/librte_eal/common/include/arch/arm/rte_byteorder.h > @@ -0,0 +1,146 @@ > +/*- > + * BSD LICENSE > + * > + * Copyright (C) IBM Corporation 2014. All rights reserved. > + * Copyright(c) 2015 Intel Corporation. All rights reserved. > + * > [...] > + */ > + > +/* Inspired from FreeBSD src/sys/powerpc/include/endian.h > + * Copyright (c) 1987, 1991, 1993 > + * The Regents of the University of California. All rights reserved. > +*/ > + > +#ifndef _RTE_BYTEORDER_ARM32_H_ > +#define _RTE_BYTEORDER_ARM32_H_ > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +#include "generic/rte_byteorder.h" > + > +/* > + * An architecture-optimized byte swap for a 16-bit value. > + * > + * Do not use this function directly. The preferred function is rte_bswap16(). > + */ > +static inline uint16_t > +rte_arch_bswap16(uint16_t _x) > +{ > + return __builtin_bswap16(_x); This has been available since GCC 4.8. Our code uses assembler for this (so we can use 4.7, with using __sync_* constructs instead of __atomic_* we can support even 4.6). 
http://www.serverphorums.com/read.php?12,641912 Do we want to support GCC 4.6+, or is it sufficient to stay at 4.7+ or even 4.8+? > +} > + > +/* > + * An architecture-optimized byte swap for a 32-bit value. > + * > + * Do not use this function directly. The preferred function is rte_bswap32(). > + */ > +static inline uint32_t > +rte_arch_bswap32(uint32_t _x) > +{ > + return __builtin_bswap32(_x); > +} > + > +/* > + * An architecture-optimized byte swap for a 64-bit value. > + * > + * Do not use this function directly. The preferred function is rte_bswap64(). > + */ > +/* 64-bit mode */ > +static inline uint64_t > +rte_arch_bswap64(uint64_t _x) > +{ > + return __builtin_bswap64(_x); > +} > + > +#ifndef RTE_FORCE_INTRINSICS > +#define rte_bswap16(x) ((uint16_t)(__builtin_constant_p(x) ? \ > + rte_constant_bswap16(x) : \ > + rte_arch_bswap16(x))) > + > +#define rte_bswap32(x) ((uint32_t)(__builtin_constant_p(x) ? \ > + rte_constant_bswap32(x) : \ > + rte_arch_bswap32(x))) > + > +#define rte_bswap64(x) ((uint64_t)(__builtin_constant_p(x) ? \ > + rte_constant_bswap64(x) : \ > + rte_arch_bswap64(x))) > +#else > + /* > + * __builtin_bswap16 is only available gcc 4.8 and upwards > + */ > +#if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 8) > +#define rte_bswap16(x) ((uint16_t)(__builtin_constant_p(x) ? \ > + rte_constant_bswap16(x) : \ > + rte_arch_bswap16(x))) > +#endif > +#endif > + > +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN > + > +#define rte_cpu_to_le_16(x) (x) > +#define rte_cpu_to_le_32(x) (x) > +#define rte_cpu_to_le_64(x) (x) > + > +#define rte_cpu_to_be_16(x) rte_bswap16(x) > +#define rte_cpu_to_be_32(x) rte_bswap32(x) > +#define rte_cpu_to_be_64(x) rte_bswap64(x) > + > +#define rte_le_to_cpu_16(x) (x) > +#define rte_le_to_cpu_32(x) (x) > +#define rte_le_to_cpu_64(x) (x) > + > +#define rte_be_to_cpu_16(x) rte_bswap16(x) > +#define rte_be_to_cpu_32(x) rte_bswap32(x) > +#define rte_be_to_cpu_64(x) rte_bswap64(x) > + > +#else /* RTE_BIG_ENDIAN */ > + > +#define rte_cpu_to_le_16(x) rte_bswap16(x) > +#define rte_cpu_to_le_32(x) rte_bswap32(x) > +#define rte_cpu_to_le_64(x) rte_bswap64(x) > + > +#define rte_cpu_to_be_16(x) (x) > +#define rte_cpu_to_be_32(x) (x) > +#define rte_cpu_to_be_64(x) (x) > + > +#define rte_le_to_cpu_16(x) rte_bswap16(x) > +#define rte_le_to_cpu_32(x) rte_bswap32(x) > +#define rte_le_to_cpu_64(x) rte_bswap64(x) > + > +#define rte_be_to_cpu_16(x) (x) > +#define rte_be_to_cpu_32(x) (x) > +#define rte_be_to_cpu_64(x) (x) > +#endif > + > +#ifdef __cplusplus > +} > +#endif > + > +#endif /* _RTE_BYTEORDER_ARM32_H_ */ > diff --git a/lib/librte_eal/common/include/arch/arm/rte_cpuflags.h b/lib/librte_eal/common/include/arch/arm/rte_cpuflags.h > new file mode 100644 > index 0000000..411ca37 > --- /dev/null > +++ b/lib/librte_eal/common/include/arch/arm/rte_cpuflags.h > @@ -0,0 +1,138 @@ > +/*- > + * BSD LICENSE > + * > + * Copyright (C) IBM Corporation 2014. All rights reserved. > + * Copyright(c) 2015 Intel Corporation. All rights reserved. > + * > [...] 
> + */ > + > +#ifndef _RTE_CPUFLAGS_ARM32_H_ > +#define _RTE_CPUFLAGS_ARM32_H_ > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +#include > +#include > +#include > +#include > +#include > + > +#include "generic/rte_cpuflags.h" > + > +/* Symbolic values for the entries in the auxiliary table */ > +#define AT_HWCAP 16 > +#define AT_HWCAP2 26 > +#define AT_PLATFORM 15 > + > +/* software based registers */ > +enum cpu_register_t { > + REG_HWCAP = 0, > + AARCH_MODE, > +}; Our code can also read REG_HWCAP2 but not the AT_PLATFORM. > + > +/** > + * Enumeration of all CPU features supported > + */ > +enum rte_cpu_flag_t { > + RTE_CPUFLAG_FP = 0, > + RTE_CPUFLAG_ASIMD, > + RTE_CPUFLAG_EVTSTRM, > + RTE_CPUFLAG_AARCH64, > + RTE_CPUFLAG_AARCH32, Nice, you can detect 32/64 bit ARM architecture here. We didn't even try. > + /* The last item */ > + RTE_CPUFLAG_NUMFLAGS,/**< This should always be the last! */ > +}; > + > +static const struct feature_entry cpu_feature_table[] = { > + FEAT_DEF(FP, 0x00000001, 0, REG_HWCAP, 0) > + FEAT_DEF(ASIMD, 0x00000001, 0, REG_HWCAP, 1) > + FEAT_DEF(EVTSTRM, 0x00000001, 0, REG_HWCAP, 2) > + FEAT_DEF(AARCH64, 0x00000001, 0, AARCH_MODE, 3) > + FEAT_DEF(AARCH32, 0x00000001, 0, AARCH_MODE, 4) > +}; > + > +/* > + * Read AUXV software register and get cpu features for Power > + */ > +static inline void > +rte_cpu_get_features(__attribute__((unused)) uint32_t leaf, > + __attribute__((unused)) uint32_t subleaf, cpuid_registers_t out) > +{ > + int auxv_fd; > + Elf32_auxv_t auxv; > + > + auxv_fd = open("/proc/self/auxv", O_RDONLY); > + assert(auxv_fd); > + while (read(auxv_fd, &auxv, > + sizeof(Elf32_auxv_t)) == sizeof(Elf32_auxv_t)) { > + if (auxv.a_type == AT_HWCAP) > + out[REG_HWCAP] = auxv.a_un.a_val; > + if (auxv.a_type == AT_PLATFORM) { > + if (strcmp((const char *)auxv.a_un.a_val, "aarch64") > + == 0) > + out[AARCH_MODE] = (1 << 3); > + else if (strcmp((const char *)auxv.a_un.a_val, > + "aarch32") == 0) > + out[AARCH_MODE] = (1 << 4); Why there are such strange values 1 << 3 and 1 << 4? > + } > + } > +} > + > +/* > + * Checks if a particular flag is available on current machine. > + */ > +static inline int > +rte_cpu_get_flag_enabled(enum rte_cpu_flag_t feature) > +{ > + const struct feature_entry *feat; > + cpuid_registers_t regs = {0}; > + > + if (feature >= RTE_CPUFLAG_NUMFLAGS) > + /* Flag does not match anything in the feature tables */ > + return -ENOENT; > + > + feat = &cpu_feature_table[feature]; > + > + if (!feat->leaf) > + /* This entry in the table wasn't filled out! */ > + return -EFAULT; > + > + /* get the cpuid leaf containing the desired feature */ > + rte_cpu_get_features(feat->leaf, feat->subleaf, regs); > + > + /* check if the feature is enabled */ > + return (regs[feat->reg] >> feat->bit) & 1; > +} > + > +#ifdef __cplusplus > +} > +#endif > + > +#endif /* _RTE_CPUFLAGS_ARM32_H_ */ > diff --git a/lib/librte_eal/common/include/arch/arm/rte_cycles.h b/lib/librte_eal/common/include/arch/arm/rte_cycles.h > new file mode 100644 > index 0000000..140d1bb > --- /dev/null > +++ b/lib/librte_eal/common/include/arch/arm/rte_cycles.h > @@ -0,0 +1,77 @@ > +/*- > + * BSD LICENSE > + * > + * Copyright (C) IBM Corporation 2014. All rights reserved. > + * Copyright(c) 2015 Intel Corporation. All rights reserved. > [...] > + */ > + > +#ifndef _RTE_CYCLES_ARM32_H_ > +#define _RTE_CYCLES_ARM32_H_ > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +#include "generic/rte_cycles.h" > + > +/** > + * Read the time base register. 
> + * > + * @return > + * The time base for this lcore. > + */ > +static inline uint64_t > +rte_rdtsc(void) > +{ > + unsigned tsc; > + uint64_t final_tsc; > + > + /* Read PMCCNTR */ > + asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(tsc)); > + /* 1 tick = 64 clocks */ > + final_tsc = ((uint64_t)tsc) << 6; > + > + return (uint64_t)final_tsc; > +} Are you sure, the PMCCNTR is not used by the Linux Kernel? Would we break something there? At least, the arch/arm/kernel/perf_event_v7.c does use the counter and it has routines for both reading and writing it. So, I am afraid that we can break perf by this approach... Our one, not the very best... +static inline uint64_t +rte_rdtsc(void) +{ + struct timespec val; + uint64_t v; + + while (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) + /* no body */; + + v = (uint64_t) val.tv_sec * 1000000000LL; + v += (uint64_t) val.tv_nsec; + return v; +} We can make it configurable. If somebody doesn't care about the perf support but likes the lower-latency PMU counter, it might be possible to switch for another implementation. > + > +static inline uint64_t > +rte_rdtsc_precise(void) > +{ > + asm volatile("isb sy" : : : ); > + return rte_rdtsc(); > +} > + > +static inline uint64_t > +rte_get_tsc_cycles(void) { return rte_rdtsc(); } > + > +#ifdef __cplusplus > +} > +#endif > + > +#endif /* _RTE_CYCLES_ARM32_H_ */ > diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy.h b/lib/librte_eal/common/include/arch/arm/rte_memcpy.h > new file mode 100644 > index 0000000..341052e > --- /dev/null > +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy.h > @@ -0,0 +1,101 @@ > +/*- > + * BSD LICENSE > + * > + * Copyright (C) IBM Corporation 2014. All rights reserved. > + * Copyright(c) 2015 Intel Corporation. All rights reserved. > + * > [...] > + */ > + > +#ifndef _RTE_MEMCPY_ARM32_H_ > +#define _RTE_MEMCPY_ARM32_H_ > + > +#include > +#include > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +#include "generic/rte_memcpy.h" > + > + > + > +static inline void > +rte_mov16(uint8_t *dst, const uint8_t *src) > +{ > + memcpy(dst, src, 16); > +} > + > +static inline void > +rte_mov32(uint8_t *dst, const uint8_t *src) > +{ > + memcpy(dst, src, 32); > +} > + > +static inline void > +rte_mov48(uint8_t *dst, const uint8_t *src) > +{ > + memcpy(dst, src, 48); > +} > + > +static inline void > +rte_mov64(uint8_t *dst, const uint8_t *src) > +{ > + memcpy(dst, src, 64); > +} > + > +static inline void > +rte_mov128(uint8_t *dst, const uint8_t *src) > +{ > + memcpy(dst, src, 128); > +} > + > +static inline void > +rte_mov256(uint8_t *dst, const uint8_t *src) > +{ > + memcpy(dst, src, 256); > +} > + > +static inline void * > +rte_memcpy(void *dst, const void *src, size_t n) > +{ > + return memcpy(dst, src, n); > +} > + > + > +static inline void * > +rte_memcpy_func(void *dst, const void *src, size_t n) > +{ > + return memcpy(dst, src, n); > +} Here we do better with the NEON optimizations. However, I am not quite sure how to behave on architectures _without_ NEON. Second, if you look at the statistics we made (http://dpdk.org/dev/patchwork/patch/7389/) you can see that the optimization makes sense for Cortex-A15 only. For the other cores (A7, A9), it helps for lengths <= 32 B. Should we do some runtime check for NEON presence? Should we do some runtime check for the current ARM machine type? Should we optimize only for the specific lengths? 
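As a starting point for the runtime check: on Linux the NEON bit is exported via HWCAP, so the dispatch could be decided once at initialization. A rough sketch of the idea (assuming glibc >= 2.16 for getauxval(); rte_memcpy_neon() is a made-up placeholder for the NEON-optimized routine, not an existing function):

#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>	/* getauxval(), glibc >= 2.16 */
#include <asm/hwcap.h>	/* HWCAP_NEON */

/* Placeholder for the NEON-optimized copy, illustration only. */
void *rte_memcpy_neon(void *dst, const void *src, size_t n);

static void *(*rte_memcpy_impl)(void *, const void *, size_t) = memcpy;

static inline void
rte_memcpy_init(void)
{
	/* Select the copy routine once, based on the HWCAP NEON bit. */
	if (getauxval(AT_HWCAP) & HWCAP_NEON)
		rte_memcpy_impl = rte_memcpy_neon;
}

static inline void *
rte_memcpy_func(void *dst, const void *src, size_t n)
{
	return rte_memcpy_impl(dst, src, n);
}

The same hook could also take the length into account, falling back to plain memcpy() above the ~32 B point where A7/A9 stop benefiting.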
> + > +#ifdef __cplusplus > +} > +#endif > + > +#endif /* _RTE_MEMCPY_ARM32_H_ */ > diff --git a/lib/librte_eal/common/include/arch/arm/rte_prefetch.h b/lib/librte_eal/common/include/arch/arm/rte_prefetch.h > new file mode 100644 > index 0000000..47be3b8 > --- /dev/null > +++ b/lib/librte_eal/common/include/arch/arm/rte_prefetch.h > @@ -0,0 +1,64 @@ > +/*- > + * BSD LICENSE > + * > + * Copyright(c) 2015 Intel Corporation. All rights reserved. > + * > [...] > + */ > + > +#ifndef _RTE_PREFETCH_ARM32_H_ > +#define _RTE_PREFETCH_ARM32_H_ > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +#include "generic/rte_prefetch.h" > + > +static inline void > +rte_prefetch0(const volatile void __attribute__((unused)) *p) > +{ > + asm volatile("nop"); > +} > + > +static inline void > +rte_prefetch1(const volatile void __attribute__((unused)) *p) > +{ > + asm volatile("nop"); > +} > + > +static inline void > +rte_prefetch2(const volatile void __attribute__((unused)) *p) > +{ > + asm volatile("nop"); > +} We use PLD instruction here. Probably, we can use __builtin_prefetch. > + > +#ifdef __cplusplus > +} > +#endif > + > +#endif /* _RTE_PREFETCH_ARM32_H_ */ > diff --git a/lib/librte_eal/common/include/arch/arm/rte_rwlock.h b/lib/librte_eal/common/include/arch/arm/rte_rwlock.h > new file mode 100644 > index 0000000..87349b9 > --- /dev/null > +++ b/lib/librte_eal/common/include/arch/arm/rte_rwlock.h > @@ -0,0 +1,70 @@ > +/*- > + * BSD LICENSE > + * > + * Copyright(c) 2015 Intel Corporation. All rights reserved. > + * > [...] > + */ > + > +#ifndef _RTE_RWLOCK_ARM32_H_ > +#define _RTE_RWLOCK_ARM32_H_ > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +#include "generic/rte_rwlock.h" > + > +static inline void > +rte_rwlock_read_lock_tm(rte_rwlock_t *rwl) > +{ > + rte_rwlock_read_lock(rwl); > +} > + > +static inline void > +rte_rwlock_read_unlock_tm(rte_rwlock_t *rwl) > +{ > + rte_rwlock_read_unlock(rwl); > +} > + > +static inline void > +rte_rwlock_write_lock_tm(rte_rwlock_t *rwl) > +{ > + rte_rwlock_write_lock(rwl); > +} > + > +static inline void > +rte_rwlock_write_unlock_tm(rte_rwlock_t *rwl) > +{ > + rte_rwlock_write_unlock(rwl); > +} > + > +#ifdef __cplusplus > +} > +#endif > + > +#endif /* _RTE_RWLOCK_ARM32_H_ */ > diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h > new file mode 100644 > index 0000000..45d01a0 > --- /dev/null > +++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h > @@ -0,0 +1,116 @@ > +/*- > + * BSD LICENSE > + * > + * Copyright (C) IBM Corporation 2014. All rights reserved. > + * Copyright(c) 2015 Intel Corporation. All rights reserved. > + * > [...] 
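Coming back to the rte_prefetch stubs above: a minimal sketch using the builtin (which emits PLD where the target supports it) could be as simple as:

static inline void
rte_prefetch0(const volatile void *p)
{
	/* 0 = prefetch for read, 3 = high temporal locality */
	__builtin_prefetch((const void *)p, 0, 3);
}

with rte_prefetch1()/rte_prefetch2() done the same way (possibly with lower locality hints).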
> + */ > + > +#ifndef _RTE_SPINLOCK_ARM32_H_ > +#define _RTE_SPINLOCK_ARM32_H_ > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +#include > + > +#include "generic/rte_spinlock.h" > + > + > +#ifndef RTE_FORCE_INTRINSICS > + > +static inline void > +rte_spinlock_lock(rte_spinlock_t *sl) > +{ > + while (__sync_lock_test_and_set(&sl->locked, 1)) > + while (sl->locked) > + rte_pause(); > +} > + > +static inline void > +rte_spinlock_unlock(rte_spinlock_t *sl) > +{ > + __sync_lock_release(&sl->locked); > +} > + > +static inline int > +rte_spinlock_trylock(rte_spinlock_t *sl) > +{ > + return (__sync_lock_test_and_set(&sl->locked, 1) == 0); > +} > + > +#endif > + > +static inline int > +rte_tm_supported(void) > +{ > + return 0; > +} > + > +static inline void > +rte_spinlock_lock_tm(rte_spinlock_t *sl) > +{ > + rte_spinlock_lock(sl); /* fall-back */ > +} > + > +static inline int > +rte_spinlock_trylock_tm(rte_spinlock_t *sl) > +{ > + return rte_spinlock_trylock(sl); > +} > + > +static inline void > +rte_spinlock_unlock_tm(rte_spinlock_t *sl) > +{ > + rte_spinlock_unlock(sl); > +} > + > +static inline void > +rte_spinlock_recursive_lock_tm(rte_spinlock_recursive_t *slr) > +{ > + rte_spinlock_recursive_lock(slr); /* fall-back */ > +} > + > +static inline void > +rte_spinlock_recursive_unlock_tm(rte_spinlock_recursive_t *slr) > +{ > + rte_spinlock_recursive_unlock(slr); > +} > + > +static inline int > +rte_spinlock_recursive_trylock_tm(rte_spinlock_recursive_t *slr) > +{ > + return rte_spinlock_recursive_trylock(slr); > +} > + > +#ifdef __cplusplus > +} > +#endif > + > +#endif /* _RTE_SPINLOCK_ARM32_H_ */ > diff --git a/lib/librte_eal/common/include/arch/arm/rte_vect.h b/lib/librte_eal/common/include/arch/arm/rte_vect.h > new file mode 100644 > index 0000000..5994efd > --- /dev/null > +++ b/lib/librte_eal/common/include/arch/arm/rte_vect.h > @@ -0,0 +1,37 @@ > +/*- > + * BSD LICENSE > + * > + * Copyright(c) 2015 Intel Corporation. All rights reserved. > + * > [...] > + */ > + > +#ifndef _RTE_VECT_ARM32_H_ > +#define _RTE_VECT_ARM32_H_ > + > + > +#endif /* _RTE_VECT_ARM32_H_*/ > diff --git a/lib/librte_eal/linuxapp/Makefile b/lib/librte_eal/linuxapp/Makefile > index d9c5233..549f2ed 100644 > --- a/lib/librte_eal/linuxapp/Makefile > +++ b/lib/librte_eal/linuxapp/Makefile > @@ -35,6 +35,9 @@ ifeq ($(CONFIG_RTE_EAL_IGB_UIO),y) > DIRS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += igb_uio > endif > DIRS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += eal > +ifeq ($(CONFIG_RTE_ARCH), "arm") > +DIRS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += arm_pmu > +endif > ifeq ($(CONFIG_RTE_KNI_KMOD),y) > DIRS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += kni > endif > diff --git a/lib/librte_eal/linuxapp/arm_pmu/Makefile b/lib/librte_eal/linuxapp/arm_pmu/Makefile > new file mode 100755 > index 0000000..0adb23b > --- /dev/null > +++ b/lib/librte_eal/linuxapp/arm_pmu/Makefile > @@ -0,0 +1,52 @@ > +# BSD LICENSE > +# > +# Copyright(c) 2015 Intel Corporation. All rights reserved. > +# > [...] 
> + > +include $(RTE_SDK)/mk/rte.vars.mk > + > +# > +# module name and path > +# > +MODULE = arm_pmu > +MODULE_PATH = drivers/net/arm_pmu > + > +# > +# CFLAGS > +# > +MODULE_CFLAGS += -I$(SRCDIR) --param max-inline-insns-single=100 > +MODULE_CFLAGS += -I$(RTE_OUTPUT)/include > +MODULE_CFLAGS += -Winline -Wall -Werror > +MODULE_CFLAGS += -include $(RTE_OUTPUT)/include/rte_config.h > + > +# > +# all source are stored in SRCS-y > +# > +SRCS-y := rte_enable_pmu.c > + > +include $(RTE_SDK)/mk/rte.module.mk > diff --git a/lib/librte_eal/linuxapp/arm_pmu/rte_enable_pmu.c b/lib/librte_eal/linuxapp/arm_pmu/rte_enable_pmu.c > new file mode 100644 > index 0000000..9d77271 > --- /dev/null > +++ b/lib/librte_eal/linuxapp/arm_pmu/rte_enable_pmu.c > @@ -0,0 +1,83 @@ > +/*- > + * BSD LICENSE > + * > + * Copyright(c) 2015 Intel Corporation. All rights reserved. > + * > [...] > + */ > + > + /** > + * @file > + * This file enables the ARM PMU. > + * > + * It contains the code to enable the usage of PMU registers > + * by application code on all cores. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +void enable_pmu(void __attribute__((unused)) *info) > +{ > + int core_id = 0; > + > + core_id = smp_processor_id(); > + /* Configure PMUSERENR - Enables user mode to have access to the > + * Performance Monitor Registers. */ > + asm volatile("mcr p15, 0, %0, c9, c14, 0" : : "r"(1)); > + /* Configure PMCR */ > + asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(29)); > + /* Configure PMCNTENSET Enable counters */ > + asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(0x8000000f)); > + return; > +} > + > +int __init init_module_set_arm_pmu(void) > +{ > + int ret = 0; > + /* smp_call_function will make sure that the registers > + * for all the cores are enabled for tracking their PMUs > + * This is required for the rdtsc equivalent on ARMv7 */ > + ret = smp_call_function(&enable_pmu, NULL, 0); > + /* smp_call_function would only call on all other modules > + * so call it for the current module */ > + enable_pmu(NULL); > + return ret; > +} > + > +void __exit cleanup_module_set_arm_pmu(void) > +{ > +} > + > +module_init(init_module_set_arm_pmu); > +module_exit(cleanup_module_set_arm_pmu); Nice, but as I've already stated, setting the PMU can break the perf. Otherwise, it is quite nice solution. Anyway, a kernel module will be necessary for the ARM machines utilizing the internal EMACs. We have to deal with I/O coherency, phyters and clocks control. The I/O coherency might not be an issue for Cortex-A15 (and newer) as there is a cache-coherent interconnect available (using AMBA ACE extension). However, I didn't find out, which SoCs integrates this interconnect. For older SoCs, we have to deal with this. I am also thinking of using the non-cachable memory only, at least at the beginning. This needs to set the proper MMU/TLB flags to the allocated huge pages. The processing will be slow this way, but there will be no need for other system calls to the kernel during sending/receiving packets. I've measured that accessing only few bytes (say 128-512 B) in non-cacheable areas does not impact the processing that much, so it could be sufficient for tasks like L2/L3 forwarding, etc. But, I have no idea, how to properly access the MMU/TLB settings of the selected (or all?) huge pages. This will be a hack into the Linux Kernel. 
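Coming back to the timer question for a moment: one way to keep the low-latency counter without silently breaking perf would be to make the source a build-time choice. A rough sketch, where RTE_ARM_EAL_RDTSC_USE_PMU is a made-up option name:

#include <stdint.h>
#include <time.h>

static inline uint64_t
rte_rdtsc(void)
{
#ifdef RTE_ARM_EAL_RDTSC_USE_PMU
	unsigned int tsc;

	/* PMCCNTR, 1 tick = 64 clocks; needs the PMU kernel module and
	 * may conflict with perf, as noted above. */
	asm volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (tsc));
	return (uint64_t)tsc << 6;
#else
	struct timespec val;

	/* Portable fallback, does not touch the PMU. */
	while (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0)
		/* retry */;
	return (uint64_t)val.tv_sec * 1000000000ULL + (uint64_t)val.tv_nsec;
#endif
}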
> diff --git a/mk/arch/arm/rte.vars.mk b/mk/arch/arm/rte.vars.mk > new file mode 100644 > index 0000000..3dfeb16 > --- /dev/null > +++ b/mk/arch/arm/rte.vars.mk > @@ -0,0 +1,58 @@ > +# BSD LICENSE > +# > +# Copyright(c) 2015 Intel Corporation. All rights reserved. > +# > [...] > +# > +# arch: > +# > +# - define ARCH variable (overriden by cmdline or by previous > +# optional define in machine .mk) > +# - define CROSS variable (overriden by cmdline or previous define > +# in machine .mk) > +# - define CPU_CFLAGS variable (overriden by cmdline or previous > +# define in machine .mk) > +# - define CPU_LDFLAGS variable (overriden by cmdline or previous > +# define in machine .mk) > +# - define CPU_ASFLAGS variable (overriden by cmdline or previous > +# define in machine .mk) > +# - may override any previously defined variable > +# > +# examples for CONFIG_RTE_ARCH: i686, x86_64, x86_64_32 > +# > + > +ARCH ?= arm > +# common arch dir in eal headers > +ARCH_DIR := arm > +CROSS ?= > + > +CPU_CFLAGS += -flax-vector-conversions https://gcc.gnu.org/onlinedocs/gcc/C-Dialect-Options.html -flax-vector-conversions Allow implicit conversions between vectors with differing numbers of elements and/or incompatible element types. This option should not be used for new code. I have no idea what is this option about... is it necessary? > +CPU_LDFLAGS ?= > +CPU_ASFLAGS ?= We do: +CPU_CFLAGS ?= -marm -DRTE_CACHE_LINE_SIZE=64 -munaligned-access +CPU_LDFLAGS ?= +CPU_ASFLAGS ?= -felf The -marm can be probably omitted as the machine specific flags override it anyway. How is the cache line size specified for other platforms? Is it OK to go this way? https://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/ARM-Options.html#ARM-Options -munaligned-access -mno-unaligned-access Enables (or disables) reading and writing of 16- and 32- bit values from addresses that are not 16- or 32- bit aligned. By default unaligned access is disabled for all pre-ARMv6 and all ARMv6-M architectures, and enabled for all other architectures. If unaligned access is not enabled then words in packed data structures will be accessed a byte at a time. The ARM attribute Tag_CPU_unaligned_access will be set in the generated object file to either true or false, depending upon the setting of this option. If unaligned access is enabled then the preprocessor symbol __ARM_FEATURE_UNALIGNED will also be defined. > + > +export ARCH CROSS CPU_CFLAGS CPU_LDFLAGS CPU_ASFLAGS > diff --git a/mk/machine/armv7-a/rte.vars.mk b/mk/machine/armv7-a/rte.vars.mk > new file mode 100644 > index 0000000..feeebfa > --- /dev/null > +++ b/mk/machine/armv7-a/rte.vars.mk > @@ -0,0 +1,63 @@ > +# BSD LICENSE > +# > +# Copyright(c) 2015 Intel Corporation. All rights reserved. > +# > [...] > + > +# > +# machine: > +# > +# - can define ARCH variable (overriden by cmdline value) > +# - can define CROSS variable (overriden by cmdline value) > +# - define MACHINE_CFLAGS variable (overriden by cmdline value) > +# - define MACHINE_LDFLAGS variable (overriden by cmdline value) > +# - define MACHINE_ASFLAGS variable (overriden by cmdline value) > +# - can define CPU_CFLAGS variable (overriden by cmdline value) that > +# overrides the one defined in arch. > +# - can define CPU_LDFLAGS variable (overriden by cmdline value) that > +# overrides the one defined in arch. > +# - can define CPU_ASFLAGS variable (overriden by cmdline value) that > +# overrides the one defined in arch. 
> +# - may override any previously defined variable > +# > + > +# ARCH = > +# CROSS = > +# MACHINE_CFLAGS = > +# MACHINE_LDFLAGS = > +# MACHINE_ASFLAGS = > +# CPU_CFLAGS = > +# CPU_LDFLAGS = > +# CPU_ASFLAGS = > + > +MACHINE_CFLAGS += -march=armv7-a We do: +CPU_CFLAGS += -mfloat-abi=softfp + +MACHINE_CFLAGS = -march=armv7-a -mtune=cortex-a9 +MACHINE_CFLAGS += -mfpu=neon This is quite specific. How to deal with those flags? Should we have a specific config for Cortex A7, A9, A15, ...? What about NEON? > + > +#Tried this for gdb - did not work - Dwarf version error seen > +# we need a gdb that can understand dwarf version 4 > +#CPU_CFLAGS += -g -gdwarf-2 -gstrict-dwarf > +#CPU_CFLAGS := $(filter-out -O3,$(CPU_CFLAGS)) > +#CPU_CFLAGS := $(filter-out -O2,$(CPU_CFLAGS)) I can see, those lines are commented out. However, no idea about those, debugging should not be enabled by default, is it true? Is there a way in DPDK how to enable debugging at the compiler level? I do not understand what is the goal of the -O2/-O3 lines... > diff --git a/mk/toolchain/gcc/rte.vars.mk b/mk/toolchain/gcc/rte.vars.mk > index 0f51c66..501fce5 100644 > --- a/mk/toolchain/gcc/rte.vars.mk > +++ b/mk/toolchain/gcc/rte.vars.mk > @@ -1,7 +1,6 @@ > # BSD LICENSE > # > -# Copyright(c) 2010-2014 Intel Corporation. All rights reserved. > -# All rights reserved. > +# Copyright(c) 2010-2015 Intel Corporation. All rights reserved. > # > # Redistribution and use in source and binary forms, with or without > # modification, are permitted provided that the following conditions > @@ -73,7 +72,10 @@ endif > > WERROR_FLAGS := -W -Wall -Werror -Wstrict-prototypes -Wmissing-prototypes > WERROR_FLAGS += -Wmissing-declarations -Wold-style-definition -Wpointer-arith > -WERROR_FLAGS += -Wcast-align -Wnested-externs -Wcast-qual > +ifneq ($(CONFIG_RTE_ARCH), "arm") > +WERROR_FLAGS += -Wcast-align > +endif > +WERROR_FLAGS += -Wnested-externs -Wcast-qual > WERROR_FLAGS += -Wformat-nonliteral -Wformat-security > WERROR_FLAGS += -Wundef -Wwrite-strings > When you start compiling the libraries, the -Wcast-align will break the build (at least, it does for us). Here is, how we do after a short discussion with Thomas M. +# There are many issues reported for ARMv7 architecture +# which are not necessarily fatal. Report as warnings. +ifeq ($(CONFIG_RTE_ARCH_ARMv7),y) +WERROR_FLAGS += -Wno-error +endif We should identify the alignment issues and fix them somehow(?). -- Jan Viktorin E-mail: Viktorin@RehiveTech.com System Architect Web: www.RehiveTech.com RehiveTech Brno, Czech Republic