Hi Cristian, I agree the 64-bit version could enable better performance, and I will do it in the next version. For endianness, see my comments below (inline): On Mon, Feb 20, 2023 at 12:19 PM Dumitrescu, Cristian < cristian.dumitrescu@intel.com> wrote: > HI Bili, > > Comments inline below: > > > > > diff --git a/lib/hash/rte_hash_xor.h b/lib/hash/rte_hash_xor.h > > new file mode 100644 > > index 0000000000..7004f83ec2 > > --- /dev/null > > +++ b/lib/hash/rte_hash_xor.h > > @@ -0,0 +1,65 @@ > > +/* SPDX-License-Identifier: BSD-3-Clause > > + * Copyright(c) 2023 Intel Corporation > > + */ > > + > > +#ifndef _RTE_HASH_XOR_H_ > > +#define _RTE_HASH_XOR_H_ > > + > > +/** > > + * @file > > + * > > + * RTE XOR Hash > > + */ > > + > > +#ifdef __cplusplus > > +extern "C" { > > +#endif > > + > > +#include > > + > > +#include > > + > > +#define LEFT8b_MASK rte_cpu_to_be_32(0xff000000) > > I know that cpu_to_be() and be_to_cpu() are doing the same thing, but to > me the correct function to use here is be_to_cpu(), as the for loop in the > function below works with values in the CPU endianness. Would you agree? > What I have in mind is the 0xff000000 literal is in CPU endianness, and I'm converting it to big-endian. For performance reasons, I think it's better to work in big-endian in the loop. Also as Vladimir suggested above, I'll remove these masks and switch to using shifts directly. So I guess this won't matter anymore. > > > +#define LEFT16b_MASK rte_cpu_to_be_32(0xffff0000) > > I know that cpu_to_be() and be_to_cpu() are doing the same thing, but to > me the correct function to use here is be_to_cpu(), as the for loop in the > function below works with values in the CPU endianness. Would you agree? > > > + > > +/** > > + * Calculate XOR32 hash on user-supplied byte array. > > + * > > + * @param data > > + * Data to perform hash on. > > + * @param data_len > > + * How many bytes to use to calculate hash value. > > + * @param init_val > > + * Value to initialise hash generator. > > + * @return > > + * 32bit calculated hash value. > > + */ > > +static inline uint32_t > > +rte_hash_xor(const void *data, uint32_t data_len, uint32_t init_val) > > +{ > > + uint32_t i; > > + uintptr_t pd = (uintptr_t) data; > > + init_val = rte_cpu_to_be_32(init_val); > I don't think this is correct, I the correct version here is to remove the > above assignment, as I think the intention of this function (for > performance reasons) is to do the endianness conversion only when needed, > which is once at the end of the function (and also when handling the 2-byte > and 1-byte left-overs). > What I have in mind is to convert init_val to big-endian once here, instead of having to convert every byte array chunks from big-endian to host endian in the loop (more conversions, worse performance, see comment below). > > > + > > + for (i = 0; i < data_len / 4; i++) { > > + init_val ^= *(const uint32_t *)pd; > Look at this line, if init_val is still in host endianness, we need to do init_val ^= rte_be_to_cpu_32(*(const uint32_t *)pd) instead. I think the essential tradeoff here is that, if we convert init_val from host to be and back, we pay the constant cost of 2 conversions. The alternative is to convert byte array chunks from be to host. The number of conversions depends on data_len. Now I think of it more, maybe the best option is to condition on the value of data_len, and choose the method with fewer conversions. > > + pd += 4; > > + } > > + > > + if (data_len & 0x2) { > > + init_val ^= *(const uint32_t *)pd & LEFT16b_MASK; > > + pd += 2; > > + } > > + > > + if (data_len & 0x1) > > + init_val ^= *(const uint32_t *)pd & LEFT8b_MASK; > > + > > + init_val = rte_be_to_cpu_32(init_val); > > + return init_val; > > +} > > + > > Due to the XOR properties (endianness-insensitivity, no carry bits, etc) > and for performance reasons, I would also recommend implementing a 64-bit > version of this function (while keeping the 32-bit result), similar to this: > > uint64_t *data64 = (uint64_t *)data; > uint64_t result = init_data; > > /* Read & accumulate input data in 8-byte increments. */ > for (i = 0; i < data_len / 8; i++) > result ^= *data64++; > > data_len &= 0x7; > > /* Handle any remaining bytes in the input data (up to 7 bytes). */ > if (data_len >= 4) { > uint32_t *data32 = (uint32_t *)data64; > > /* Read and accumulate the next 4 bytes from the input data. */ > result ^= *data32++; > data_len -= 4; > > if (data_len >= 2) { > uint16_t *data16 = (uint16_t *)data32; > > /* Read and accumulate the next 2 bytes from the input > data. */ > result ^= *data16++ > data_len -= 2; > > if (data_len) { > uint8_t *data8 = (uint8_t *)data8; > > /* Read and accumulate the next byte from the > input data. */ > result ^= *data8; > } > } > } > > /* Accumulate the upper 32 bits on top of the lower 32 bits. */ > result ^= result >> 32; > > /* Single endianness swap at the very end. */ > return rte_cpu_to_be32((uint32_t)result); > > What do you think? > > Regards, > Cristian > >