Hi Cristian,

I agree the 64-bit version could enable better performance, and I will do it in the next version.

For endianness, see my comments below (inline):

On Mon, Feb 20, 2023 at 12:19 PM Dumitrescu, Cristian <cristian.dumitrescu@intel.com> wrote:
HI Bili,

Comments inline below:

<snip>

> diff --git a/lib/hash/rte_hash_xor.h b/lib/hash/rte_hash_xor.h
> new file mode 100644
> index 0000000000..7004f83ec2
> --- /dev/null
> +++ b/lib/hash/rte_hash_xor.h
> @@ -0,0 +1,65 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Intel Corporation
> + */
> +
> +#ifndef _RTE_HASH_XOR_H_
> +#define _RTE_HASH_XOR_H_
> +
> +/**
> + * @file
> + *
> + * RTE XOR Hash
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <stdint.h>
> +
> +#include <rte_byteorder.h>
> +
> +#define LEFT8b_MASK rte_cpu_to_be_32(0xff000000)

I know that cpu_to_be() and be_to_cpu() are doing the same thing, but to me the correct function to use here is be_to_cpu(), as the for loop in the function below works with values in the CPU endianness. Would you agree?
What I have in mind is the 0xff000000 literal is in CPU endianness, and I'm converting it to big-endian. For performance reasons, I think it's better to work in big-endian in the loop.
Also as Vladimir suggested above, I'll remove these masks and switch to using shifts directly. So I guess this won't matter anymore.

> +#define LEFT16b_MASK rte_cpu_to_be_32(0xffff0000)

I know that cpu_to_be() and be_to_cpu() are doing the same thing, but to me the correct function to use here is be_to_cpu(), as the for loop in the function below works with values in the CPU endianness. Would you agree?

> +
> +/**
> + * Calculate XOR32 hash on user-supplied byte array.
> + *
> + * @param data
> + *   Data to perform hash on.
> + * @param data_len
> + *   How many bytes to use to calculate hash value.
> + * @param init_val
> + *   Value to initialise hash generator.
> + * @return
> + *   32bit calculated hash value.
> + */
> +static inline uint32_t
> +rte_hash_xor(const void *data, uint32_t data_len, uint32_t init_val)
> +{
> +     uint32_t i;
> +     uintptr_t pd = (uintptr_t) data;
> +     init_val = rte_cpu_to_be_32(init_val);
I don't think this is correct, I the correct version here is to remove the above assignment, as I think the intention of this function (for performance reasons) is to do the endianness conversion only when needed, which is once at the end of the function (and also when handling the 2-byte and 1-byte left-overs).
What I have in mind is to convert init_val to big-endian once here, instead of having to convert every byte array chunks from big-endian to host endian in the loop (more conversions, worse performance, see comment below).

> +
> +     for (i = 0; i < data_len / 4; i++) {
> +             init_val ^= *(const uint32_t *)pd;
Look at this line, if init_val is still in host endianness, we need to do init_val ^= rte_be_to_cpu_32(*(const uint32_t *)pd) instead.
I think the essential tradeoff here is that, if we convert init_val from host to be and back, we pay the constant cost of 2 conversions. The alternative is to convert byte array chunks from be to host. The number of conversions depends on data_len. Now I think of it more, maybe the best option is to condition on the value of data_len, and choose the method with fewer conversions.
> +             pd += 4;
> +     }
> +
> +     if (data_len & 0x2) {
> +             init_val ^= *(const uint32_t *)pd & LEFT16b_MASK;
> +             pd += 2;
> +     }
> +
> +     if (data_len & 0x1)
> +             init_val ^= *(const uint32_t *)pd & LEFT8b_MASK;
> +
> +     init_val = rte_be_to_cpu_32(init_val);
> +     return init_val;
> +}
> +

Due to the XOR properties (endianness-insensitivity, no carry bits, etc) and for performance reasons, I would also recommend implementing a 64-bit version of this function (while keeping the 32-bit result), similar to this:

uint64_t *data64 = (uint64_t *)data;
uint64_t result = init_data;

/* Read & accumulate input data in 8-byte increments. */
for (i = 0; i < data_len / 8; i++)
        result ^= *data64++;

data_len &= 0x7;

/* Handle any remaining bytes in the input data (up to 7 bytes). */
if (data_len >= 4) {
        uint32_t *data32 = (uint32_t *)data64;

        /* Read and accumulate  the next 4 bytes from the input data. */
        result ^= *data32++;
        data_len -= 4;

        if (data_len >= 2) {
                uint16_t *data16 = (uint16_t *)data32;

                /* Read and accumulate the next 2 bytes from the input data. */
                result ^= *data16++
                data_len -= 2;

                if (data_len) {
                        uint8_t *data8 = (uint8_t *)data8;

                        /* Read and accumulate the next byte from the input data. */
                        result ^= *data8;
                }
        }
}

/* Accumulate the upper 32 bits on top of the lower 32 bits. */
result ^= result >> 32;

/* Single endianness swap at the very end. */
return rte_cpu_to_be32((uint32_t)result);

What do you think?

Regards,
Cristian