From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 4C59041CF0; Mon, 20 Feb 2023 21:44:46 +0100 (CET) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id CEE8543098; Mon, 20 Feb 2023 21:44:45 +0100 (CET) Received: from mail-ed1-f51.google.com (mail-ed1-f51.google.com [209.85.208.51]) by mails.dpdk.org (Postfix) with ESMTP id BDDB240A7A for ; Mon, 20 Feb 2023 21:44:43 +0100 (CET) Received: by mail-ed1-f51.google.com with SMTP id da10so10451104edb.3 for ; Mon, 20 Feb 2023 12:44:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=AkRPawYsjuP099B9r3b2lGgolWy+zn0je4EVFSTqluo=; b=FPJGNaQWDYSjcnQMgNShGZdYCnNbed937vc35Py7hF7LOfxgE7vIwTNmncByWjImv9 goiGaN/b1BdX+yPLTnl+SrKPEi0H94/7rMuSTteawyKwHEYxboFg/BfmZVhX92hWjuie Fy05JfL7pWraNlCSPgUW/ZRv47id6yR8M4HjVKilWJDBOlnSbt/zANNC8nFsco60xomX 11dNjIjhaCwFMi65osLhChzwudH6KjiFCRM6fyI5zwElYKSfhvLgjrPPPMV5BN67RvzD GpGmpggBzsV6xPij3d3IAqp3rWjOD1jy1fKVFK4cdCh8oIVv5lItLoTmgf7+KtJZHHpC mUAg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=AkRPawYsjuP099B9r3b2lGgolWy+zn0je4EVFSTqluo=; b=I583dakp0BRMAgux1h8Bi58htph80LxfdMz92lCcC0M69w6u98OvmIm5uCR6p/4oQn wbUJBQADngGIU6vBLYzC3gIfS6klTfXHLV1FvDgOkMOxLIhOIvT8gCqCB6F4yqoep2L6 NSnxojRmHhH7Qi9PDsh8SNWBBXR0S1OIitIi/YamSeERxmXDACC5T8qjJ4pNVed/Y+II tl5xriazXcLa08TznB2qbF/aBlYvf6pThxUUsieNb94cGvez9/j15HQZDbgvRxluxGEz wEYyi+q8WaZ82IrothCRWxGeSo/PSOr/J/JlAkDCMjy0sLHHCLDWs1sjA3hqATuJ81KP 9vVA== X-Gm-Message-State: AO0yUKUpkeP+20vj0/bKMnu8rHFtEIjIqhZp39smeFlQGjZ7XFqu5MGS AltSIQmou2un4mGafM4CFn6cqkEtn8ODmmfax1w= X-Google-Smtp-Source: AK7set9uOlujzsfhKGJpvXMPiZ6sAUJwpFfo4wg0XDmx9auT+VqSfsDVjzwIufvzfd9BjV0YQ9cU2aEtqt9xn3BToSQ= X-Received: by 2002:a17:906:76d0:b0:8de:e66a:d788 with SMTP id q16-20020a17090676d000b008dee66ad788mr2947ejn.7.1676925883346; Mon, 20 Feb 2023 12:44:43 -0800 (PST) MIME-Version: 1.0 References: <20230215105442.3878441-1-qobilidop@gmail.com> <20230215110630.3885175-1-qobilidop@gmail.com> In-Reply-To: From: Bili Dong Date: Mon, 20 Feb 2023 12:44:06 -0800 Message-ID: Subject: Re: [PATCH v3] hash: add XOR32 hash function To: "Dumitrescu, Cristian" Cc: "Richardson, Bruce" , "Gobriel, Sameh" , "thomas@monjalon.net" , "dev@dpdk.org" , "Wang, Yipeng1" , "Medvedkin, Vladimir" Content-Type: multipart/alternative; boundary="00000000000097748405f527bb45" X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org --00000000000097748405f527bb45 Content-Type: text/plain; charset="UTF-8" Hi Cristian, I agree the 64-bit version could enable better performance, and I will do it in the next version. For endianness, see my comments below (inline): On Mon, Feb 20, 2023 at 12:19 PM Dumitrescu, Cristian < cristian.dumitrescu@intel.com> wrote: > HI Bili, > > Comments inline below: > > > > > diff --git a/lib/hash/rte_hash_xor.h b/lib/hash/rte_hash_xor.h > > new file mode 100644 > > index 0000000000..7004f83ec2 > > --- /dev/null > > +++ b/lib/hash/rte_hash_xor.h > > @@ -0,0 +1,65 @@ > > +/* SPDX-License-Identifier: BSD-3-Clause > > + * Copyright(c) 2023 Intel Corporation > > + */ > > + > > +#ifndef _RTE_HASH_XOR_H_ > > +#define _RTE_HASH_XOR_H_ > > + > > +/** > > + * @file > > + * > > + * RTE XOR Hash > > + */ > > + > > +#ifdef __cplusplus > > +extern "C" { > > +#endif > > + > > +#include > > + > > +#include > > + > > +#define LEFT8b_MASK rte_cpu_to_be_32(0xff000000) > > I know that cpu_to_be() and be_to_cpu() are doing the same thing, but to > me the correct function to use here is be_to_cpu(), as the for loop in the > function below works with values in the CPU endianness. Would you agree? > What I have in mind is the 0xff000000 literal is in CPU endianness, and I'm converting it to big-endian. For performance reasons, I think it's better to work in big-endian in the loop. Also as Vladimir suggested above, I'll remove these masks and switch to using shifts directly. So I guess this won't matter anymore. > > > +#define LEFT16b_MASK rte_cpu_to_be_32(0xffff0000) > > I know that cpu_to_be() and be_to_cpu() are doing the same thing, but to > me the correct function to use here is be_to_cpu(), as the for loop in the > function below works with values in the CPU endianness. Would you agree? > > > + > > +/** > > + * Calculate XOR32 hash on user-supplied byte array. > > + * > > + * @param data > > + * Data to perform hash on. > > + * @param data_len > > + * How many bytes to use to calculate hash value. > > + * @param init_val > > + * Value to initialise hash generator. > > + * @return > > + * 32bit calculated hash value. > > + */ > > +static inline uint32_t > > +rte_hash_xor(const void *data, uint32_t data_len, uint32_t init_val) > > +{ > > + uint32_t i; > > + uintptr_t pd = (uintptr_t) data; > > + init_val = rte_cpu_to_be_32(init_val); > I don't think this is correct, I the correct version here is to remove the > above assignment, as I think the intention of this function (for > performance reasons) is to do the endianness conversion only when needed, > which is once at the end of the function (and also when handling the 2-byte > and 1-byte left-overs). > What I have in mind is to convert init_val to big-endian once here, instead of having to convert every byte array chunks from big-endian to host endian in the loop (more conversions, worse performance, see comment below). > > > + > > + for (i = 0; i < data_len / 4; i++) { > > + init_val ^= *(const uint32_t *)pd; > Look at this line, if init_val is still in host endianness, we need to do init_val ^= rte_be_to_cpu_32(*(const uint32_t *)pd) instead. I think the essential tradeoff here is that, if we convert init_val from host to be and back, we pay the constant cost of 2 conversions. The alternative is to convert byte array chunks from be to host. The number of conversions depends on data_len. Now I think of it more, maybe the best option is to condition on the value of data_len, and choose the method with fewer conversions. > > + pd += 4; > > + } > > + > > + if (data_len & 0x2) { > > + init_val ^= *(const uint32_t *)pd & LEFT16b_MASK; > > + pd += 2; > > + } > > + > > + if (data_len & 0x1) > > + init_val ^= *(const uint32_t *)pd & LEFT8b_MASK; > > + > > + init_val = rte_be_to_cpu_32(init_val); > > + return init_val; > > +} > > + > > Due to the XOR properties (endianness-insensitivity, no carry bits, etc) > and for performance reasons, I would also recommend implementing a 64-bit > version of this function (while keeping the 32-bit result), similar to this: > > uint64_t *data64 = (uint64_t *)data; > uint64_t result = init_data; > > /* Read & accumulate input data in 8-byte increments. */ > for (i = 0; i < data_len / 8; i++) > result ^= *data64++; > > data_len &= 0x7; > > /* Handle any remaining bytes in the input data (up to 7 bytes). */ > if (data_len >= 4) { > uint32_t *data32 = (uint32_t *)data64; > > /* Read and accumulate the next 4 bytes from the input data. */ > result ^= *data32++; > data_len -= 4; > > if (data_len >= 2) { > uint16_t *data16 = (uint16_t *)data32; > > /* Read and accumulate the next 2 bytes from the input > data. */ > result ^= *data16++ > data_len -= 2; > > if (data_len) { > uint8_t *data8 = (uint8_t *)data8; > > /* Read and accumulate the next byte from the > input data. */ > result ^= *data8; > } > } > } > > /* Accumulate the upper 32 bits on top of the lower 32 bits. */ > result ^= result >> 32; > > /* Single endianness swap at the very end. */ > return rte_cpu_to_be32((uint32_t)result); > > What do you think? > > Regards, > Cristian > > --00000000000097748405f527bb45 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi Cristian,

I agree t= he 64-bit version could enable better performance, and I will do it in the = next version.

For endianness, see my comments belo= w (inline):

On Mon, Feb 20, 2023 at 12:19 PM Dumitrescu, Cristian <= cristian.dumitrescu@intel.= com> wrote:
HI Bili,

Comments inline below:

<snip>

> diff --git a/lib/hash/rte_hash_xor.h b/lib/hash/rte_hash_xor.h
> new file mode 100644
> index 0000000000..7004f83ec2
> --- /dev/null
> +++ b/lib/hash/rte_hash_xor.h
> @@ -0,0 +1,65 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Intel Corporation
> + */
> +
> +#ifndef _RTE_HASH_XOR_H_
> +#define _RTE_HASH_XOR_H_
> +
> +/**
> + * @file
> + *
> + * RTE XOR Hash
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <stdint.h>
> +
> +#include <rte_byteorder.h>
> +
> +#define LEFT8b_MASK rte_cpu_to_be_32(0xff000000)

I know that cpu_to_be() and be_to_cpu() are doing the same thing, but to me= the correct function to use here is be_to_cpu(), as the for loop in the fu= nction below works with values in the CPU endianness. Would you agree?
<= /blockquote>
What I have in mind is the 0xff000000 literal is in CPU en= dianness, and I'm converting it to big-endian. For performance reasons,= I think it's better to work in big-endian in the loop.
Also = as=C2=A0Vladimir suggested above, I'll remove these masks and switch to= using shifts directly. So I guess this won't matter anymore.

> +#define LEFT16b_MASK rte_cpu_to_be_32(0xffff0000)

I know that cpu_to_be() and be_to_cpu() are doing the same thing, but to me= the correct function to use here is be_to_cpu(), as the for loop in the fu= nction below works with values in the CPU endianness. Would you agree?

> +
> +/**
> + * Calculate XOR32 hash on user-supplied byte array.
> + *
> + * @param data
> + *=C2=A0 =C2=A0Data to perform hash on.
> + * @param data_len
> + *=C2=A0 =C2=A0How many bytes to use to calculate hash value.
> + * @param init_val
> + *=C2=A0 =C2=A0Value to initialise hash generator.
> + * @return
> + *=C2=A0 =C2=A032bit calculated hash value.
> + */
> +static inline uint32_t
> +rte_hash_xor(const void *data, uint32_t data_len, uint32_t init_val)<= br> > +{
> +=C2=A0 =C2=A0 =C2=A0uint32_t i;
> +=C2=A0 =C2=A0 =C2=A0uintptr_t pd =3D (uintptr_t) data;
> +=C2=A0 =C2=A0 =C2=A0init_val =3D rte_cpu_to_be_32(init_val);
I don't think this is correct, I the correct version here is to remove = the above assignment, as I think the intention of this function (for perfor= mance reasons) is to do the endianness conversion only when needed, which i= s once at the end of the function (and also when handling the 2-byte and 1-= byte left-overs).
What I have in mind is to convert in= it_val to big-endian once here, instead of having to convert every byte=C2= =A0array=C2=A0chunks from=C2=A0big-endian to host endian in the loop (more = conversions, worse performance, see comment below).

> +
> +=C2=A0 =C2=A0 =C2=A0for (i =3D 0; i < data_len / 4; i++) {
> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0init_val ^=3D *(const= uint32_t *)pd;
Look at this line, if init_val is stil= l in host endianness, we need to do init_val ^=3D rte_be_to_cpu_32(*(const = uint32_t *)pd) instead.
I think the essential tradeoff here is th= at, if we convert init_val from host to be and back, we pay the constant co= st of 2 conversions. The alternative is to convert byte array chunks from b= e to host. The number of conversions depends on data_len. Now I think of it= more, maybe the best option is to condition on the value of data_len, and = choose the method with fewer conversions.
> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0pd +=3D 4;
> +=C2=A0 =C2=A0 =C2=A0}
> +
> +=C2=A0 =C2=A0 =C2=A0if (data_len & 0x2) {
> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0init_val ^=3D *(const= uint32_t *)pd & LEFT16b_MASK;
> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0pd +=3D 2;
> +=C2=A0 =C2=A0 =C2=A0}
> +
> +=C2=A0 =C2=A0 =C2=A0if (data_len & 0x1)
> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0init_val ^=3D *(const= uint32_t *)pd & LEFT8b_MASK;
> +
> +=C2=A0 =C2=A0 =C2=A0init_val =3D rte_be_to_cpu_32(init_val);
> +=C2=A0 =C2=A0 =C2=A0return init_val;
> +}
> +

Due to the XOR properties (endianness-insensitivity, no carry bits, etc) an= d for performance reasons, I would also recommend implementing a 64-bit ver= sion of this function (while keeping the 32-bit result), similar to this:
uint64_t *data64 =3D (uint64_t *)data;
uint64_t result =3D init_data;

/* Read & accumulate input data in 8-byte increments. */
for (i =3D 0; i < data_len / 8; i++)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 result ^=3D *data64++;

data_len &=3D 0x7;

/* Handle any remaining bytes in the input data (up to 7 bytes). */
if (data_len >=3D 4) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 uint32_t *data32 =3D (uint32_t *)data64;

=C2=A0 =C2=A0 =C2=A0 =C2=A0 /* Read and accumulate=C2=A0 the next 4 bytes f= rom the input data. */
=C2=A0 =C2=A0 =C2=A0 =C2=A0 result ^=3D *data32++;
=C2=A0 =C2=A0 =C2=A0 =C2=A0 data_len -=3D 4;

=C2=A0 =C2=A0 =C2=A0 =C2=A0 if (data_len >=3D 2) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 uint16_t *data16 = =3D (uint16_t *)data32;

=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 /* Read and accumul= ate the next 2 bytes from the input data. */
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 result ^=3D *data16= ++
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 data_len -=3D 2;
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 if (data_len) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 uint8_t *data8 =3D (uint8_t *)data8;

=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 /* Read and accumulate the next byte from the input data. */
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 result ^=3D *data8;
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 }
=C2=A0 =C2=A0 =C2=A0 =C2=A0 }
}

/* Accumulate the upper 32 bits on top of the lower 32 bits. */
result ^=3D result >> 32;

/* Single endianness swap at the very end. */
return rte_cpu_to_be32((uint32_t)result);

What do you think?

Regards,
Cristian

--00000000000097748405f527bb45--