On Thu, Oct 9, 2025 at 10:11 PM Bruce Richardson <bruce.richardson@intel.com> wrote:
On Wed, Jul 16, 2025 at 04:04:39PM +0530, Shreesh Adiga wrote:
> Replace the clearing of the lower 32 bits of the XMM register with a
> blend against a zero register.
> Replace the clearing of the upper 64 bits of the XMM register with
> _mm_move_epi64.
> Clang is able to optimize away the AND with a memory operand given the
> above sequence; GCC, however, still emits the AND with memory operands,
> which is eliminated explicitly here.
>
> Additionally, replace the 48-byte crc_xmm_shift_tab with the contents of
> the 32-byte shf_table, achieving the same functionality.
>
> Signed-off-by: Shreesh Adiga <16567adigashreesh@gmail.com>
> ---
>  lib/net/net_crc_sse.c | 30 +++++++-----------------------
>  1 file changed, 7 insertions(+), 23 deletions(-)
>

See inline below. The changes to reduce_64_to_32 look ok; I don't know
enough to fully understand the other changes you made. Maybe split the
patch into two so they can be reviewed and merged separately?

/Bruce

> diff --git a/lib/net/net_crc_sse.c b/lib/net/net_crc_sse.c
> index 112dc94ac1..eec854e587 100644
> --- a/lib/net/net_crc_sse.c
> +++ b/lib/net/net_crc_sse.c
> @@ -96,20 +96,13 @@ crcr32_reduce_128_to_64(__m128i data128, __m128i precomp)
>  static __rte_always_inline uint32_t
>  crcr32_reduce_64_to_32(__m128i data64, __m128i precomp)
>  {
> -     static const alignas(16) uint32_t mask1[4] = {
> -             0xffffffff, 0xffffffff, 0x00000000, 0x00000000
> -     };
> -
> -     static const alignas(16) uint32_t mask2[4] = {
> -             0x00000000, 0xffffffff, 0xffffffff, 0xffffffff
> -     };
>       __m128i tmp0, tmp1, tmp2;

> -     tmp0 = _mm_and_si128(data64, _mm_load_si128((const __m128i *)mask2));
> +     tmp0 = _mm_blend_epi16(_mm_setzero_si128(), data64, 252);

Minor nit: 252 would be better in hex to make it clearer that the lower
two bits are unset. Even better, how about switching the operands so
that the constant is just "3", which is clearer again.
Okay, I will update it with your suggestion in the next patch.
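
Concretely, the updated line in crcr32_reduce_64_to_32 would then look
something like this (just an illustrative sketch, not the exact final patch):

	/* take the two low 16-bit lanes (immediate bits 0-1) from the zero
	 * register, i.e. clear the low 32 bits of data64; equivalent to the
	 * current _mm_blend_epi16(_mm_setzero_si128(), data64, 0xFC) */
	tmp0 = _mm_blend_epi16(data64, _mm_setzero_si128(), 0x3);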

>       tmp1 = _mm_clmulepi64_si128(tmp0, precomp, 0x00);
>       tmp1 = _mm_xor_si128(tmp1, tmp0);
> -     tmp1 = _mm_and_si128(tmp1, _mm_load_si128((const __m128i *)mask1));
> +     tmp1 = _mm_move_epi64(tmp1);


This change LGTM.
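
For anyone reading along: _mm_move_epi64 copies the low 64 bits of its
argument and zeroes the upper 64, which is exactly what the AND with mask1
achieved. A minimal standalone check, purely illustrative (build with
-msse4.1 for _mm_extract_epi64):

#include <assert.h>
#include <emmintrin.h>	/* SSE2: _mm_move_epi64, _mm_and_si128 */
#include <smmintrin.h>	/* SSE4.1: _mm_extract_epi64 */

int main(void)
{
	const __m128i v = _mm_set_epi64x(0x1122334455667788LL,
					 0x0123456789abcdefLL);
	/* old approach: mask1 keeps only the low 64 bits */
	const __m128i mask1 = _mm_set_epi32(0, 0, -1, -1);
	const __m128i a = _mm_and_si128(v, mask1);
	/* new approach */
	const __m128i b = _mm_move_epi64(v);

	assert(_mm_extract_epi64(a, 0) == _mm_extract_epi64(b, 0));
	assert(_mm_extract_epi64(a, 1) == _mm_extract_epi64(b, 1));
	return 0;
}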

>       tmp2 = _mm_clmulepi64_si128(tmp1, precomp, 0x10);
>       tmp2 = _mm_xor_si128(tmp2, tmp1);
> @@ -118,13 +111,11 @@ crcr32_reduce_64_to_32(__m128i data64, __m128i precomp)
>       return _mm_extract_epi32(tmp2, 2);
>  }

> -static const alignas(16) uint8_t crc_xmm_shift_tab[48] = {
> -     0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
> -     0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
> +static const alignas(16) uint8_t crc_xmm_shift_tab[32] = {
> +     0x00, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,
> +     0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f,
>       0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
> -     0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
> -     0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
> -     0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
> +     0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
>  };


Can you perhaps explain how changing this table doesn't break existing uses
of the table as it now stands in the code? Specifically, does the
xmm_shift_left function not now have different behaviour?
Sure. crc_xmm_shift_tab is only used inside xmm_shift_left, which in turn is only
called when the total data_len is < 16. We call xmm_shift_left(fold, 8 - data_len)
when len <= 4 and xmm_shift_left(fold, 16 - data_len) when 5 <= len <= 15. This
results in accessing crc_xmm_shift_tab at indices between 1 and 31, i.e. element 0
is never accessed.
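
For reference, xmm_shift_left is essentially the following (paraphrased, so
treat the exact function attributes as approximate):

static inline __m128i
xmm_shift_left(__m128i reg, const unsigned int num)
{
	/* shuffle mask taken num bytes before the identity pattern that
	 * starts at crc_xmm_shift_tab[16] */
	const __m128i *p = (const __m128i *)(crc_xmm_shift_tab + 16 - num);

	return _mm_shuffle_epi8(reg, _mm_loadu_si128(p));
}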

Now, if we take the specific case of xmm_shift_left(fold, 10), then previously the shuffle
register would be loaded from (crc_xmm_shift_tab + 16 - 10), i.e. crc_xmm_shift_tab + 6:
{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05}.
Used with PSHUFB, this results in the first 10 bytes of reg becoming 0 and the lower 6
elements moving left: {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, d1, d2, d3, d4, d5, d6}. The zeroes are
inserted because PSHUFB produces 0 whenever the index byte has its MSB set (i.e. is > 0x7f).
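
A tiny standalone illustration of that PSHUFB property (illustrative only;
build with -mssse3):

#include <assert.h>
#include <tmmintrin.h>	/* SSSE3: _mm_shuffle_epi8 */

int main(void)
{
	const __m128i data = _mm_set1_epi8(0x5a);
	/* any control byte with its MSB set zeroes that destination byte,
	 * regardless of the low bits, so 0xff and 0x87 behave the same */
	const __m128i z1 = _mm_shuffle_epi8(data, _mm_set1_epi8((char)0xff));
	const __m128i z2 = _mm_shuffle_epi8(data, _mm_set1_epi8((char)0x87));

	assert(_mm_movemask_epi8(_mm_cmpeq_epi8(z1, z2)) == 0xffff);
	assert(_mm_movemask_epi8(_mm_cmpeq_epi8(z1, _mm_setzero_si128())) == 0xffff);
	return 0;
}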

Now that the contents of crc_xmm_shift_tab have been replaced with those of shf_table,
we will instead load
{0x86, 0x87, 0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05}
into the index register. Since the first 10 elements have their MSB set, PSHUFB again
produces the same vector as above, with the first 10 elements being zeroes and the 6 data
bytes moving left as intended. Because xmm_shift_left is only called with num between 1
and 11, we never access crc_xmm_shift_tab[0], and the remaining elements (0x81 through
0x8f) behave identically to 0xff under PSHUFB. Thus xmm_shift_left behaves identically
for every num value currently used in this file.
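
If it helps the review, here is a small standalone check (illustrative only,
not part of the patch; build with -mssse3) that the old 48-byte table and the
new 32-byte table produce identical shuffle results for the num values
actually used (1 through 11):

#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h>	/* SSSE3: _mm_shuffle_epi8 */

static const uint8_t old_tab[48] = {
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
	0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
};

static const uint8_t new_tab[32] = {
	0x00, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,
	0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f,
	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
	0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
};

static __m128i
shift_left(const uint8_t *tab, __m128i reg, unsigned int num)
{
	/* same addressing as xmm_shift_left: 16 bytes starting at tab + 16 - num */
	return _mm_shuffle_epi8(reg,
		_mm_loadu_si128((const __m128i *)(tab + 16 - num)));
}

int main(void)
{
	const __m128i fold = _mm_set_epi8(16, 15, 14, 13, 12, 11, 10, 9,
					  8, 7, 6, 5, 4, 3, 2, 1);
	uint8_t a[16], b[16];
	unsigned int num;

	for (num = 1; num <= 11; num++) {
		_mm_storeu_si128((__m128i *)a, shift_left(old_tab, fold, num));
		_mm_storeu_si128((__m128i *)b, shift_left(new_tab, fold, num));
		assert(memcmp(a, b, 16) == 0);
	}
	return 0;
}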
 

>  /**
> @@ -216,19 +207,12 @@ crc32_eth_calc_pclmulqdq(
>                       0x80808080, 0x80808080, 0x80808080, 0x80808080
>               };

> -             const alignas(16) uint8_t shf_table[32] = {
> -                     0x00, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,
> -                     0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f,
> -                     0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
> -                     0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
> -             };
> -
>               __m128i last16, a, b;

>               last16 = _mm_loadu_si128((const __m128i *)&data[data_len - 16]);

>               temp = _mm_loadu_si128((const __m128i *)
> -                     &shf_table[data_len & 15]);
> +                     &crc_xmm_shift_tab[data_len & 15]);
>               a = _mm_shuffle_epi8(fold, temp);

>               temp = _mm_xor_si128(temp,
> --
> 2.49.1
>