Subject: Re: [PATCH v4] net: fix checksum with unaligned buffer
From: Mattias Rönnblom
To: Morten Brørup <mb@smartsharesystems.com>, Emil Berg <emil.berg@ericsson.com>, bruce.richardson@intel.com, dev@dpdk.org
Cc: stephen@networkplumber.org, stable@dpdk.org, bugzilla@dpdk.org, olivier.matz@6wind.com
Date: Mon, 27 Jun 2022 19:22:38 +0200
Message-ID: <8b28786c-cf78-68e2-7022-1c68a8d8d119@lysator.liu.se>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87187@smartserver.smartshare.dk>

On 2022-06-27 15:22, Morten Brørup wrote:
>> From: Emil Berg [mailto:emil.berg@ericsson.com]
>> Sent: Monday, 27 June 2022 14.51
>>
>>> From: Emil Berg
>>> Sent: den 27 juni 2022 14:46
>>>
>>>> From: Mattias Rönnblom
>>>> Sent: den 27 juni 2022 14:28
>>>>
>>>> On 2022-06-23 14:51, Morten Brørup wrote:
>>>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
>>>>>> Sent: Thursday, 23 June 2022 14.39
>>>>>>
>>>>>> With this patch, the checksum can be calculated on an unaligned buffer.
>>>>>> I.e. the buf parameter is no longer required to be 16 bit aligned.
>>>>>>
>>>>>> The checksum is still calculated using a 16 bit aligned pointer, so
>>>>>> the compiler can auto-vectorize the function's inner loop.
>>>>>>
>>>>>> When the buffer is unaligned, the first byte of the buffer is
>>>>>> handled separately. Furthermore, the calculated checksum of the
>>>>>> buffer is byte shifted before being added to the initial checksum,
>>>>>> to compensate for the checksum having been calculated on the buffer
>>>>>> shifted by one byte.
>>>>>>
>>>>>> v4:
>>>>>> * Add copyright notice.
>>>>>> * Include stdbool.h (Emil Berg).
>>>>>> * Use RTE_PTR_ADD (Emil Berg).
>>>>>> * Fix one more typo in commit message. Is 'unligned' even a word?
>>>>>> v3:
>>>>>> * Remove braces from single statement block.
>>>>>> * Fix typo in commit message.
>>>>>> v2:
>>>>>> * Do not assume that the buffer is part of an aligned packet buffer.
>>>>>>
>>>>>> Bugzilla ID: 1035
>>>>>> Cc: stable@dpdk.org
>>>>>>
>>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>>>>>> ---
>>>>>>  lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
>>>>>>  1 file changed, 27 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
>>>>>> index b502481670..738d643da0 100644
>>>>>> --- a/lib/net/rte_ip.h
>>>>>> +++ b/lib/net/rte_ip.h
>>>>>> @@ -3,6 +3,7 @@
>>>>>>   * The Regents of the University of California.
>>>>>>   * Copyright(c) 2010-2014 Intel Corporation.
>>>>>>   * Copyright(c) 2014 6WIND S.A.
>>>>>> + * Copyright(c) 2022 SmartShare Systems.
>>>>>>   * All rights reserved.
>>>>>>   */
>>>>>>
>>>>>> @@ -15,6 +16,7 @@
>>>>>>   * IP-related defines
>>>>>>   */
>>>>>>
>>>>>> +#include <stdbool.h>
>>>>>>  #include <stdint.h>
>>>>>>
>>>>>>  #ifdef RTE_EXEC_ENV_WINDOWS
>>>>>> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len,
>>>>>> uint32_t sum)
>>>>>>  {
>>>>>>  	/* extend strict-aliasing rules */
>>>>>>  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
>>>>>> -	const u16_p *u16_buf = (const u16_p *)buf;
>>>>>> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
>>>>>> +	const u16_p *u16_buf;
>>>>>> +	const u16_p *end;
>>>>>> +	uint32_t bsum = 0;
>>>>>> +	const bool unaligned = (uintptr_t)buf & 1;
>>>>>> +
>>>>>> +	/* if buffer is unaligned, keeping it byte order independent */
>>>>>> +	if (unlikely(unaligned)) {
>>>>>> +		uint16_t first = 0;
>>>>>> +		if (unlikely(len == 0))
>>>>>> +			return 0;
>>>>>> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
>>>>>> +		bsum += first;
>>>>>> +		buf = RTE_PTR_ADD(buf, 1);
>>>>>> +		len--;
>>>>>> +	}
>>>>>>
>>>>>> +	/* aligned access for compiler auto-vectorization */
>>>>
>>>> The compiler will be able to auto vectorize even unaligned accesses,
>>>> just with different instructions. From what I can tell, there's no
>>>> performance impact, at least not on the x86_64 systems I tried on.
>>>>
>>>> I think you should remove the first special case conditional and use
>>>> memcpy() instead of the cumbersome __may_alias__ construct to
>>>> retrieve the data.
>>>>
>>>
>>> Here:
>>> https://www.agner.org/optimize/instruction_tables.pdf
>>> it lists the latency of vmovdqa (aligned) as 6 cycles and the latency
>>> for vmovdqu (unaligned) as 7 cycles. So I guess there can be some
>>> difference. Although in practice I'm not sure what difference it
>>> makes. I've not seen any difference in runtime between the two
>>> versions.
>>>
>>
>> Correction to my comment:
>> Those stats are for some older CPU. For some newer CPUs such as Tiger
>> Lake the stats seem to be the same regardless of aligned or unaligned.
>>
>
> I agree that the memcpy method is more elegant and easy to read.
>
> However, we would need to performance test the modified checksum
> function with a large number of CPUs to prove that we don't introduce a
> performance regression on any CPU architecture still supported by DPDK.
> And Emil already found a CPU where it costs 1 extra cycle per 16 bytes,
> which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP
> packet.
>

I think you've misunderstood what latency means in such tables. It's a
data dependency thing, not a measure of throughput. The throughput is
*much* higher. My guess would be two such instructions per clock.

For your 1460 byte example, my Zen 3 AMD machine performs identically
with the current DPDK implementation, your patch, and a memcpy()-ified
version of the current implementation. They all need ~130 clock
cycles/packet, with warm caches. IPC is 3 instructions per cycle, but
obviously not all instructions are SIMD.

The main issue with checksumming on the CPU is, in my experience, not
that you don't have enough compute, but that you trash the caches.
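As a back-of-envelope sanity check (assuming 32-byte AVX2 vector loads
and the two-loads-per-clock guess above): 1460 bytes is ~46 vector
loads, i.e. ~23 cycles' worth of load throughput. The 6-7 cycle latency
would only accumulate if each load depended on the result of the
previous one, and the loads in a checksum loop don't.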
> So I opted for a solution with zero changes to the inner loop, so no
> performance retesting is required (for the previously supported use
> cases, where the buffer is aligned).
>

You will see performance degradation with this solution as well, under
certain conditions. For 100 bytes of unaligned data, the current DPDK
implementation and the memcpy()-ified version need ~21 cc/packet. Your
patch needs 54 cc/packet.

But the old version didn't support unaligned accesses? In many compiler
flag/machine combinations it did.
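For reference, the memcpy()-ified version I benchmarked is roughly the
following (a sketch of the idea, not the exact code; it keeps the
current odd-byte handling and only swaps the __may_alias__ loads for
memcpy()):

#include <stdint.h>
#include <string.h>

static inline uint32_t
__rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *b = buf;
	uint16_t v;

	/* the compiler auto-vectorizes this loop; memcpy() sidesteps
	 * both the alignment requirement and the aliasing cast */
	while (len >= sizeof(v)) {
		memcpy(&v, b, sizeof(v));
		sum += v;
		b += sizeof(v);
		len -= sizeof(v);
	}

	/* if length is odd, keep it byte order independent */
	if (len == 1) {
		uint16_t left = 0;
		*(unsigned char *)&left = *b;
		sum += left;
	}

	return sum;
}

With optimizations enabled, the fixed-size memcpy() compiles down to a
plain (potentially unaligned) 16-bit load, so there is no function call
or byte-by-byte copying in the generated code.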
> I have previously submitted a couple of patches to fix some minor bugs
> in the mempool cache functions [1] and [2], while also refactoring the
> functions for readability. But after having incorporated various
> feedback, nobody wants to proceed with the patches, probably due to
> fear of performance regressions. I didn't want to risk the same with
> this patch.
>
> [1] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D8712B@smartserver.smartshare.dk/
> [2] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D86FBB@smartserver.smartshare.dk/
>