From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 2104DA0543 for ; Thu, 7 Jul 2022 17:21:55 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 169764069D; Thu, 7 Jul 2022 17:21:55 +0200 (CEST) Received: from mail-lj1-f172.google.com (mail-lj1-f172.google.com [209.85.208.172]) by mails.dpdk.org (Postfix) with ESMTP id 2550F4069D for ; Thu, 7 Jul 2022 17:21:54 +0200 (CEST) Received: by mail-lj1-f172.google.com with SMTP id c15so22717911ljr.0 for ; Thu, 07 Jul 2022 08:21:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=semihalf.com; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=BiExFZoTZwQtLovi1lIz9okMLXGfoZHqJN4SOB9VZiw=; b=izOpUhrKV/5Jr1IQsqHN2Xvrf0AsJSlJplvHwYP+2YKpIEv/SCOYFXuxGqz+MaxsR8 4BDzUwJR0/6c16t+Jnk+/OfbRYsiZoy36RHvsncPSIlfjE9dqlROCDOndjc7L2ayw+PB uLLSlu8bh4e5MuDY46Zcg/5mc6HAzpWQzpX5c4V6Hu9rBzUtJWTr0yvZGFJfp5J8Gv7W DpVX0x/gVwVZXQTTRuopWQJqYdgRppHRiLnjoW/7eKSZUPg3qcPri7dP6APoohqyOUMS EqoqfCZyqWiDHxPgyAxpUzJdINuRoqTrxgkaiV9SmS5ZtnNp879Rjqzpzzyc6UzZxcBc /D9A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=BiExFZoTZwQtLovi1lIz9okMLXGfoZHqJN4SOB9VZiw=; b=A3/VOt95FQMKJ/RBAWMwo+s/LDYq8PU0r0AS/NZevQE+7Nc2FcHrrQ3b1V2wiWvWI5 M5rk790mqyf0UL2Mro60H2dkLsU76UNJ3Xy6HnRh8sEZIdnjxSwjFvBFzoRTWzC14152 Nc0SDoLpzVGMpjdlHCwyvNytyFx1ylAnkaSYhp/x8ZgbpI6/Br/Q+UVmfqNuF5NuMgfz 4/Tmx41ofAV3BhlWNBlgPq2N+06j9M8hV2pRhHpHImkJm3v3vCuQmN6MbxgYJ/BhBNIV jL2nT0nn4iTC84xIGssmFUtkOx91ySlxSyczx+0i6Y+M7IMPEyGkTjRFBllvWWrbzjMN bSlg== X-Gm-Message-State: AJIora8IAAmV53e13JXq48I/TcyMiNl2QuIgrZp6ueuhRXssp0dMFqkL fASDCY5hNcZOeQF8YaUkEeINOzI+s4xYQyWDQbx8Uw== X-Google-Smtp-Source: AGRyM1thFXlk3CcpwpIG2I1TAr2TdK2IugHPZtiXGyxpmw0DJTA5sk6JMmaDn5xSRUjqqpPanDxbmUSt2LNr8vONNo8= X-Received: by 2002:a2e:a7cc:0:b0:25d:4756:6f95 with SMTP id x12-20020a2ea7cc000000b0025d47566f95mr5868833ljp.271.1657207313526; Thu, 07 Jul 2022 08:21:53 -0700 (PDT) MIME-Version: 1.0 References: <98CBD80474FA8B44BF855DF32C47DC35D87139@smartserver.smartshare.dk> <20220623123900.38283-1-mb@smartsharesystems.com> <98CBD80474FA8B44BF855DF32C47DC35D87169@smartserver.smartshare.dk> <9f543fc0-ae8a-067b-d13f-38a0503dd619@lysator.liu.se> <98CBD80474FA8B44BF855DF32C47DC35D87187@smartserver.smartshare.dk> <8b28786c-cf78-68e2-7022-1c68a8d8d119@lysator.liu.se> <98CBD80474FA8B44BF855DF32C47DC35D87189@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35D8719B@smartserver.smartshare.dk> In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D8719B@smartserver.smartshare.dk> From: =?UTF-8?Q?Stanis=C5=82aw_Kardach?= Date: Thu, 7 Jul 2022 17:21:17 +0200 Message-ID: Subject: Re: [PATCH v4] net: fix checksum with unaligned buffer To: =?UTF-8?Q?Morten_Br=C3=B8rup?= Cc: =?UTF-8?Q?Mattias_R=C3=B6nnblom?= , Emil Berg , Bruce Richardson , dev , Stephen Hemminger , dpdk stable , bugzilla@dpdk.org, Olivier Matz Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-BeenThere: stable@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: patches for DPDK stable branches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: stable-bounces@dpdk.org On Thu, Jun 30, 2022 at 6:32 PM Morten Br=C3=B8rup wrote: > > > From: Mattias R=C3=B6nnblom [mailto:hofors@lysator.liu.se] > > Sent: Tuesday, 28 June 2022 08.28 > > > > On 2022-06-27 22:21, Morten Br=C3=B8rup wrote: > > >> From: Mattias R=C3=B6nnblom [mailto:hofors@lysator.liu.se] > > >> Sent: Monday, 27 June 2022 19.23 > > >> > > >> On 2022-06-27 15:22, Morten Br=C3=B8rup wrote: > > >>>> From: Emil Berg [mailto:emil.berg@ericsson.com] > > >>>> Sent: Monday, 27 June 2022 14.51 > > >>>> > > >>>>> From: Emil Berg > > >>>>> Sent: den 27 juni 2022 14:46 > > >>>>> > > >>>>>> From: Mattias R=C3=B6nnblom > > >>>>>> Sent: den 27 juni 2022 14:28 > > >>>>>> > > >>>>>> On 2022-06-23 14:51, Morten Br=C3=B8rup wrote: > > >>>>>>>> From: Morten Br=C3=B8rup [mailto:mb@smartsharesystems.com] > > >>>>>>>> Sent: Thursday, 23 June 2022 14.39 > > >>>>>>>> > > >>>>>>>> With this patch, the checksum can be calculated on an > > unaligned > > >>>> buffer. > > >>>>>>>> I.e. the buf parameter is no longer required to be 16 bit > > >>>> aligned. > > >>>>>>>> > > >>>>>>>> The checksum is still calculated using a 16 bit aligned > > pointer, > > >>>> so > > >>>>>>>> the compiler can auto-vectorize the function's inner loop. > > >>>>>>>> > > >>>>>>>> When the buffer is unaligned, the first byte of the buffer is > > >>>>>>>> handled separately. Furthermore, the calculated checksum of > > the > > >>>>>>>> buffer is byte shifted before being added to the initial > > >>>> checksum, > > >>>>>>>> to compensate for the checksum having been calculated on the > > >>>> buffer > > >>>>>>>> shifted by one byte. > > >>>>>>>> > > >>>>>>>> v4: > > >>>>>>>> * Add copyright notice. > > >>>>>>>> * Include stdbool.h (Emil Berg). > > >>>>>>>> * Use RTE_PTR_ADD (Emil Berg). > > >>>>>>>> * Fix one more typo in commit message. Is 'unligned' even a > > >>>> word? > > >>>>>>>> v3: > > >>>>>>>> * Remove braces from single statement block. > > >>>>>>>> * Fix typo in commit message. > > >>>>>>>> v2: > > >>>>>>>> * Do not assume that the buffer is part of an aligned packet > > >>>> buffer. > > >>>>>>>> > > >>>>>>>> Bugzilla ID: 1035 > > >>>>>>>> Cc: stable@dpdk.org > > >>>>>>>> > > >>>>>>>> Signed-off-by: Morten Br=C3=B8rup > > > > > > [...] > > > > > >>>>>> > > >>>>>> The compiler will be able to auto vectorize even unaligned > > >>>> accesses, > > >>>>>> just with different instructions. From what I can tell, there's > > no > > >>>>>> performance impact, at least not on the x86_64 systems I tried > > on. > > >>>>>> > > >>>>>> I think you should remove the first special case conditional and > > >>>> use > > >>>>>> memcpy() instead of the cumbersome __may_alias__ construct to > > >>>> retrieve > > >>>>>> the data. > > >>>>>> > > >>>>> > > >>>>> Here: > > >>>>> https://www.agner.org/optimize/instruction_tables.pdf > > >>>>> it lists the latency of vmovdqa (aligned) as 6 cycles and the > > >> latency > > >>>> for > > >>>>> vmovdqu (unaligned) as 7 cycles. So I guess there can be some > > >>>> difference. > > >>>>> Although in practice I'm not sure what difference it makes. I've > > >> not > > >>>> seen any > > >>>>> difference in runtime between the two versions. > > >>>>> > > >>>> > > >>>> Correction to my comment: > > >>>> Those stats are for some older CPU. For some newer CPUs such as > > >> Tiger > > >>>> Lake the stats seem to be the same regardless of aligned or > > >> unaligned. > > >>>> > > >>> > > >>> I agree that the memcpy method is more elegant and easy to read. > > >>> > > >>> However, we would need to performance test the modified checksum > > >> function with a large number of CPUs to prove that we don't > > introduce a > > >> performance regression on any CPU architecture still supported by > > DPDK. > > >> And Emil already found a CPU where it costs 1 extra cycle per 16 > > bytes, > > >> which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP > > >> packet. > > >>> > > >> > > >> I think you've misunderstood what latency means in such tables. It's > > a > > >> data dependency thing, not a measure of throughput. The throughput > > is > > >> *much* higher. My guess would be two such instruction per clock. > > >> > > >> For your 1460 bytes example, my Zen3 AMD needs performs identical > > with > > >> both the current DPDK implementation, your patch, and a memcpy()- > > ified > > >> version of the current implementation. They all need ~130 clock > > >> cycles/packet, with warm caches. IPC is 3 instructions per cycle, > > but > > >> obvious not all instructions are SIMD. > > > > > > You're right, I wasn't thinking deeper about it before extrapolating. > > > > > > Great to see some real numbers! I wish someone would do the same > > testing on an old ARM CPU, so we could also see the other end of the > > scale. > > > > > > > I've ran it on an ARM A72. For the aligned 1460 bytes case I got: > > Current DPDK ~572 cc. Your patch: ~578 cc. Memcpy-fied: ~573 cc. They > > performed about the same for all unaligned/aligned and sizes I tested. > > This platform (or could be GCC version as well) doesn't suffer from the > > unaligned performance degradation your patch showed on my AMD machine. > > > > >> The main issue with checksumming on the CPU is, in my experience, > > not > > >> that you don't have enough compute, but that you trash the caches. > > > > > > Agree. I have noticed that x86 has "non-temporal" instruction > > variants to load/store data without trashing the cache entirely. > > > > > > A variant of the checksum function using such instructions might be > > handy. > > > > > > > Yes, although you may need to prefetch the payload for good > > performance. > > > > > Variants of the memcpy function using such instructions might also be > > handy for some purposes, e.g. copying the contents of packets, where > > the original and/or copy will not accessed shortly thereafter. > > > > > > > Indeed and I think it's been discussed on the list. There's some work > > to > > get it right, since alignment requirement and the fact a different > > memory model is used for those SIMD instructions causes trouble for a > > generic implementation. (For x86_64.) > > I just posted an RFC [1] for such memcpy() and memset() functions, > so let's see how it fans out. > > [1] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87195@smar= tserver.smartshare.dk/T/#u > > > > > >>> So I opted for a solution with zero changes to the inner loop, so > > no > > >> performance retesting is required (for the previously supported use > > >> cases, where the buffer is aligned). > > >>> > > >> > > >> You will see performance degradation with this solution as well, > > under > > >> certain conditions. For unaligned 100 bytes of data, the current > > DPDK > > >> implementation and the memcpy()-fied version needs ~21 cc/packet. > > Your > > >> patch needs 54 cc/packet. > > > > > > Yes, it's a tradeoff. I exclusively aimed at maintaining performance > > for the case with aligned buffers (under all circumstances, with all > > CPUs etc.), and ignored how it affects the performance for the case > > with unaligned buffers. > > > > > > Unlike this patch, the memcpy() variant has no additional branches > > for the unaligned case, so its performance should be generally > > unaffected by the buffer being aligned or not. However, I don't have > > sufficient in-depth CPU knowledge to say if this also applies to RISCV > > and older ARM CPUs still supported by DPDK. > > > > > > > I don't think avoiding RISCV non-catastrophic regressions triumphs > > improving performance on mainstream CPUs and avoiding code quality > > regressions. > +1 +1. In general RISC-V spec leaves the unaligned load/store handling to implementation (it might fault, it might not). The U74 core that I have at hand allows unaligned reads/writes. Though it's not a platform for performance evaluation (time measurement causes a trap to firmware), so I won't say anything on that. -- Best Regards, Stanis=C5=82aw Kardach