From: Morten Brørup
To: Aditya Ambadipudi
Subject: RE: Question about loop unrolling in rte_ring datastructure.
Date: Tue, 14 Nov 2023 09:26:59 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F01E@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

> From: Aditya Ambadipudi [mailto:Aditya.Ambadipudi@arm.com]
> Sent: Monday, 13 November 2023 19.15
>
> Hello all.
>
> My name is Aditya Ambadipudi. I am not the sharpest tool in the shed.
>
> I was reading through the rte_ring datastructure. And I have two questions about the optimizations that are being made there.
>
> 1. Loop unrolling:
> https://github.com/DPDK/dpdk/blob/main/lib/ring/rte_ring_elem_pvt.h#L28-L35
> Why are we unrolling these loops manually? GCC will generate SIMD instructions for these loops automatically, irrespective of whether or not we unroll the loops.
> Unrolled loop: https://godbolt.org/z/n97noqYn7
> Regular loop: https://godbolt.org/z/h6G9o9773

You should make "int count" a function parameter, to make it unknown at compile time.
In your examples, the compiler knows that count is 100, and can optimize for that.

>
> This is true of both x86 and ARM.

Much of the code in DPDK is quite old, dating back to a time when compilers were not good at optimizing, so the developers optimized by hand. Some of the optimizations, such as manual loop unrolling, may no longer be relevant, since modern compilers might do a better job. I don't know if this is the case here.

Experiments and suggestions for improvements are welcome! After building DPDK, you can test the ring library performance with this command:

./app/test/dpdk-test --no-huge ring_perf_autotest

>
> 2. Normalizing to few fixed types:
>
> It looks like we separate out enqueue/dequeue operations into 3 functions. One for each element size 32, 64, 128.
>
> Again I am not clear on why we are doing this. Both 128 & 64 are multiples of 32. Why can't we just normalize everything to 32?
>
> I feel like this is in some shape or form related to loop unrolling. But I am not able to figure it out on my own.

They are optimizations for some common use cases.

Please note that the "esize" parameter is known at compile time, so for e.g. __rte_ring_dequeue_elems() the compiler will choose __rte_ring_dequeue_elems_64(), __rte_ring_dequeue_elems_128() or the 32-bit loop at build time, and omit the alternatives.

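
A minimal sketch of this kind of compile-time selection follows. It is not the actual DPDK code: the names copy_elems(), copy_elems_32() and copy_elems_64() are hypothetical, and passing esize in bytes is an assumption made for the illustration. When the caller passes a literal esize and the functions are inlined, the compiler can be expected to keep only the branch that applies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n 32-bit words, one word per iteration. */
static inline void
copy_elems_32(uint32_t *dst, const uint32_t *src, uint32_t n)
{
	for (uint32_t i = 0; i < n; i++)
		dst[i] = src[i];
}

/* Hypothetical helper: copy n 64-bit elements, one element per iteration. */
static inline void
copy_elems_64(uint64_t *dst, const uint64_t *src, uint32_t n)
{
	for (uint32_t i = 0; i < n; i++)
		dst[i] = src[i];
}

/*
 * Hypothetical dispatcher. esize is the element size in bytes (an
 * assumption of this sketch). If the caller passes a constant, the
 * comparison below can be resolved at build time after inlining.
 */
static inline void
copy_elems(void *dst, const void *src, uint32_t esize, uint32_t n)
{
	/* The sketch only handles element sizes that are whole 32-bit words. */
	assert(esize != 0 && esize % sizeof(uint32_t) == 0);

	if (esize == 8) {
		/* Buffers are assumed to be 8-byte aligned for this path. */
		copy_elems_64(dst, src, n);
	} else {
		/* Fall back to 32-bit words: n elements of esize bytes each. */
		uint32_t words = n * (esize / (uint32_t)sizeof(uint32_t));
		copy_elems_32(dst, src, words);
	}
}
```

With a call such as copy_elems(dst, src, 8, n), only the 64-bit path is needed, so an optimizing compiler is free to drop the 32-bit fallback from the generated code.
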
If nothing else, they tell the compiler that "count" (number of 32-bit elements) is divisible by 4 (for 128-bit element size) or 2 (for 64-bit element size): Some CPU architectures have 64 and/or 128-bit registers, so a copy loop using those registers does not need to be followed by a trailing copy of any remaining 32-bit memory when the element size is known to be 64 or 128 bit.

>
> I am working on a patch that is closely related to this. And I would greatly appreciate any assistance anyone can provide on this.
>
> Thank you,
> Aditya Ambadipudi
> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

It does not remain confidential when you post it on a public mailing list. Please omit this footer when posting to the DPDK mailing lists.

Med venlig hilsen / Kind regards,
-Morten Brørup