From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 6B77343325;
	Tue, 14 Nov 2023 09:30:30 +0100 (CET)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id DF4DB40697;
	Tue, 14 Nov 2023 09:27:02 +0100 (CET)
Received: from dkmailrelay1.smartsharesystems.com
 (smartserver.smartsharesystems.com [77.243.40.215])
 by mails.dpdk.org (Postfix) with ESMTP id 4BC354067B
 for <dev@dpdk.org>; Tue, 14 Nov 2023 09:27:01 +0100 (CET)
Received: from smartserver.smartsharesystems.com
 (smartserver.smartsharesys.local [192.168.4.10])
 by dkmailrelay1.smartsharesystems.com (Postfix) with ESMTP id 33EAC200F4;
 Tue, 14 Nov 2023 09:27:01 +0100 (CET)
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Subject: RE: Question about loop unrolling in rte_ring datastructure.
X-MimeOLE: Produced By Microsoft Exchange V6.5
Date: Tue, 14 Nov 2023 09:26:59 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F01E@smartserver.smartshare.dk>
In-Reply-To: <PAVPR08MB9185BCDC0DDF8CB6FB4AAD88EFB3A@PAVPR08MB9185.eurprd08.prod.outlook.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Question about loop unrolling in rte_ring datastructure.
Thread-Index: AQHaFloQHZPgl8xkUkuKMJgXq2GhnbB5cqGw
References: <PAVPR08MB9185BCDC0DDF8CB6FB4AAD88EFB3A@PAVPR08MB9185.eurprd08.prod.outlook.com>
From: =?iso-8859-1?Q?Morten_Br=F8rup?= <mb@smartsharesystems.com>
To: "Aditya Ambadipudi" <Aditya.Ambadipudi@arm.com>
Cc: <dev@dpdk.org>
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

> From: Aditya Ambadipudi [mailto:Aditya.Ambadipudi@arm.com]
> Sent: Monday, 13 November 2023 19.15
>
> Hello all.
>
> My name is Aditya Ambadipudi. I am not the sharpest tool in the shed.
>
> I was reading through the rte_ring data structure. And I have two
> questions about the optimizations that are being made there.
> 1. Loop unrolling:
> https://github.com/DPDK/dpdk/blob/main/lib/ring/rte_ring_elem_pvt.h#L28-L35
> Why are we unrolling these loops manually? GCC will generate SIMD
> instructions for these loops automatically, irrespective of whether or
> not we unroll the loops.
> Unrolled loop: https://godbolt.org/z/n97noqYn7
> Regular loop: https://godbolt.org/z/h6G9o9773

You should make "int count" a function parameter, to make it unknown at
compile time.
In your examples, the compiler knows that count is 100, and can optimize
for that.
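
A rough sketch of such a test case follows. The function names
copy_plain() and copy_unrolled() are made up for illustration, and the
unrolled variant only loosely follows the pattern in
rte_ring_elem_pvt.h; it is not the DPDK code.

```c
#include <stdint.h>

/* Hypothetical compiler-explorer test functions; not DPDK code.
 * "count" is a parameter, so the compiler cannot assume a fixed
 * trip count such as 100. */
void copy_plain(uint32_t *dst, const uint32_t *src, unsigned int count)
{
	unsigned int i;

	for (i = 0; i < count; i++)
		dst[i] = src[i];
}

void copy_unrolled(uint32_t *dst, const uint32_t *src, unsigned int count)
{
	unsigned int i;

	/* Main loop: eight elements per iteration. */
	for (i = 0; i < (count & ~7U); i += 8) {
		dst[i] = src[i];
		dst[i + 1] = src[i + 1];
		dst[i + 2] = src[i + 2];
		dst[i + 3] = src[i + 3];
		dst[i + 4] = src[i + 4];
		dst[i + 5] = src[i + 5];
		dst[i + 6] = src[i + 6];
		dst[i + 7] = src[i + 7];
	}
	/* Tail: the remaining 0..7 elements. */
	for (; i < count; i++)
		dst[i] = src[i];
}
```

Comparing the generated code for both functions at the same
optimization level should show whether the manual unrolling still makes
a difference for a given compiler and target.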

>
> This is true of both x86 and ARM.

Much of the code in DPDK is quite old, dating back to a time when
compilers were not good at optimizing, so the developers optimized by
hand.
Some of these optimizations, such as manual loop unrolling, may no
longer be relevant, because modern compilers might do a better job.

I don't know if this is the case here.

Experiments and suggestions for improvements are welcome!

After building DPDK, you can test the ring library performance with this
command:

./app/test/dpdk-test --no-huge ring_perf_autotest

>
> 2. Normalizing to few fixed types:
>
> It looks like we separate out enqueue/dequeue operations into 3
> functions. One for each element size 32, 64, 128.
>
> Again I am not clear on why we are doing this. Both 128 & 64 are
> multiples of 32. Why can't we just normalize everything to 32?
>
> I feel like this is in some shape or form related to loop unrolling.
> But I am not able to figure it out on my own.

They are optimizations for some common use cases.

Please note that the "esize" parameter is known at compile time, so for
e.g. __rte_ring_dequeue_elems() the compiler will choose
__rte_ring_dequeue_elems_64(), __rte_ring_dequeue_elems_128() or the
32-bit loop at build time, and omit the alternatives.
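
A simplified sketch of this kind of compile-time dispatch is shown
here. All names (copy_elems(), copy_elems_32/64/128(), struct elem128,
copy_u64_objs()) are hypothetical stand-ins, not the DPDK functions,
and the always_inline attribute assumes GCC or Clang.

```c
#include <stdint.h>

/* Hypothetical 128-bit element type for this sketch. */
struct elem128 {
	uint64_t w[2];
};

static inline void
copy_elems_32(uint32_t *dst, const uint32_t *src, unsigned int n)
{
	for (unsigned int i = 0; i < n; i++)
		dst[i] = src[i];
}

static inline void
copy_elems_64(uint64_t *dst, const uint64_t *src, unsigned int n)
{
	for (unsigned int i = 0; i < n; i++)
		dst[i] = src[i];
}

static inline void
copy_elems_128(struct elem128 *dst, const struct elem128 *src, unsigned int n)
{
	for (unsigned int i = 0; i < n; i++)
		dst[i] = src[i];
}

/* Always inlined, so a constant "esize" at the call site lets the
 * compiler keep one branch and drop the other two. */
static inline __attribute__((always_inline)) void
copy_elems(void *dst, const void *src, unsigned int n, unsigned int esize)
{
	if (esize == 8)
		copy_elems_64(dst, src, n);
	else if (esize == 16)
		copy_elems_128(dst, src, n);
	else	/* any other multiple of 4: copy as 32-bit words */
		copy_elems_32(dst, src, n * (esize / 4));
}

/* Call site with a constant element size: only the 64-bit loop
 * should remain in the generated code. */
void copy_u64_objs(uint64_t *dst, const uint64_t *src, unsigned int n)
{
	copy_elems(dst, src, n, sizeof(uint64_t));
}
```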

If nothing else, they tell the compiler that "count" (the number of
32-bit elements) is divisible by 4 (for 128-bit element size) or by 2
(for 64-bit element size).

Some CPU architectures have 64- and/or 128-bit registers, so a copy loop
using those registers does not need to be followed by a trailing copy of
any remaining 32-bit words when the element size is known to be 64 or
128 bit.
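
To illustrate the difference, here is a minimal sketch with
hypothetical function names (copy_words_any() and copy_words_x4() are
not DPDK functions). The fixed-size memcpy() calls stand in for one
128-bit load and store; whether the compiler really emits those depends
on the target and the optimization level.

```c
#include <stdint.h>
#include <string.h>

/* Arbitrary number of 32-bit words: after the 128-bit chunks,
 * up to three words may be left over and need a tail loop. */
void copy_words_any(uint32_t *dst, const uint32_t *src, unsigned int n32)
{
	unsigned int i;

	for (i = 0; i + 4 <= n32; i += 4)
		memcpy(&dst[i], &src[i], 16);	/* one 128-bit chunk */
	for (; i < n32; i++)			/* trailing 0..3 words */
		dst[i] = src[i];
}

/* n elements of 128 bit each: exactly 4 * n words, so the copy
 * consists of whole 128-bit chunks and needs no tail loop. */
void copy_words_x4(uint32_t *dst, const uint32_t *src, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++)
		memcpy(&dst[4 * i], &src[4 * i], 16);
}
```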

>
> I am working on a patch that is closely related to this. And I would
> greatly appreciate any assistance anyone can provide on this.
>
> Thank you,
> Aditya Ambadipudi

> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose the
> contents to any other person, use it for any purpose, or store or copy
> the information in any medium. Thank you.

It does not remain confidential when you post it on a public mailing
list.
Please omit this footer when posting to the DPDK mailing lists.


Med venlig hilsen / Kind regards,
-Morten Brørup