DPDK patches and discussions

* Question about loop unrolling in rte_ring datastructure.
@ 2023-11-13 18:14 Aditya Ambadipudi
  2023-11-14  8:26 ` Morten Brørup
  0 siblings, 1 reply; 2+ messages in thread
From: Aditya Ambadipudi @ 2023-11-13 18:14 UTC
  To: dev

Hello all.

My name is Aditya Ambadipudi. I am not the sharpest tool in the shed.

I was reading through the rte_ring data structure, and I have two questions about the optimizations that are being made there.

  1.  Loop unrolling:
https://github.com/DPDK/dpdk/blob/main/lib/ring/rte_ring_elem_pvt.h#L28-L35
Why are we unrolling these loops manually? GCC will generate SIMD instructions for these loops automatically, irrespective of whether or not we unroll them.

Unrolled loop: https://godbolt.org/z/n97noqYn7
Regular loop: https://godbolt.org/z/h6G9o9773

This is true of both x86 and ARM.

  2.  Normalizing to few fixed types:

It looks like we separate out enqueue/dequeue operations into three functions, one for each element size: 32, 64, and 128 bits.

Again I am not clear on why we are doing this. Both 128 & 64 are multiples of 32. Why can't we just normalize everything to 32?

I feel like this is in some shape or form related to loop unrolling. But I am not able to figure it out on my own.

I am working on a patch that is closely related to this. And I would greatly appreciate any assistance anyone can provide on this.

Thank you,
Aditya Ambadipudi
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.


* RE: Question about loop unrolling in rte_ring datastructure.
  2023-11-13 18:14 Question about loop unrolling in rte_ring datastructure Aditya Ambadipudi
@ 2023-11-14  8:26 ` Morten Brørup
  0 siblings, 0 replies; 2+ messages in thread
From: Morten Brørup @ 2023-11-14  8:26 UTC
  To: Aditya Ambadipudi; +Cc: dev

> From: Aditya Ambadipudi [mailto:Aditya.Ambadipudi@arm.com] 
> Sent: Monday, 13 November 2023 19.15
> 
> Hello all.
> 
> My name is Aditya Ambadipudi. I am not the sharpest tool in the shed.
> 
> I was reading through the rte_ring data structure, and I have two questions about the optimizations that are being made there.
> 1. Loop unrolling:
> https://github.com/DPDK/dpdk/blob/main/lib/ring/rte_ring_elem_pvt.h#L28-L35
> Why are we unrolling these loops manually? GCC will generate SIMD instructions for these loops automatically, irrespective of whether or not we unroll them.
> Unrolled loop: https://godbolt.org/z/n97noqYn7
> Regular loop: https://godbolt.org/z/h6G9o9773

You should make "int count" a function parameter, to make it unknown at compile time.
In your examples, the compiler knows that count is 100, and can optimize for that.
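
A minimal sketch of such a test case, suitable for pasting into Compiler Explorer. The names copy_plain(), copy_unrolled() and check_copies() are hypothetical, and this is not the DPDK code; the unrolled variant only mirrors the general idea of copying eight 32-bit elements per iteration and then handling the remainder. Because "n" is a function parameter, the compiler cannot specialize either loop for a constant trip count:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Plain loop: unrolling and vectorization are left to the compiler. */
void copy_plain(uint32_t *dst, const uint32_t *src, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Manually unrolled by 8; a tail loop copies the remaining 0..7 elements. */
void copy_unrolled(uint32_t *dst, const uint32_t *src, uint32_t n)
{
    uint32_t i;

    for (i = 0; i < (n & ~7u); i += 8) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
        dst[i + 4] = src[i + 4];
        dst[i + 5] = src[i + 5];
        dst[i + 6] = src[i + 6];
        dst[i + 7] = src[i + 7];
    }
    for (; i < n; i++)
        dst[i] = src[i];
}

/* Sanity check: both variants must produce identical output for every n. */
void check_copies(void)
{
    uint32_t src[19], a[19], b[19];

    for (uint32_t i = 0; i < 19; i++)
        src[i] = i * 3u + 1u;
    for (uint32_t n = 0; n <= 19; n++) {
        memset(a, 0, sizeof(a));
        memset(b, 0, sizeof(b));
        copy_plain(a, src, n);
        copy_unrolled(b, src, n);
        assert(memcmp(a, b, sizeof(a)) == 0);
    }
}
```

Comparing the assembly of the two copy functions at the same optimization level should give a fairer picture than a loop with a hard-coded count of 100.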

> 
> This is true of both x86 and ARM.

Much of the code in DPDK is quite old, dating back to a time when compilers were not good at optimizing, so the developers optimized by hand.
Some of these optimizations, such as manual loop unrolling, may no longer be relevant, because modern compilers might do a better job on their own.

I don't know if this is the case here.

Experiments and suggestions for improvements are welcome!

After building DPDK, you can test the ring library performance with this command:

./app/test/dpdk-test --no-huge ring_perf_autotest

> 
> 2. Normalizing to few fixed types:
> 
> It looks like we separate out enqueue/dequeue operations into three functions, one for each element size: 32, 64, and 128 bits.
> 
> Again I am not clear on why we are doing this. Both 128 & 64 are multiples of 32. Why can't we just normalize everything to 32?
> 
> I feel like this is in some shape or form related to loop unrolling. But I am not able to figure it out on my own.

They are optimizations for some common use cases.

Please note that the "esize" parameter is known at compile time, so the compiler will e.g. for __rte_ring_dequeue_elems() choose __rte_ring_dequeue_elems_64(), __rte_ring_dequeue_elems_128() or the 32-bit loop at build time, and omit the alternatives.

If nothing else, they tell the compiler that "count" (number of 32 bit elements) is divisible by 4 (for 128 bit element size) or 2 (for 64 bit element size):

Some CPU architectures have 64 and/or 128 bit registers, so a copy loop using those registers does not need to be followed by a trailing copy of any remaining 32-bit words when the element size is known to be 64 or 128 bit.
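
To illustrate the idea, here is a simplified sketch, not the actual rte_ring code. All names in it (copy_elems(), copy_elems_64(), copy_elems_32(), copy_ptr_sized(), check_copy_elems()) are hypothetical, "esize" is in bytes, and the 128-bit case is left out; it would follow the same pattern as the 64-bit one. When copy_elems() is inlined into a caller that passes a constant "esize", the compiler can keep only the matching branch:

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit elements: n counts 64-bit words, so there is never a 32-bit tail. */
static inline void copy_elems_64(uint64_t *dst, const uint64_t *src, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Generic path: any element size that is a multiple of 4 bytes. */
static inline void copy_elems_32(uint32_t *dst, const uint32_t *src, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Dispatch on the element size in bytes; num is the number of elements. */
static inline void copy_elems(void *dst, const void *src,
                              uint32_t esize, uint32_t num)
{
    if (esize == 8)
        copy_elems_64(dst, src, num);
    else
        copy_elems_32(dst, src, num * (esize / 4));
}

/* Caller with a compile-time constant esize: only the 64-bit path is needed. */
void copy_ptr_sized(uint64_t *dst, const uint64_t *src, uint32_t num)
{
    copy_elems(dst, src, sizeof(uint64_t), num);
}

/* Usage: three 8-byte elements, then two 12-byte elements (six 32-bit words). */
void check_copy_elems(void)
{
    uint64_t s64[3] = {1, 2, 3}, d64[3] = {0};
    uint32_t s32[6] = {1, 2, 3, 4, 5, 6}, d32[6] = {0};

    copy_elems(d64, s64, 8, 3);
    copy_elems(d32, s32, 12, 2);
    assert(d64[0] == 1 && d64[2] == 3);
    assert(d32[0] == 1 && d32[5] == 6);
}
```

In the 64-bit path, the loop count is already expressed in 64-bit words, so the compiler does not have to generate code for an odd number of leftover 32-bit words.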

> 
> I am working on a patch that is closely related to this. And I would greatly appreciate any assistance anyone can provide on this. 
> 
> Thank you,
> Aditya Ambadipudi

> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. 

It does not remain confidential when you post it on a public mailing list.
Please omit this footer when posting to the DPDK mailing lists.

Med venlig hilsen / Kind regards,
-Morten Brørup
