* Question: Why doesn’t rte_ring use double-mapped VMA to eliminate wraparound logic?
@ 2025-11-07 16:57 Rami Neiman
2025-11-08 17:14 ` Stephen Hemminger
From: Rami Neiman @ 2025-11-07 16:57 UTC (permalink / raw)
To: dev; +Cc: Duane Pauls
Hi all,
I have a design question regarding rte_ring that I didn’t find a historical rationale for in the archives.
Most modern high-perf ring buffers (e.g. some NIC drivers, some DB queue implementations, etc.) eliminate wrap-around branches by taking the ring’s element array and mapping two consecutive VA ranges to the same physical backing pages.
i.e. you allocate N elements, commit enough pages to cover N, then call mmap (or equivalent) again immediately following it, pointing to the same physical pages. So from the CPU’s POV the element array is logically [0 .. N*2) but physically it’s the same backing. Therefore a batch read/write can always be done as a single contiguous memcpy/CLD/STOS without conditionals, even if (head+bulk) exceeds N.
Pseudo illustration:
[phys buffer of N slots]
VA: [0 .. N) -> phys
VA: [N .. 2N) -> same phys
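The mapping pictured above could be sketched on Linux roughly as follows. This is only an illustration under stated assumptions: `double_map` is a hypothetical helper (not a DPDK API), the backing object comes from `memfd_create` rather than a hugepage memzone, and `len` is assumed to be a multiple of the base page size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for memfd_create() in glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of DPDK: return a window of 2*len bytes
 * in which bytes [len .. 2*len) alias bytes [0 .. len).
 * len must be a non-zero multiple of the page size. Returns NULL on failure.
 */
static void *double_map(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    if (len == 0 || len % (size_t)page != 0)
        return NULL;

    /* Anonymous in-memory file that provides the shared backing pages. */
    int fd = memfd_create("ring", 0);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }

    /* Reserve 2*len of contiguous virtual address space. */
    uint8_t *base = mmap(NULL, 2 * len, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* Map the same file offset into both halves of the reservation. */
    void *lo = mmap(base, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    void *hi = mmap(base + len, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);                  /* the mappings keep the pages alive */
    if (lo == MAP_FAILED || hi == MAP_FAILED) {
        munmap(base, 2 * len);
        return NULL;
    }
    return base;
}
```

A caller would write through `p = double_map(len)` and see the same byte at `p[i]` and `p[len + i]`; the whole window is released with `munmap(p, 2 * len)`.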
For multi-element enqueue/dequeue it eliminates the “if wrap → split” case entirely — you can always memcpy in one contiguous op.
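To make the contrast concrete, here is a minimal sketch of the two copy paths. Both functions are hypothetical and are not the actual rte_ring code; they assume a power-of-two slot count, and the second one additionally assumes the caller has arranged for `slots[size + i]` to alias `slots[i]`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Conventional ring: a bulk write that crosses the end is split in two. */
static void ring_write_split(uint32_t *slots, uint32_t size, uint32_t head,
                             const uint32_t *src, uint32_t n)
{
    assert(size != 0 && (size & (size - 1)) == 0 && n <= size);
    uint32_t idx = head & (size - 1);
    uint32_t first = size - idx;    /* slots left before the physical end */

    if (n <= first) {
        memcpy(&slots[idx], src, n * sizeof(*src));
    } else {
        memcpy(&slots[idx], src, first * sizeof(*src));
        memcpy(&slots[0], src + first, (n - first) * sizeof(*src));
    }
}

/*
 * Double-mapped ring: slots must be addressable for 2*size entries, with
 * the upper half aliasing the lower half, so one copy always suffices.
 */
static void ring_write_mirrored(uint32_t *slots, uint32_t size, uint32_t head,
                                const uint32_t *src, uint32_t n)
{
    assert(size != 0 && (size & (size - 1)) == 0 && n <= size);
    memcpy(&slots[head & (size - 1)], src, n * sizeof(*src));
}
```

The index masking is still needed in the mirrored version; only the wrap check and the second copy go away.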
Question:
Is there an explicit reason DPDK doesn’t use this technique for rte_ring?
e.g.
* portability? (hugepages / VFIO?)
* inability to rely on mmap trickery for hugepage backed memzones?
* NUMA locality considerations?
* historical reason: first gen ring didn’t bulk enqueue so the branch didn’t matter?
* reluctance to add VA aliasing because of security / introspection tooling / ASan issues?
I just want to understand the architectural trade that was made.
On 64-bit Linux, double-mapping a 1–2 MB region is pretty trivial, and bulk ops in DPDK are extremely common, so it feels like an obvious microarchitectural win for the “hot ring” case.
So, is there a concrete blocker, or is it simply that no one pushed a patch because current perf was “good enough”?
Pointers to prior mailing list discussion / patches would be appreciated.
Thanks,
Rami Neiman
* Re: Question: Why doesn’t rte_ring use double-mapped VMA to eliminate wraparound logic?
From: Stephen Hemminger @ 2025-11-08 17:14 UTC (permalink / raw)
To: Rami Neiman; +Cc: dev, Duane Pauls
On Fri, 7 Nov 2025 16:57:37 +0000
Rami Neiman <rami.neiman@solace.com> wrote:
> Hi all,
> I have a design question regarding rte_ring that I didn’t find a historical rationale for in the archives.
> Most modern high-perf ring buffers (e.g. some NIC drivers, some DB queue implementations, etc.) eliminate wrap-around branches by taking the ring’s element array and mapping two consecutive VA ranges to the same physical backing pages.
> i.e. you allocate N elements, commit enough pages to cover N, then call mmap (or equivalent) again immediately following it, pointing to the same physical pages. So from the CPU’s POV the element array is logically [0 .. N*2) but physically it’s the same backing. Therefore a batch read/write can always be done as a single contiguous memcpy/CLD/STOS without conditionals, even if (head+bulk) exceeds N.
> Pseudo illustration:
>
> [phys buffer of N slots]
> VA: [0 .. N) -> phys
> VA: [N .. 2N) -> same phys
>
>
> For multi-element enqueue/dequeue it eliminates the “if wrap → split” case entirely — you can always memcpy in one contiguous op.
> Question:
> Is there an explicit reason DPDK doesn’t use this technique for rte_ring?
> e.g.
Manipulating process mappings in user space is often non-portable. It is possible to do this on Linux
with mmap, but it would require deep changes to the API.