* Optimizing memory access with DPDK allocated memory
From: Antonio Di Bacco <a.dibacco.ks@gmail.com> @ 2022-05-20  8:34 UTC
To: users

Let us say I have two memory channels, each one with its own 16GB memory
module. I suppose the first memory channel will be used when addressing
physical memory in the range 0 to 0x4 0000 0000, and the second when
addressing physical memory in the range 0x4 0000 0000 to 0x7 ffff ffff.
Correct?

Now, I need to have a 2GB buffer with one "writer" and one "reader": the
writer writes to one half of the buffer (call it A) while, in the
meantime, the reader reads from the other half (B). When the writer
finishes writing its half (A), it signals the reader and they swap: the
reader starts to read from A and the writer starts to write to B.

If I allocate the whole buffer (on two 1GB hugepages) across the two
memory channels, so that one half of the buffer sits at the end of the
first channel while the other half sits at the start of the second
channel, would this increase performance compared to allocating the
whole buffer within a single memory channel?
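For concreteness, a minimal sketch of the swap scheme described above,
assuming the two halves are already mapped (e.g. one per 1GB hugepage).
produce() and consume() are application-specific placeholders, and
back-pressure from reader to writer is omitted for brevity:

#include <stdatomic.h>
#include <stdint.h>

#define HALF_SZ (1UL << 30)            /* one 1GB hugepage per half     */

struct pingpong {
    uint8_t   *half[2];                /* A = half[0], B = half[1]      */
    atomic_int ready;                  /* index reader may use; init -1 */
};

void produce(uint8_t *dst, uint64_t len);        /* app-specific fill   */
void consume(const uint8_t *src, uint64_t len);  /* app-specific drain  */

static void writer_loop(struct pingpong *pp)
{
    for (int w = 0;; w ^= 1) {
        produce(pp->half[w], HALF_SZ);
        /* publish: the reader may now consume half 'w' */
        atomic_store_explicit(&pp->ready, w, memory_order_release);
    }
}

static void reader_loop(struct pingpong *pp)
{
    int last = -1;
    for (;;) {
        int r;
        /* spin until the writer publishes a new half */
        while ((r = atomic_load_explicit(&pp->ready,
                                         memory_order_acquire)) == last)
            ;
        consume(pp->half[r], HALF_SZ);
        last = r;
    }
}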
* Re: Optimizing memory access with DPDK allocated memory
From: Stephen Hemminger <stephen@networkplumber.org> @ 2022-05-20 15:48 UTC
To: Antonio Di Bacco; +Cc: users

On Fri, 20 May 2022 10:34:46 +0200
Antonio Di Bacco <a.dibacco.ks@gmail.com> wrote:

> [...]
> If I allocate the whole buffer (on two 1GB hugepages) across the two
> memory channels, so that one half of the buffer sits at the end of the
> first channel while the other half sits at the start of the second
> channel, would this increase performance compared to allocating the
> whole buffer within a single memory channel?

Most systems just interleave memory chips based on the number of filled
slots. This is handled by the BIOS before the kernel even starts.
DPDK has a number-of-memory-channels parameter, and what it does is try
to optimize memory allocation by spreading it across channels.

Looks like you are inventing your own limited version of what memif does.
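For reference, a hedged sketch of how that channel count is supplied: it
is the EAL -n command-line argument, consumed by rte_eal_init(); mempools
created afterwards use it when padding object sizes so that consecutive
object start addresses spread across channels (the binary name and core
list below are placeholders):

#include <rte_eal.h>
#include <rte_debug.h>

int main(int argc, char **argv)
{
    /* e.g.: ./app -l 0-3 -n 4 --socket-mem 2048 */
    if (rte_eal_init(argc, argv) < 0)
        rte_panic("cannot init EAL\n");

    /* mempool object sizing now pads objects so that consecutive
     * object start addresses are spread across the 4 channels */

    rte_eal_cleanup();
    return 0;
}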
* Re: Optimizing memory access with DPDK allocated memory
From: Antonio Di Bacco <a.dibacco.ks@gmail.com> @ 2022-05-21  9:42 UTC
To: Stephen Hemminger; +Cc: users

I read a couple of articles
(https://www.thomas-krenn.com/en/wiki/Optimize_memory_performance_of_Intel_Xeon_Scalable_systems?xtxsearchselecthit=1
and
https://www.exxactcorp.com/blog/HPC/balance-memory-guidelines-for-intel-xeon-scalable-family-processors)
and I understood a little bit more.

If the Xeon memory controller is able to spread contiguous memory
accesses onto different channels in hardware (as Stephen correctly
stated), then how can DPDK's -n option benefit an application?

I also coded a test application that writes a 1GB hugepage and measures
the time needed, but after equipping two additional DIMMs on two unused
channels of my six-channel motherboard (X11DPi-NT), I didn't observe any
improvement. This is strange, because adding two channels to the four
already equipped should make a noticeable difference.

For reference, this is the small program for allocating and writing
memory:
https://github.com/adibacco/simple_mp_mem_2
and the results with 4 memory channels:
https://docs.google.com/spreadsheets/d/1mDoKYLMhMMKDaOS3RuGEnpPgRNKuZOy4lMIhG-1N7B8/edit?usp=sharing

On Fri, May 20, 2022 at 5:48 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> Most systems just interleave memory chips based on the number of
> filled slots. This is handled by the BIOS before the kernel even
> starts.
> [...]
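For readers who want to reproduce such a measurement without the linked
repository, a minimal sketch of this kind of timing test (the memzone
name, fill pattern, and single pass are illustrative; it assumes 1GB
hugepages are available):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <rte_eal.h>
#include <rte_cycles.h>
#include <rte_lcore.h>
#include <rte_memzone.h>

#define SZ (1UL << 30)                     /* one 1GB hugepage */

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    /* fails if no 1GB hugepages are reserved on this socket */
    const struct rte_memzone *mz = rte_memzone_reserve(
        "bw_test", SZ, rte_socket_id(), RTE_MEMZONE_1GB);
    if (mz == NULL)
        return -1;

    uint64_t t0 = rte_rdtsc_precise();
    memset(mz->addr, 0xa5, SZ);            /* streaming writes */
    uint64_t cycles = rte_rdtsc_precise() - t0;

    double secs = (double)cycles / rte_get_tsc_hz();
    printf("write bandwidth: %.2f GB/s\n", SZ / secs / 1e9);

    rte_eal_cleanup();
    return 0;
}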
* Re: Optimizing memory access with DPDK allocated memory
From: Antonio Di Bacco <a.dibacco.ks@gmail.com> @ 2022-05-23 13:16 UTC
To: Stephen Hemminger; +Cc: users

Got feedback from a guy working on HPC with DPDK, and he told me that
with dpdk mem-test (don't know where to find it) I should be doing
16GB/s per channel with DDR4 (2666). In my case, with 6 channels, I
should be doing 90GB/s .... that would be amazing!

On Sat, May 21, 2022 at 11:42 AM Antonio Di Bacco
<a.dibacco.ks@gmail.com> wrote:
>
> If the Xeon memory controller is able to spread contiguous memory
> accesses onto different channels in hardware (as Stephen correctly
> stated), then how can DPDK's -n option benefit an application?
> [...]
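As a sanity check on those numbers: the theoretical peak for one
DDR4-2666 channel is 2666 MT/s x 8 bytes = 21.3 GB/s, so the quoted
16 GB/s is roughly 75% of peak, and six channels at that efficiency give
about 6 x 16 = 96 GB/s, in line with the ~90 GB/s estimate above.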
* Re: Optimizing memory access with DPDK allocated memory
From: Antonio Di Bacco <a.dibacco.ks@gmail.com> @ 2022-05-25  7:30 UTC
To: Stephen Hemminger; +Cc: users

Just to add some more info that could possibly be useful to someone.
Even if a processor has many memory channels, there is another parameter
to take into consideration: a single "core" cannot exploit all the
memory bandwidth available. For example, for DDR4-2933 with 4 channels,
the memory bandwidth is

  2933.33 MT/s x 8 bytes (bus width) x 4 channels = 93,867 MB/s, or ~94 GB/s

but a single core (according to my tests with a DPDK process writing a
1GB hugepage) reaches about 12 GB/s (with a block size exceeding the L3
cache size).

Can anyone confirm that?

On Mon, May 23, 2022 at 3:16 PM Antonio Di Bacco
<a.dibacco.ks@gmail.com> wrote:
>
> Got feedback from a guy working on HPC with DPDK, and he told me that
> with dpdk mem-test (don't know where to find it) I should be doing
> 16GB/s per channel with DDR4 (2666). In my case, with 6 channels, I
> should be doing 90GB/s .... that would be amazing!
> [...]
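One way to probe that single-core ceiling is to run the same
streaming-write loop on several lcores and compare aggregate throughput.
A minimal sketch under stated assumptions (buffer size, repeat count,
and alignment are illustrative; timing is left to external counters or
the helper sketched earlier):

#include <string.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_malloc.h>

#define SZ (256UL << 20)          /* 256MB per core: larger than L3 */

static int worker(void *arg)
{
    (void)arg;
    /* hugepage-backed buffer from the local socket's DPDK heap */
    void *buf = rte_malloc(NULL, SZ, 64);
    if (buf == NULL)
        return -1;
    for (int i = 0; i < 8; i++)   /* repeat to reach steady state */
        memset(buf, 0xa5, SZ);
    rte_free(buf);
    return 0;
}

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    unsigned lcore;
    RTE_LCORE_FOREACH_WORKER(lcore)
        rte_eal_remote_launch(worker, NULL, lcore);
    worker(NULL);                 /* main lcore joins in */
    rte_eal_mp_wait_lcore();
    rte_eal_cleanup();
    return 0;
}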
* RE: Optimizing memory access with DPDK allocated memory
From: Kinsella, Ray <ray.kinsella@intel.com> @ 2022-05-25 10:55 UTC
To: Antonio Di Bacco, Stephen Hemminger; +Cc: users

Hi Antonio,

If it is an Intel platform you are using, you can take a look at the
Intel Memory Latency Checker:
https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html

(don't be fooled by the name, it does measure bandwidth).

Ray K

-----Original Message-----
From: Antonio Di Bacco <a.dibacco.ks@gmail.com>
Sent: Wednesday 25 May 2022 08:30
To: Stephen Hemminger <stephen@networkplumber.org>
Cc: users@dpdk.org
Subject: Re: Optimizing memory access with DPDK allocated memory

Just to add some more info that could possibly be useful to someone.
Even if a processor has many memory channels, there is another parameter
to take into consideration: a single "core" cannot exploit all the
memory bandwidth available. [...]
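A typical invocation, matching the results posted in the follow-up
below (going by the tool's help output, -k pins the measurement threads
to the listed cores; treat the exact flag set as illustrative):

    ./mlc --max_bandwidth -H -k3-10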
* Re: Optimizing memory access with DPDK allocated memory
From: Antonio Di Bacco <a.dibacco.ks@gmail.com> @ 2022-05-25 13:33 UTC
To: Kinsella, Ray; +Cc: Stephen Hemminger, users

Wonderful tool. Now it is completely clear: I do not have a bottleneck
on the DDR itself but on the core-to-DDR interface.

Single core results:
Command line parameters: mlc --max_bandwidth -H -k3

ALL Reads        :  9239.05
3:1 Reads-Writes : 13348.68
2:1 Reads-Writes : 14360.44
1:1 Reads-Writes : 13792.73

Two cores:
Command line parameters: mlc --max_bandwidth -H -k3-4

ALL Reads        : 24666.55
3:1 Reads-Writes : 30905.30
2:1 Reads-Writes : 32256.26
1:1 Reads-Writes : 37349.44

Eight cores:
Command line parameters: mlc --max_bandwidth -H -k3-10

ALL Reads        : 78109.94
3:1 Reads-Writes : 62105.06
2:1 Reads-Writes : 59628.81
1:1 Reads-Writes : 55320.34

On Wed, May 25, 2022 at 12:55 PM Kinsella, Ray <ray.kinsella@intel.com> wrote:
>
> If it is an Intel platform you are using, you can take a look at the
> Intel Memory Latency Checker:
> https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html
>
> (don't be fooled by the name, it does measure bandwidth).
> [...]
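Reading those numbers (mlc reports MB/s): a single core tops out around
9.2 GB/s of reads, while eight cores reach about 78 GB/s — close to
linear scaling, since 8 x 9.2 ≈ 74 GB/s. That is consistent with a
per-core bottleneck rather than a DDR channel limit, and approaches the
theoretical per-socket figures discussed earlier in the thread.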
end of thread, other threads: [~2022-05-25 13:33 UTC | newest]

Thread overview: 7+ messages
2022-05-20  8:34 Optimizing memory access with DPDK allocated memory Antonio Di Bacco
2022-05-20 15:48 ` Stephen Hemminger
2022-05-21  9:42 ` Antonio Di Bacco
2022-05-23 13:16 ` Antonio Di Bacco
2022-05-25  7:30 ` Antonio Di Bacco
2022-05-25 10:55 ` Kinsella, Ray
2022-05-25 13:33 ` Antonio Di Bacco