* [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
@ 2018-09-11  8:15 Arvind Narayanan

From: Arvind Narayanan @ 2018-09-11 8:15 UTC (permalink / raw)
To: users

Hi,

I am trying to write a DPDK application and finding it difficult to achieve
line rate on a 10G NIC. I suspect this has something to do with CPU caches
and related optimizations, and would be grateful if someone could point me
in the right direction.

I wrap every rte_mbuf in my own structure, say my_packet. Here is
my_packet's declaration:

```
struct my_packet {
    struct rte_mbuf *m;
    uint16_t tag1;
    uint16_t tag2;
};
```

During initialization, I reserve a mempool of struct my_packet with 8192
elements. Whenever I form my_packet objects, I get them from the pool in
bursts; similarly, when freeing, I put them back into the pool in bursts.

There is a loop in the datapath which touches each my_packet's tag to make
a decision:

```
for (i = 0; i < pkt_count; i++) {
    if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
                             (void **)&val[i]) < 0) {
    }
}
```

Based on my tests, reading &(my_packet->tag1) is what keeps me from
achieving line rate on the fast path. I say this because if I hardcode
tag1's value, I am able to achieve line rate.

I tried to play with the flags in the rte_mempool_create() call:
-- MEMPOOL_F_NO_SPREAD gives me 8.4 Gbit/s out of 10G
-- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4 Gbit/s but then gradually
   settles to ~8.5 Gbit/s after 20 or 30 seconds
-- no flag gives 7.7 Gbit/s

I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.

As a workaround, I also tried rte_prefetch0() and
rte_prefetch_non_temporal() to prefetch 2 to 8 my_packet(s) ahead in the
my_packet[] array, but nothing seems to boost the throughput.
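Concretely, the prefetch variant I tried looks roughly like the sketch
below (PREFETCH_OFFSET is a placeholder name; I varied it from 2 to 8):

```
/* Illustrative reconstruction of the attempted workaround;
 * PREFETCH_OFFSET is a placeholder varied from 2 to 8. */
for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
    rte_prefetch0(my_packet[i]);

for (i = 0; i < pkt_count; i++) {
    if (i + PREFETCH_OFFSET < pkt_count)
        rte_prefetch0(my_packet[i + PREFETCH_OFFSET]);
    if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
                             (void **)&val[i]) < 0) {
        /* lookup miss: no entry for this tag */
    }
}
```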
Any help or pointers are highly appreciated.

Thanks,
Arvind

* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 14:20 ` Wiles, Keith

From: Wiles, Keith @ 2018-09-11 14:20 UTC (permalink / raw)
To: Arvind Narayanan; +Cc: users

> On Sep 11, 2018, at 3:15 AM, Arvind Narayanan <webguru2688@gmail.com> wrote:
>
> I wrap every rte_mbuf in my own structure, say my_packet. Here is
> my_packet's declaration:
>
> ```
> struct my_packet {
>     struct rte_mbuf *m;
>     uint16_t tag1;
>     uint16_t tag2;
> };
> ```

The only problem you have created is having to pull in another cache line
by accessing the my_packet structure. The mbuf is highly optimized to
limit the number of cache lines required to be pulled into cache for an
mbuf. The mbuf structure is split between RX and TX: when doing TX you
touch one of the two cache lines the mbuf is contained in, and on RX you
touch the other cache line. At least that is the reason for the order of
the members in the mbuf.

For the most part, accessing a packet of data takes about 2-3 cache lines
to load into memory. Getting the prefetches issued far enough in advance
to get the cache lines into the top-level cache is hard to do. In one
case, when I removed the prefetches the performance increased, not
decreased. :-(

Sounds like you are hitting this problem of now loading 4 cache lines,
which causes the CPU to stall. One method is to gather the packets into a
list, prefetch a number of cache lines in advance, and then start
processing the first packet of data. In some cases I have seen prefetching
3 packets' worth of cache lines help. YMMV

You did not list the processor you are using, but Intel Xeon processors
have a limit on the number of outstanding prefetches you can have at a
time; I think 8 is the number. Also, VPP at fd.io uses this method to
prefetch the data and not let the CPU stall.

Look at examples/ip_fragmentation/main.c and the code there that
prefetches mbufs and data structures. I hope that helps.
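Roughly, the pattern looks like this (an illustrative sketch only: the
3-packet lookahead depth and the process_packet() helper are made up, not
code from the example):

```
/* Prefetch a few packets ahead, touching each cache line the
 * processing loop will need: the wrapper, the mbuf, and the data. */
#define LOOKAHEAD 3

for (i = 0; i < pkt_count; i++) {
    if (i + LOOKAHEAD < pkt_count) {
        struct my_packet *next = my_packet[i + LOOKAHEAD];

        rte_prefetch0(next);                              /* wrapper  */
        rte_prefetch0(next->m);                           /* mbuf     */
        rte_prefetch0(rte_pktmbuf_mtod(next->m, void *)); /* pkt data */
    }
    process_packet(my_packet[i]);   /* assumed helper */
}
```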
Regards,
Keith
* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 15:42 ` Arvind Narayanan

From: Arvind Narayanan @ 2018-09-11 15:42 UTC (permalink / raw)
To: keith.wiles, users

Keith, thanks!

My structure's size is 24 bytes, and in that particular for-loop I do not
dereference the rte_mbuf pointer, so my understanding is it shouldn't need
to load 4 cache lines, correct? I am only looking at the tags to make a
decision and then simply moving ahead on the fast path.

I tried the method suggested in the ip_fragmentation example, with several
values of PREFETCH_OFFSET -- 3 to 16 -- but none helped boost throughput.

Here is my CPU info:

Model name:    Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Architecture:  x86_64
L1d cache:     32K
L1i cache:     32K
L2 cache:      256K
L3 cache:      15360K
Flags:         fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
  cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
  nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
  xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
  ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2
  x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi
  flexpriority ept vpid xsaveopt dtherm ida arat pln pts

The only time the performance bumps from 7.7G to ~8.4G (still not close to
10G/100%) is when I add flags such as MEMPOOL_F_NO_SPREAD or
MEMPOOL_F_NO_CACHE_ALIGN.

Just to provide some more context, I isolate the fast-path CPU cores from
the kernel, so they are fully dedicated to the fast-path pipeline.
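Roughly, the isolation setup looks like this (an illustrative sketch; the
kernel command-line value is an assumption, matching the cores used
elsewhere in this thread):

```
# Keep the fast-path CPUs away from the kernel scheduler
# (kernel command line; assumed values):
#   isolcpus=1-5
# Then pin the application onto those cores via the EAL -l option:
./build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3
```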
Thanks,
Arvind
* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 16:52 ` Wiles, Keith

From: Wiles, Keith @ 2018-09-11 16:52 UTC (permalink / raw)
To: Arvind Narayanan; +Cc: users

> On Sep 11, 2018, at 10:42 AM, Arvind Narayanan <webguru2688@gmail.com> wrote:
>
> My structure's size is 24 bytes, and in that particular for-loop I do
> not dereference the rte_mbuf pointer, so my understanding is it
> shouldn't need to load 4 cache lines, correct? I am only looking at the
> tags to make a decision and then simply moving ahead on the fast path.

The mbufs do get accessed by the Rx path, so a cache line is pulled. If
you are not accessing the mbuf structure or data, then I am not sure what
the problem is. Does your my_packet structure start on a cache line, and
have you tried putting each structure on its own cache line using
__rte_cache_aligned (see the sketch at the end of this mail)?

Have you used VTune or some of the other tools on the Intel site?
https://software.intel.com/en-us/intel-vtune-amplifier-xe

Not sure about cost or anything. VTune is a great tool, but for me it does
have some learning curve to understand the output.

A Xeon core of this type should be able to forward packets nicely at 10G
with 64-byte frames. Maybe just do the normal Rx and then send the packets
back out like a dumb forwarder, without all of the processing, and see
what rate you get. Are the NIC(s) and cores on the same socket, if you
have a multi-socket system? Just shooting in the dark here.

Also, did you try the l2fwd or l3fwd example and see if that app can get
to 10G?
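For the alignment question above, I mean something like this (illustrative
only):

```
/* Pad the 24-byte wrapper up to a full cache line so each
 * my_packet starts on its own cache line. */
struct my_packet {
    struct rte_mbuf *m;
    uint16_t tag1;
    uint16_t tag2;
} __rte_cache_aligned;
```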
Regards,
Keith
* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 17:18 ` Arvind Narayanan

From: Arvind Narayanan @ 2018-09-11 17:18 UTC (permalink / raw)
To: keith.wiles; +Cc: users

If I don't do any processing, I easily get 10G. It is only when I access
the tag that the throughput drops. What confuses me is that if I use the
following snippet, it works at line rate:

```
int temp_key = 1; // declared outside of the for loop

for (i = 0; i < pkt_count; i++) {
    if (rte_hash_lookup_data(rx_table, &(temp_key),
                             (void **)&val[i]) < 0) {
    }
}
```

But as soon as I replace `temp_key` with `my_packet->tag1`, I see the drop
in throughput (which in a way confirms the issue is due to cache misses).

__rte_cache_aligned may not be required, as the mempool from which I pull
the pre-allocated structs already cache-aligns them. But let me add it to
the struct as well to make sure.

Yes, I did come across VTune, and there is a free trial period which I
guess would help me confirm that it is due to the cache misses.

l2fwd and l3fwd easily achieve 10G. :(

Thanks,
Arvind
* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 18:07 ` Stephen Hemminger

From: Stephen Hemminger @ 2018-09-11 18:07 UTC (permalink / raw)
To: Arvind Narayanan; +Cc: keith.wiles, users

On Tue, 11 Sep 2018 12:18:42 -0500
Arvind Narayanan <webguru2688@gmail.com> wrote:

> If I don't do any processing, I easily get 10G. It is only when I access
> the tag that the throughput drops.

<snip>

Your packet data is not in cache. Doing prefetch can help, but it is very
timing sensitive. If the prefetch is done before the data is available, it
won't help. And if the prefetch is done just before the data is used,
there aren't enough cycles to get it from memory into the cache.
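As a rough rule of thumb (the numbers below are illustrative assumptions,
not measurements), the prefetch has to be issued far enough ahead that the
per-packet work covers the memory latency:

```
/* offset >= memory_latency_cycles / work_cycles_per_packet
 *
 * e.g. ~200 cycles of DRAM latency and ~50 cycles of work per
 * packet would suggest prefetching at least ~4 packets ahead. */
#define PREFETCH_OFFSET 4
```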
* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 18:39 ` Arvind Narayanan

From: Arvind Narayanan @ 2018-09-11 18:39 UTC (permalink / raw)
To: stephen; +Cc: keith.wiles, users

Stephen, thanks!

That is it! Not sure if there is any workaround.

So, essentially, what I am doing is: core 0 gets a burst of my_packet(s)
from its pre-allocated mempool and then (bulk) enqueues them into a
rte_ring. Core 1 then (bulk) dequeues from this ring, and when it accesses
the data pointed to by the ring's element (i.e. my_packet->tag1), this
memory-access latency issue shows up. I cannot advance the prefetch any
earlier. Is there any clever workaround (or hack) to overcome this other
than using the same core for all the functions? For example, can I
prefetch the packets on core 0 into core 1's cache (could be a dumb
question!)?

Thanks,
Arvind
* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 19:12 ` Stephen Hemminger

From: Stephen Hemminger @ 2018-09-11 19:12 UTC (permalink / raw)
To: Arvind Narayanan; +Cc: keith.wiles, users

On Tue, 11 Sep 2018 13:39:24 -0500
Arvind Narayanan <webguru2688@gmail.com> wrote:

<snip>

In my experience, if you want performance then don't pass packets between
cores. It is slightly less bad if the core that does the passing does not
access the packet. It is really bad if the handling core writes the
packet, especially across cores with greater cache distance (NUMA). If you
have to pass packets between cores, then use cores which share a
hyper-thread (siblings on the same physical core).
* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-12  8:22 ` Van Haaren, Harry

From: Van Haaren, Harry @ 2018-09-12 8:22 UTC (permalink / raw)
To: Arvind Narayanan, stephen; +Cc: Wiles, Keith, users

> -----Original Message-----
> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Arvind Narayanan
> Sent: Tuesday, September 11, 2018 7:39 PM

<snip>

> So, essentially, what I am doing is: core 0 gets a burst of my_packet(s)
> from its pre-allocated mempool and then (bulk) enqueues them into a
> rte_ring. Core 1 then (bulk) dequeues from this ring, and when it
> accesses the data pointed to by the ring's element (i.e. my_packet->tag1)

You say "bulk" here. Are you using "bulk" or "burst"?

Burst: http://doc.dpdk.org/api/rte__ring_8h.html#aff58e6a47ea3dca494dd0391d11b38ea
Bulk: http://doc.dpdk.org/api/rte__ring_8h.html#ab8debfb458e927d559e7ce750048502d

Try using the "burst" dequeue, which returns the maximum number of packets
available, even if that is fewer than the size of the array you provided.
Bulk will fail to dequeue anything unless your threshold of MAX is
reached, which means you will likely stall the consumer core waiting until
MAX and then play catch-up again.
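The difference, sketched (illustrative only; the ring, array, and size
names are assumptions):

```
#define MAX_BURST 32

void *pkts[MAX_BURST];
unsigned int avail;

/* Burst: returns 0..MAX_BURST -- whatever is currently in the ring. */
unsigned int n = rte_ring_dequeue_burst(ring, pkts, MAX_BURST, &avail);

/* Bulk: all-or-nothing -- returns MAX_BURST, or 0 if fewer than
 * MAX_BURST entries are available. */
unsigned int m = rte_ring_dequeue_bulk(ring, pkts, MAX_BURST, &avail);
```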
* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 19:36 ` Pierre Laurent

From: Pierre Laurent @ 2018-09-11 19:36 UTC (permalink / raw)
To: users, webguru2688

Can I suggest a few steps for investigating more?

First, verify that the L1 cache is really the suspect. This can be done
simply with the counter L1-dcache-load-misses. The simplest tool is
"perf", which is part of the linux-tools packages:

$ apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`

$ sudo perf stat -d -d -d ./build/rxtx
EAL: Detected 12 lcore(s)
....
^C

 Performance counter stats for './build/rxtx':

       1413.787490      task-clock (msec)         #    0.923 CPUs utilized
                18      context-switches          #    0.013 K/sec
                 4      cpu-migrations            #    0.003 K/sec
               238      page-faults               #    0.168 K/sec
     4,436,904,124      cycles                    #    3.138 GHz                      (32.67%)
     3,888,094,815      stalled-cycles-frontend   #   87.63% frontend cycles idle    (32.94%)
       237,378,065      instructions              #    0.05  insn per cycle
                                                  #   16.38  stalled cycles per insn (39.73%)
        76,863,834      branches                  #   54.367 M/sec                   (40.01%)
           101,550      branch-misses             #    0.13% of all branches         (40.30%)
        94,805,298      L1-dcache-loads           #   67.058 M/sec                   (39.77%)
       263,530,291      L1-dcache-load-misses     #  277.97% of all L1-dcache hits   (13.77%)
           425,934      LLC-loads                 #    0.301 M/sec                   (13.60%)
           181,295      LLC-load-misses           #   42.56% of all LL-cache hits    (20.21%)
   <not supported>      L1-icache-loads
           775,365      L1-icache-load-misses                                        (26.71%)
        70,580,827      dTLB-loads                #   49.923 M/sec                   (25.46%)
             2,474      dTLB-load-misses          #    0.00% of all dTLB cache hits  (13.01%)
               277      iTLB-loads                #    0.196 K/sec                   (13.01%)
               994      iTLB-load-misses          #  358.84% of all iTLB cache hits  (19.52%)
   <not supported>      L1-dcache-prefetches
             7,204      L1-dcache-prefetch-misses #    0.005 M/sec                   (26.03%)

       1.531809863 seconds time elapsed

One of the common mistakes is to have excessively large tx and rx queues,
which in turn helps trigger excessively large bursts. Your L1 cache is
32K, that is, 512 cache lines. The L1 cache is not elastic; 512 cache
lines is not much. If the bursts you are processing happen to be more than
approx 128 buffers, then you will be thrashing the cache when running your
loop. I notice that you use a pool of 8192 of your buffers, and if you use
them round-robin, then you have a perfect recipe for cache thrashing. If
so, then prefetch would help.

rte_hash_lookup looks into cache lines too (at least 3 per successful
invocation). If you use the same key, then rte_hash_lookup will look into
the same cache lines; if your keys are randomly distributed, then it is
another recipe for cache thrashing.

It is not clear from your description whether the core which reads the
bursts from the DPDK PMD is the same as the core which does the
processing. If one core touches your buffers (e.g. tag1) and then you pass
a buffer to another core, you get LLC coherency overheads, which would
also trigger LLC-load-misses (which you can detect through the perf output
above).

It seems you have this type of processor (codename Sandy Bridge, 6 cores,
hyper-threading enabled):

https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI

Can you double check that your application runs with the right core
assignment?
Since hyper-threading is enabled, you should not use core 0 (plenty of
Linux kernel functions run on core 0) nor core 6 (which is the same
hardware as core 0), and make sure the hyper-thread sibling of each core
you are running on is not used either. You can get the CPU <--> core
assignment with the lscpu tool:

$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0
6,0,0,0,,0,0,0,0
7,1,0,0,,1,1,1,0
8,2,0,0,,2,2,2,0
9,3,0,0,,3,3,3,0
10,4,0,0,,4,4,4,0
11,5,0,0,,5,5,5,0

If you do not need hyper-threading, and if the L1 cache is your
bottleneck, you might want to disable hyper-threading and get 64K bytes of
L1 cache per core. If you really need hyper-threading, then use less cache
in your code by tuning the buffer pool sizes better.

SW prefetch is quite difficult to use efficiently. There are 4 different
hardware prefetchers with different algorithms (adjacent cache lines,
stride access, ...) where the use of the prefetch instruction is
unnecessary, and there is a hw limit of about 8 pending L1 data cache
misses (sometimes documented as 5, sometimes documented as 10). This
creates a serious burden of software complexity to abide by the hw rules.

https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

Just verify that the hardware prefetchers are all enabled through MSR
0x1A4; some BIOSes might have created a different setup.
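For example (assuming the msr-tools package is installed; illustrative
only):

```
sudo modprobe msr
sudo rdmsr -a 0x1a4   # one value per core; 0 means all 4 HW prefetchers enabled
```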
* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 21:49 ` Arvind Narayanan

From: Arvind Narayanan @ 2018-09-11 21:49 UTC (permalink / raw)
To: pierre; +Cc: users

Stephen and Pierre, thanks!

Pierre, all points noted. As per Pierre's suggestions, I ran perf stat on
the application. Here are the results. Using the pktgen default
configuration, I send 100M packets on a 10G line.

This is when I use my_packet->tag1 for the lookup, where the throughput
drops to 8.4G/10G:

 Performance counter stats for './build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3':

      47453.031698      task-clock (msec)         #    1.830 CPUs utilized
                77      context-switches          #    0.002 K/sec
                 6      cpu-migrations            #    0.000 K/sec
               868      page-faults               #    0.018 K/sec
   113,357,285,372      cycles                    #    2.389 GHz                      (49.95%)
    53,324,793,523      stalled-cycles-frontend   #   47.04% frontend cycles idle    (49.95%)
    27,161,539,189      stalled-cycles-backend    #   23.96% backend cycles idle     (49.96%)
   191,560,395,309      instructions              #    1.69  insn per cycle
                                                  #    0.28  stalled cycles per insn (56.22%)
    36,872,293,868      branches                  #  777.027 M/sec                   (56.23%)
        13,801,124      branch-misses             #    0.04% of all branches         (56.24%)
    67,524,214,383      L1-dcache-loads           # 1422.969 M/sec                   (56.24%)
     1,015,922,260      L1-dcache-load-misses     #    1.50% of all L1-dcache hits   (56.26%)
       619,670,574      LLC-loads                 #   13.059 M/sec                   (56.29%)
            82,917      LLC-load-misses           #    0.01% of all LL-cache hits    (56.31%)
   <not supported>      L1-icache-loads
         2,059,915      L1-icache-load-misses                                        (56.30%)
    67,641,851,208      dTLB-loads                # 1425.448 M/sec                   (56.29%)
           151,760      dTLB-load-misses          #    0.00% of all dTLB cache hits  (50.01%)
               904      iTLB-loads                #    0.019 K/sec                   (50.01%)
            10,309      iTLB-load-misses          # 1140.38% of all iTLB cache hits  (50.00%)
   <not supported>      L1-dcache-prefetches
       528,633,571      L1-dcache-prefetch-misses #   11.140 M/sec                   (49.97%)

      25.929843368 seconds time elapsed

This is when I use the temp_key approach:

 Performance counter stats for './build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3':

      42614.775381      task-clock (msec)         #    1.729 CPUs utilized
                71      context-switches          #    0.002 K/sec
                 6      cpu-migrations            #    0.000 K/sec
               869      page-faults               #    0.020 K/sec
    99,422,031,536      cycles                    #    2.333 GHz                      (49.89%)
    43,615,501,744      stalled-cycles-frontend   #   43.87% frontend cycles idle    (49.91%)
    21,325,495,955      stalled-cycles-backend    #   21.45% backend cycles idle     (49.95%)
   170,398,414,529      instructions              #    1.71  insn per cycle
                                                  #    0.26  stalled cycles per insn (56.22%)
    32,543,342,205      branches                  #  763.663 M/sec                   (56.26%)
        52,276,245      branch-misses             #    0.16% of all branches         (56.30%)
    58,855,845,003      L1-dcache-loads           # 1381.114 M/sec                   (56.33%)
     1,046,059,603      L1-dcache-load-misses     #    1.78% of all L1-dcache hits   (56.34%)
       598,557,493      LLC-loads                 #   14.046 M/sec                   (56.35%)
            84,048      LLC-load-misses           #    0.01% of all LL-cache hits    (56.35%)
   <not supported>      L1-icache-loads
         2,150,306      L1-icache-load-misses                                        (56.33%)
    58,942,694,476      dTLB-loads                # 1383.152 M/sec                   (56.29%)
           147,013      dTLB-load-misses          #    0.00% of all dTLB cache hits  (49.97%)
            22,392      iTLB-loads                #    0.525 K/sec                   (49.93%)
             5,839      iTLB-load-misses          #   26.08% of all iTLB cache hits  (49.90%)
   <not supported>      L1-dcache-prefetches
       533,602,543      L1-dcache-prefetch-misses #   12.522 M/sec                   (49.89%)

      24.647230934 seconds time elapsed

Not sure if I am understanding it correctly, but there are a lot more
iTLB-load-misses in the lower-throughput perf output.

> One of the common mistakes is to have excessively large tx and rx
> queues, which in turn helps trigger excessively large bursts.
> Your L1 cache is 32K, that is, 512 cache lines. The L1 cache is not
> elastic; 512 cache lines is not much. If the bursts you are processing
> happen to be more than approx 128 buffers, then you will be thrashing
> the cache when running your loop. I notice that you use a pool of 8192
> of your buffers, and if you use them round-robin, then you have a
> perfect recipe for cache thrashing. If so, then prefetch would help.

You raised a very good point here, and I think DPDK's "writing efficient
code" page
(https://doc.dpdk.org/guides/prog_guide/writing_efficient_code.html) could
have a section on this topic -- or maybe I missed it if DPDK already
documents how to choose RX and TX ring sizes. Without knowing the compute
load of each part of the data path, people like me just assign arbitrary
2^n values (I blame myself here though). My sizes are as follows (see the
sketch at the end of this mail):

rte_mbuf pool size is 4096
rx_ring and tx_ring sizes are 1024
rings used to communicate between cores are 8192
my_packet mempool is 8192
MAX_BURST_SIZE for all the loops in the DPDK application is set to 32

> It is not clear from your description whether the core which reads the
> bursts from the DPDK PMD is the same as the core which does the
> processing. If one core touches your buffers (e.g. tag1) and then you
> pass a buffer to another core, you get LLC coherency overheads, which
> would also trigger LLC-load-misses.

I isolate CPUs 1,2,3,4,5 from the kernel, thus leaving 0 for kernel
operations. Core 2 (which runs an infinite RX/TX loop) reads the packets
from the DPDK PMD and sets the tag1 values, while core 4 looks up the
rte_hash table using tag1 as the key and proceeds further.

> It seems you have this type of processor (codename Sandy Bridge, 6
> cores, hyper-threading enabled). Can you double check that your
> application runs with the right core assignment?

I had HT disabled for all the experiments. Here is the output of lscpu -p:

# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0
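For completeness, here is roughly how those sizes map onto the API calls
(an illustrative sketch only; the names and the mempool cache size are
mine, not the actual application code):

```
#define MAX_BURST_SIZE 32

/* 4096 mbufs; the per-core cache of 256 is an assumed value. */
struct rte_mempool *mb_pool = rte_pktmbuf_pool_create(
    "mbuf_pool", 4096, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
    rte_socket_id());

/* 1024-descriptor RX queue on port 0, queue 0. */
rte_eth_rx_queue_setup(0, 0, 1024, rte_eth_dev_socket_id(0),
                       NULL, mb_pool);

/* 8192-entry ring between core 2 (producer) and core 4 (consumer). */
struct rte_ring *tag_ring = rte_ring_create(
    "tag_ring", 8192, rte_socket_id(),
    RING_F_SP_ENQ | RING_F_SC_DEQ);
```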
Thanks,
Arvind
Thread overview: 11+ messages
2018-09-11  8:15 [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path Arvind Narayanan
2018-09-11 14:20 ` Wiles, Keith
2018-09-11 15:42   ` Arvind Narayanan
2018-09-11 16:52     ` Wiles, Keith
2018-09-11 17:18       ` Arvind Narayanan
2018-09-11 18:07         ` Stephen Hemminger
2018-09-11 18:39           ` Arvind Narayanan
2018-09-11 19:12             ` Stephen Hemminger
2018-09-12  8:22             ` Van Haaren, Harry
2018-09-11 19:36           ` Pierre Laurent
2018-09-11 21:49             ` Arvind Narayanan