DPDK usage discussions
* [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
@ 2018-09-11  8:15 Arvind Narayanan
  2018-09-11 14:20 ` Wiles, Keith
  0 siblings, 1 reply; 11+ messages in thread
From: Arvind Narayanan @ 2018-09-11  8:15 UTC (permalink / raw)
  To: users

Hi,

I am trying to write a DPDK application and finding it difficult to achieve
line rate on a 10G NIC. I feel this has something to do with CPU caches and
related optimizations, and would be grateful if someone could point me in
the right direction.

I wrap every rte_mbuf into my own structure, say my_packet. Here is
my_packet's structure declaration:

```
struct my_packet {
    struct rte_mbuf *m;
    uint16_t tag1;
    uint16_t tag2;
};
```

During initialization, I reserve a mempool of type struct my_packet with
8192 elements. Whenever I form my_packet(s), I get them in bursts;
similarly, for freeing, I put them back into the pool in bursts.
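
For reference, a minimal sketch of what I mean (not my exact code; the pool
name, the burst size of 32, and the error handling are placeholders):

```
#include <rte_mempool.h>
#include <rte_lcore.h>

#define BURST 32

static struct rte_mempool *my_packet_pool;

static void setup_pool(void)
{
    my_packet_pool = rte_mempool_create("my_packet_pool", 8192,
            sizeof(struct my_packet), 256 /* per-lcore cache */, 0,
            NULL, NULL, NULL, NULL, rte_socket_id(), 0);
}

static void burst_alloc_free(void)
{
    struct my_packet *pkts[BURST];

    /* Allocation happens in bursts... */
    if (rte_mempool_get_bulk(my_packet_pool, (void **)pkts, BURST) < 0)
        return; /* pool exhausted */

    /* ... set m, tag1 and tag2 on each element here ... */

    /* ...and freeing happens in bursts too. */
    rte_mempool_put_bulk(my_packet_pool, (void **)pkts, BURST);
}
```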

So there is a loop in the datapath which touches each my_packet's tag to
make a decision.

```
for (i = 0; i < pkt_count; i++) {
    if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
                             (void **)&val[i]) < 0) {
    }
}
```

Based on my tests, &(my_packet[i]->tag1) is what keeps me from achieving
line rate in the fast path. I say this because if I hardcode tag1's value,
I am able to achieve line rate. As a workaround, I tried to use
rte_prefetch0() and rte_prefetch_non_temporal() to prefetch 2 to 8
my_packet(s) from the my_packet[] array, but nothing seems to boost the
throughput.
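
Roughly, what I tried looks like this (a sketch; the offset of 4 is just one
of the values I tried):

```
for (i = 0; i < pkt_count; i++) {
    /* prefetch a few elements ahead; also tried rte_prefetch_non_temporal() */
    if (i + 4 < pkt_count)
        rte_prefetch0(my_packet[i + 4]);
    if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
                             (void **)&val[i]) < 0) {
    }
}
```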

I tried to play with the flags in the rte_mempool_create() function call:
-- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G
-- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
settles to ~8.5G after 20 or 30 seconds.
-- NO FLAG gives 7.7G

I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.

Any help or pointers are highly appreciated.

Thanks,
Arvind

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11  8:15 [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path Arvind Narayanan
@ 2018-09-11 14:20 ` Wiles, Keith
  2018-09-11 15:42   ` Arvind Narayanan
  0 siblings, 1 reply; 11+ messages in thread
From: Wiles, Keith @ 2018-09-11 14:20 UTC (permalink / raw)
  To: Arvind Narayanan; +Cc: users



> On Sep 11, 2018, at 3:15 AM, Arvind Narayanan <webguru2688@gmail.com> wrote:
> 
> Hi,
> 
> I am trying to write a DPDK application and finding it difficult to achieve
> line rate on a 10G NIC. I feel this has something to do with CPU caches and
> related optimizations, and would be grateful if someone could point me in
> the right direction.
> 
> I wrap every rte_mbuf into my own structure, say my_packet. Here is
> my_packet's structure declaration:
> 
> ```
> struct my_packet {
>     struct rte_mbuf *m;
>     uint16_t tag1;
>     uint16_t tag2;
> };
> ```

The only problem you have created is having to pull in another cache line by having to access the my_packet structure. The mbuf is highly optimized to limit the number of cache lines required to be pulled into cache for an mbuf. The mbuf structure is split between RX and TX: when doing TX you touch one of the two cache lines the mbuf is contained in, and on RX you touch the other cache line. At least that is the reason for the order of the members in the mbuf.

For the most part, accessing a packet of data takes about 2-3 cache lines to load into memory. Getting the prefetches issued far enough in advance to get the cache lines into the top-level cache is hard to do. In one case, when I removed the prefetches the performance increased, not decreased. :-(

Sounds like you are hitting this problem of now loading 4 cache lines, and this causes the CPU to stall. One method is to collect the packets in a list, then prefetch a number of cache lines in advance, then start processing the first packet of data. In some cases I have seen that prefetching 3 packets' worth of cache lines helps. YMMV

You did not list the processor you are using, but Intel Xeon processors have a limit to the number of outstanding prefetches you can have at a time; I think 8 is the number. Also, VPP at fd.io uses this method too, in order to prefetch the data and not allow the CPU to stall.

Look at examples/ip_fragmentation/main.c and the code that prefetches mbufs and data structures. I hope that one helps.
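
The pattern there looks roughly like the sketch below, adapted to your
my_packet[] array (PREFETCH_OFFSET and process_packet() are placeholder
names, not code lifted from that example; pkt_count is assumed to be a
signed int):

```
#include <rte_prefetch.h>

#define PREFETCH_OFFSET 3

int i;

/* Prefetch the first few wrappers. */
for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
    rte_prefetch0(my_packet[i]);

/* Prefetch ahead while processing packets already in cache. */
for (i = 0; i < pkt_count - PREFETCH_OFFSET; i++) {
    rte_prefetch0(my_packet[i + PREFETCH_OFFSET]);
    process_packet(my_packet[i]);
}

/* Handle the tail that was prefetched but not yet processed. */
for (; i < pkt_count; i++)
    process_packet(my_packet[i]);
```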

> 
> During initialization, I reserve a mempool of type struct my_packet with
> 8192 elements. Whenever I form my_packet, I get them in bursts, similarly
> for freeing I put them back into pool as bursts.
> 
> So there is a loop in the datapath which touches each of these my_packet's
> tag to make a decision.
> 
> ```
> for (i = 0; i < pkt_count; i++) {
>    if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1), (void
> **)&val[i]) < 0) {
>    }
> }
> ```
> 
> Based on my tests, &(my_packet->tag1) is the cause for not letting me
> achieve line rate in the fast path. I say this because if I hardcode the
> tag1's value, I am able to achieve line rate. As a workaround, I tried to
> use rte_prefetch0() and rte_prefetch_non_temporal() to prefetch 2 to 8
> my_packet(s) from my_packet[] array, but nothing seems to boost the
> throughput.
> 
> I tried to play with the flags in rte_mempool_create() function call:
> -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G
> -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
> settles to ~8.5G after 20 or 30 seconds.
> -- NO FLAG gives 7.7G
> 
> I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
> 
> Any help or pointers are highly appreciated.
> 
> Thanks,
> Arvind

Regards,
Keith

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 14:20 ` Wiles, Keith
@ 2018-09-11 15:42   ` Arvind Narayanan
  2018-09-11 16:52     ` Wiles, Keith
  0 siblings, 1 reply; 11+ messages in thread
From: Arvind Narayanan @ 2018-09-11 15:42 UTC (permalink / raw)
  To: keith.wiles, users

Keith, thanks!

My structure's size is 24 bytes, and for that particular for-loop, I do not
dereference the rte_mbuf pointer, hence my understanding is it shouldn't
require loading 4 cache lines, correct?
I am only looking at the tags to make a decision and then simply move ahead
on the fast-path.

I tried the method suggested in the ip_fragmentation example. I tried several
values of PREFETCH_OFFSET -- 3 to 16, but none helped boost throughput.

Here is my CPU info:

Model name:            Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Architecture:          x86_64
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic
popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi
flexpriority ept vpid xsaveopt dtherm ida arat pln pts

Just to provide some more context, I isolate the CPU core used for the
fast-path from the kernel, hence this core is fully dedicated to the
fast-path pipeline.

The only time the performance bumps from 7.7G to ~8.4G (still not
close to 10G/100%) is when I add flags such as MEMPOOL_F_NO_SPREAD or
MEMPOOL_F_NO_CACHE_ALIGN.

Thanks,
Arvind

---------- Forwarded message ---------
From: Wiles, Keith <keith.wiles@intel.com>
Date: Tue, Sep 11, 2018 at 9:20 AM
Subject: Re: [dpdk-users] How to use software prefetching for custom
structures to increase throughput on the fast path
To: Arvind Narayanan <webguru2688@gmail.com>
Cc: users@dpdk.org <users@dpdk.org>




> On Sep 11, 2018, at 3:15 AM, Arvind Narayanan <webguru2688@gmail.com> wrote:
> 
> Hi,
> 
> I am trying to write a DPDK application and finding it difficult to achieve
> line rate on a 10G NIC. I feel this has something to do with CPU caches and
> related optimizations, and would be grateful if someone could point me in
> the right direction.
>
> I wrap every rte_mbuf into my own structure say, my_packet. Here is
> my_packet's structure declaration:
>
> ```
> struct my_packet {
>     struct rte_mbuf *m;
>     uint16_t tag1;
>     uint16_t tag2;
> };
> ```

The only problem you have created is having to pull in another cache line
by having to access the my_packet structure. The mbuf is highly optimized to
limit the number of cache lines required to be pulled into cache for an
mbuf. The mbuf structure is split between RX and TX: when doing TX you
touch one of the two cache lines the mbuf is contained in, and on RX you
touch the other cache line. At least that is the reason for the order of
the members in the mbuf.

For the most part, accessing a packet of data takes about 2-3 cache lines to
load into memory. Getting the prefetches issued far enough in advance to get
the cache lines into the top-level cache is hard to do. In one case, when I
removed the prefetches the performance increased, not decreased. :-(

Sounds like you are hitting this problem of now loading 4 cache lines, and
this causes the CPU to stall. One method is to collect the packets in a
list, then prefetch a number of cache lines in advance, then start
processing the first packet of data. In some cases I have seen that
prefetching 3 packets' worth of cache lines helps. YMMV

You did not list the processor you are using, but Intel Xeon processors have
a limit to the number of outstanding prefetches you can have at a time; I
think 8 is the number. Also, VPP at fd.io uses this method too in order
to prefetch the data and not allow the CPU to stall.

Look at examples/ip_fragmentation/main.c and the code that
prefetches mbufs and data structures. I hope that one helps.

>
> During initialization, I reserve a mempool of type struct my_packet with
> 8192 elements. Whenever I form my_packet, I get them in bursts, similarly
> for freeing I put them back into pool as bursts.
>
> So there is a loop in the datapath which touches each of these my_packet's
> tag to make a decision.
>
> ```
> for (i = 0; i < pkt_count; i++) {
>    if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1), (void
> **)&val[i]) < 0) {
>    }
> }
> ```
>
> Based on my tests, &(my_packet->tag1) is the cause for not letting me
> achieve line rate in the fast path. I say this because if I hardcode the
> tag1's value, I am able to achieve line rate. As a workaround, I tried to
> use rte_prefetch0() and rte_prefetch_non_temporal() to prefetch 2 to 8
> my_packet(s) from my_packet[] array, but nothing seems to boost the
> throughput.
>
> I tried to play with the flags in rte_mempool_create() function call:
> -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G
> -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
> settles to ~8.5G after 20 or 30 seconds.
> -- NO FLAG gives 7.7G
>
> I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
>
> Any help or pointers are highly appreciated.
>
> Thanks,
> Arvind

Regards,
Keith

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 15:42   ` Arvind Narayanan
@ 2018-09-11 16:52     ` Wiles, Keith
  2018-09-11 17:18       ` Arvind Narayanan
  0 siblings, 1 reply; 11+ messages in thread
From: Wiles, Keith @ 2018-09-11 16:52 UTC (permalink / raw)
  To: Arvind Narayanan; +Cc: users



> On Sep 11, 2018, at 10:42 AM, Arvind Narayanan <webguru2688@gmail.com> wrote:
> 
> Keith, thanks!
> 
> My structure's size is 24 bytes, and for that particular for-loop, I do not dereference the rte_mbuf pointer, hence my understanding is it shouldn't require loading 4 cache lines, correct?
> I am only looking at the tags to make a decision and then simply move ahead on the fast-path.

The mbufs do get accessed by the Rx path, so a cache line is pulled. If you are not accessing the mbuf structure or data, then I am not sure what the problem is. Is the my_packet structure starting on a cache line, and have you tried putting each structure on its own cache line using __rte_cache_aligned?
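
A sketch of that suggestion, reusing your declaration from above (untested;
note the alignment pads each 24-byte element out to a full 64-byte cache
line):

```
#include <rte_memory.h>   /* for __rte_cache_aligned */

/* Pad each wrapper to its own cache line so two my_packet elements
 * never share a line. */
struct my_packet {
    struct rte_mbuf *m;
    uint16_t tag1;
    uint16_t tag2;
} __rte_cache_aligned;
```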

Have you used VTune or some of the other tools on the Intel site?
https://software.intel.com/en-us/intel-vtune-amplifier-xe

Not sure about cost or anything. VTune is a great tool, but for me it does have some learning curve to understand the output.

A Xeon core of this type should be able to forward packets nicely at 10G with 64-byte frames. Maybe just do the normal Rx and then send the packet back out like a dumb forwarder, without doing all of the processing. Are the NIC(s) and cores on the same socket, if you have a multi-socket system? Just shooting in the dark here.

Also, did you try the l2fwd or l3fwd example to see if that app can get to 10G?

> 
> I tried the method suggested in ip_fragmentation example. I tried several values of PREFETCH_OFFSET -- 3 to 16, but none helped boost throughput.
> 
> Here is my CPU info:
> 
> Model name:            Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
> Architecture:          x86_64
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              15360K
> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
> 
> Just to provide some more context, I isolate the CPU core used from the kernel for fast-path, hence this core is fully dedicated to the fast-path pipeline.
> 
> The only time when the performance bumps from 7.7G to ~8.4G (still not close to 10G/100%) is when I add flags such as MEMPOOL_F_NO_SPREAD or MEMPOOL_F_NO_CACHE_ALIGN.
> 
> Thanks,
> Arvind
> 
> ---------- Forwarded message ---------
> From: Wiles, Keith <keith.wiles@intel.com>
> Date: Tue, Sep 11, 2018 at 9:20 AM
> Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
> To: Arvind Narayanan <webguru2688@gmail.com>
> Cc: users@dpdk.org <users@dpdk.org>
> 
> 
> 
> 
> > On Sep 11, 2018, at 3:15 AM, Arvind Narayanan <webguru2688@gmail.com> wrote:
> > 
> > Hi,
> > 
> > I am trying to write a DPDK application and finding it difficult to achieve
> > line rate on a 10G NIC. I feel this has something to do with CPU caches and
> > related optimizations, and would be grateful if someone could point me in
> > the right direction.
> > 
> > I wrap every rte_mbuf into my own structure say, my_packet. Here is
> > my_packet's structure declaration:
> > 
> > ```
> > struct my_packet {
> >     struct rte_mbuf *m;
> >     uint16_t tag1;
> >     uint16_t tag2;
> > };
> > ```
> 
> The only problem you have created is having to pull in another cache line by having to access the my_packet structure. The mbuf is highly optimized to limit the number of cache lines required to be pulled into cache for an mbuf. The mbuf structure is split between RX and TX: when doing TX you touch one of the two cache lines the mbuf is contained in, and on RX you touch the other cache line. At least that is the reason for the order of the members in the mbuf.
> 
> For the most part, accessing a packet of data takes about 2-3 cache lines to load into memory. Getting the prefetches issued far enough in advance to get the cache lines into the top-level cache is hard to do. In one case, when I removed the prefetches the performance increased, not decreased. :-(
> 
> Sounds like you are hitting this problem of now loading 4 cache lines, and this causes the CPU to stall. One method is to collect the packets in a list, then prefetch a number of cache lines in advance, then start processing the first packet of data. In some cases I have seen that prefetching 3 packets' worth of cache lines helps. YMMV
> 
> You did not list the processor you are using, but Intel Xeon processors have a limit to the number of outstanding prefetches you can have at a time; I think 8 is the number. Also, VPP at fd.io uses this method too, in order to prefetch the data and not allow the CPU to stall.
> 
> Look at examples/ip_fragmentation/main.c and the code that prefetches mbufs and data structures. I hope that one helps. 
> 
> > 
> > During initialization, I reserve a mempool of type struct my_packet with
> > 8192 elements. Whenever I form my_packet, I get them in bursts, similarly
> > for freeing I put them back into pool as bursts.
> > 
> > So there is a loop in the datapath which touches each of these my_packet's
> > tag to make a decision.
> > 
> > ```
> > for (i = 0; i < pkt_count; i++) {
> >    if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1), (void
> > **)&val[i]) < 0) {
> >    }
> > }
> > ```
> > 
> > Based on my tests, &(my_packet->tag1) is the cause for not letting me
> > achieve line rate in the fast path. I say this because if I hardcode the
> > tag1's value, I am able to achieve line rate. As a workaround, I tried to
> > use rte_prefetch0() and rte_prefetch_non_temporal() to prefetch 2 to 8
> > my_packet(s) from my_packet[] array, but nothing seems to boost the
> > throughput.
> > 
> > I tried to play with the flags in rte_mempool_create() function call:
> > -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G
> > -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
> > settles to ~8.5G after 20 or 30 seconds.
> > -- NO FLAG gives 7.7G
> > 
> > I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
> > 
> > Any help or pointers are highly appreciated.
> > 
> > Thanks,
> > Arvind
> 
> Regards,
> Keith
> 

Regards,
Keith

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 16:52     ` Wiles, Keith
@ 2018-09-11 17:18       ` Arvind Narayanan
  2018-09-11 18:07         ` Stephen Hemminger
  0 siblings, 1 reply; 11+ messages in thread
From: Arvind Narayanan @ 2018-09-11 17:18 UTC (permalink / raw)
  To: keith.wiles; +Cc: users

If I don't do any processing, I easily get 10G. It is only when I access
the tag that the throughput drops.
What confuses me is that if I use the following snippet, it works at line rate.

```
int temp_key = 1; // declared outside of the for loop

for (i = 0; i < pkt_count; i++) {
    if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
    }
}
```

But as soon as I replace `temp_key` with `my_packet->tag1`, I experience a
fall in throughput (which in a way confirms the issue is due to cache
misses).

__rte_cache_aligned may not be required, as the mempool from which I pull
the pre-allocated structs already cache-aligns them. But let me try adding
it to the struct as well, to make sure.

Yes, I did come across VTune, and there is a free trial period which I
guess would help me confirm that it is due to cache misses.

l2fwd and l3fwd easily achieve 10G. :(

Thanks,
Arvind

On Tue, Sep 11, 2018 at 11:52 AM Wiles, Keith <keith.wiles@intel.com> wrote:

>
>
> > On Sep 11, 2018, at 10:42 AM, Arvind Narayanan <webguru2688@gmail.com> wrote:
> >
> > Keith, thanks!
> >
> > My structure's size is 24 bytes, and for that particular for-loop, I do
> > not dereference the rte_mbuf pointer, hence my understanding is it
> > shouldn't require loading 4 cache lines, correct?
> > I am only looking at the tags to make a decision and then simply move
> ahead on the fast-path.
>
> The mbufs do get accessed by the Rx path, so a cache line is pulled. If
> you are not accessing the mbuf structure or data, then I am not sure what
> the problem is. Is the my_packet structure starting on a cache line, and
> have you tried putting each structure on its own cache line using
> __rte_cache_aligned?
>
> Have you used VTune or some of the other tools on the Intel site?
> https://software.intel.com/en-us/intel-vtune-amplifier-xe
>
> Not sure about cost or anything. VTune is a great tool, but for me it does
> have some learning curve to understand the output.
>
> A Xeon core of this type should be able to forward packets nicely at 10G
> with 64-byte frames. Maybe just do the normal Rx and then send the packet
> back out like a dumb forwarder, without doing all of the processing. Are the
> NIC(s) and cores on the same socket, if you have a multi-socket system?
> Just shooting in the dark here.
>
> Also, did you try the l2fwd or l3fwd example to see if that app can get to 10G?
>
> >
> > I tried the method suggested in the ip_fragmentation example. I tried
> > several values of PREFETCH_OFFSET -- 3 to 16, but none helped boost
> > throughput.
> >
> > Here is my CPU info:
> >
> > Model name:            Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
> > Architecture:          x86_64
> > L1d cache:             32K
> > L1i cache:             32K
> > L2 cache:              256K
> > L3 cache:              15360K
> > Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
> nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
> ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic
> popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi
> flexpriority ept vpid xsaveopt dtherm ida arat pln pts
> >
> > Just to provide some more context, I isolate the CPU core used from the
> kernel for fast-path, hence this core is fully dedicated to the fast-path
> pipeline.
> >
> > The only time when the performance bumps from 7.7G to ~8.4G (still not
> close to 10G/100%) is when I add flags such as MEMPOOL_F_NO_SPREAD or
> MEMPOOL_F_NO_CACHE_ALIGN.
> >
> > Thanks,
> > Arvind
> >
> > ---------- Forwarded message ---------
> > From: Wiles, Keith <keith.wiles@intel.com>
> > Date: Tue, Sep 11, 2018 at 9:20 AM
> > Subject: Re: [dpdk-users] How to use software prefetching for custom
> structures to increase throughput on the fast path
> > To: Arvind Narayanan <webguru2688@gmail.com>
> > Cc: users@dpdk.org <users@dpdk.org>
> >
> >
> >
> >
> > > On Sep 11, 2018, at 3:15 AM, Arvind Narayanan <webguru2688@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I am trying to write a DPDK application and finding it difficult to
> > > achieve line rate on a 10G NIC. I feel this has something to do with
> > > CPU caches and related optimizations, and would be grateful if someone
> > > could point me in the right direction.
> > >
> > > I wrap every rte_mbuf into my own structure say, my_packet. Here is
> > > my_packet's structure declaration:
> > >
> > > ```
> > > struct my_packet {
> > >     struct rte_mbuf *m;
> > >     uint16_t tag1;
> > >     uint16_t tag2;
> > > };
> > > ```
> >
> > The only problem you have created is having to pull in another cache
> > line by having to access the my_packet structure. The mbuf is highly
> > optimized to limit the number of cache lines required to be pulled into
> > cache for an mbuf. The mbuf structure is split between RX and TX: when
> > doing TX you touch one of the two cache lines the mbuf is contained in,
> > and on RX you touch the other cache line. At least that is the reason
> > for the order of the members in the mbuf.
> >
> > For the most part, accessing a packet of data takes about 2-3 cache
> > lines to load into memory. Getting the prefetches issued far enough in
> > advance to get the cache lines into the top-level cache is hard to do.
> > In one case, when I removed the prefetches the performance increased,
> > not decreased. :-(
> >
> > Sounds like you are hitting this problem of now loading 4 cache lines,
> > and this causes the CPU to stall. One method is to collect the packets
> > in a list, then prefetch a number of cache lines in advance, then start
> > processing the first packet of data. In some cases I have seen that
> > prefetching 3 packets' worth of cache lines helps. YMMV
> >
> > You did not list the processor you are using, but Intel Xeon processors
> > have a limit to the number of outstanding prefetches you can have at a
> > time; I think 8 is the number. Also, VPP at fd.io uses this method too
> > in order to prefetch the data and not allow the CPU to stall.
> >
> > Look at examples/ip_fragmentation/main.c and the code that prefetches
> > mbufs and data structures. I hope that one helps.
> >
> > >
> > > During initialization, I reserve a mempool of type struct my_packet
> with
> > > 8192 elements. Whenever I form my_packet, I get them in bursts,
> similarly
> > > for freeing I put them back into pool as bursts.
> > >
> > > So there is a loop in the datapath which touches each of these
> my_packet's
> > > tag to make a decision.
> > >
> > > ```
> > > for (i = 0; i < pkt_count; i++) {
> > >    if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1), (void
> > > **)&val[i]) < 0) {
> > >    }
> > > }
> > > ```
> > >
> > > Based on my tests, &(my_packet->tag1) is the cause for not letting me
> > > achieve line rate in the fast path. I say this because if I hardcode
> > > the tag1's value, I am able to achieve line rate. As a workaround, I
> > > tried to use rte_prefetch0() and rte_prefetch_non_temporal() to
> > > prefetch 2 to 8 my_packet(s) from my_packet[] array, but nothing seems
> > > to boost the throughput.
> > >
> > > I tried to play with the flags in rte_mempool_create() function call:
> > > -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G
> > > -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
> > > settles to ~8.5G after 20 or 30 seconds.
> > > -- NO FLAG gives 7.7G
> > >
> > > I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
> > >
> > > Any help or pointers are highly appreciated.
> > >
> > > Thanks,
> > > Arvind
> >
> > Regards,
> > Keith
> >
>
> Regards,
> Keith
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 17:18       ` Arvind Narayanan
@ 2018-09-11 18:07         ` Stephen Hemminger
  2018-09-11 18:39           ` Arvind Narayanan
  2018-09-11 19:36           ` Pierre Laurent
  0 siblings, 2 replies; 11+ messages in thread
From: Stephen Hemminger @ 2018-09-11 18:07 UTC (permalink / raw)
  To: Arvind Narayanan; +Cc: keith.wiles, users

On Tue, 11 Sep 2018 12:18:42 -0500
Arvind Narayanan <webguru2688@gmail.com> wrote:

> If I don't do any processing, I easily get 10G. It is only when I access
> the tag that the throughput drops.
> What confuses me is if I use the following snippet, it works at line rate.
> 
> ```
> int temp_key = 1; // declared outside of the for loop
> 
> for (i = 0; i < pkt_count; i++) {
>     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
>     }
> }
> ```
> 
> But as soon as I replace `temp_key` with `my_packet->tag1`, I experience
> fall in throughput (which in a way confirms the issue is due to cache
> misses).

Your packet data is not in cache.
Doing prefetch can help but it is very timing sensitive. If prefetch is done
before data is available it won't help. And if prefetch is done just before
data is used then there aren't enough cycles to get it from memory to the cache.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 18:07         ` Stephen Hemminger
@ 2018-09-11 18:39           ` Arvind Narayanan
  2018-09-11 19:12             ` Stephen Hemminger
  2018-09-12  8:22             ` Van Haaren, Harry
  2018-09-11 19:36           ` Pierre Laurent
  1 sibling, 2 replies; 11+ messages in thread
From: Arvind Narayanan @ 2018-09-11 18:39 UTC (permalink / raw)
  To: stephen; +Cc: keith.wiles, users

Stephen, thanks!

That is it! Not sure if there is any workaround.

So, essentially, what I am doing is: core 0 gets a burst of my_packet(s)
from its pre-allocated mempool and then (bulk) enqueues them into a
rte_ring. Core 1 then (bulk) dequeues from this ring, and when it accesses
the data pointed to by the ring's element (i.e. my_packet->tag1), this
memory access latency issue is seen. I cannot advance the prefetch any
earlier. Is there any clever workaround (or hack) to overcome this issue,
other than using the same core for all the functions? E.g., can I prefetch
the packets on core 0 into core 1's cache (could be a dumb question!)?
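
For clarity, the handoff is roughly the following (a sketch; "ring",
"lookup()" and the burst size of 32 are placeholders, not my exact code):

```
#include <rte_ring.h>

/* Core 0: tag the wrappers, then hand them to core 1 in bulk. */
struct my_packet *burst[32];
/* ... get burst[] from the mempool and set tag1/tag2 ... */
rte_ring_enqueue_bulk(ring, (void **)burst, 32, NULL);

/* Core 1: dequeue in bulk; the first read of rx[i]->tag1 below is where
 * the cache-miss latency shows up. */
struct my_packet *rx[32];
unsigned int i;

if (rte_ring_dequeue_bulk(ring, (void **)rx, 32, NULL) == 32) {
    for (i = 0; i < 32; i++)
        lookup(rx[i]->tag1); /* stands in for the rte_hash lookup */
}
```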

Thanks,
Arvind

On Tue, Sep 11, 2018 at 1:07 PM Stephen Hemminger <stephen@networkplumber.org> wrote:

> On Tue, 11 Sep 2018 12:18:42 -0500
> Arvind Narayanan <webguru2688@gmail.com> wrote:
>
> > If I don't do any processing, I easily get 10G. It is only when I access
> > the tag that the throughput drops.
> > What confuses me is if I use the following snippet, it works at line
> rate.
> >
> > ```
> > int temp_key = 1; // declared outside of the for loop
> >
> > for (i = 0; i < pkt_count; i++) {
> >     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) <
> 0) {
> >     }
> > }
> > ```
> >
> > But as soon as I replace `temp_key` with `my_packet->tag1`, I experience
> > fall in throughput (which in a way confirms the issue is due to cache
> > misses).
>
> Your packet data is not in cache.
> Doing prefetch can help but it is very timing sensitive. If prefetch is
> done before data is available it won't help. And if prefetch is done just
> before data is used then there aren't enough cycles to get it from memory
> to the cache.
>
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 18:39           ` Arvind Narayanan
@ 2018-09-11 19:12             ` Stephen Hemminger
  2018-09-12  8:22             ` Van Haaren, Harry
  1 sibling, 0 replies; 11+ messages in thread
From: Stephen Hemminger @ 2018-09-11 19:12 UTC (permalink / raw)
  To: Arvind Narayanan; +Cc: keith.wiles, users

On Tue, 11 Sep 2018 13:39:24 -0500
Arvind Narayanan <webguru2688@gmail.com> wrote:

> Stephen, thanks!
> 
> That is it! Not sure if there is any workaround.
> 
> So, essentially, what I am doing is -- core 0 gets a burst of my_packet(s)
> from its pre-allocated mempool, and then (bulk) enqueues it into a
> rte_ring. Core 1 then (bulk) dequeues from this ring, and when it accesses
> the data pointed to by the ring's element (i.e. my_packet->tag1), this
> memory access latency issue is seen. I cannot advance the prefetch any
> earlier. Is there any clever workaround (or hack) to overcome this issue,
> other than using the same core for all the functions? E.g., can I prefetch
> the packets on core 0 into core 1's cache (could be a dumb question!)?
> 
> Thanks,
> Arvind
> 
> On Tue, Sep 11, 2018 at 1:07 PM Stephen Hemminger <stephen@networkplumber.org> wrote:  
> 
> > On Tue, 11 Sep 2018 12:18:42 -0500
> > Arvind Narayanan <webguru2688@gmail.com> wrote:
> >  
> > > If I don't do any processing, I easily get 10G. It is only when I access
> > > the tag that the throughput drops.
> > > What confuses me is if I use the following snippet, it works at line  
> > rate.  
> > >
> > > ```
> > > int temp_key = 1; // declared outside of the for loop
> > >
> > > for (i = 0; i < pkt_count; i++) {
> > >     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) <  
> > 0) {  
> > >     }
> > > }
> > > ```
> > >
> > > But as soon as I replace `temp_key` with `my_packet->tag1`, I experience
> > > fall in throughput (which in a way confirms the issue is due to cache
> > > misses).  
> >
> > Your packet data is not in cache.
> > Doing prefetch can help but it is very timing sensitive. If prefetch is
> > done before data is available it won't help. And if prefetch is done just
> > before data is used then there aren't enough cycles to get it from memory
> > to the cache.
> >
> >
> >  

In my experience, if you want performance then don't pass packets between cores.
It is slightly less bad if the core that does the passing does not access the
packet. It is really bad if the handling core writes the packet.

This is especially true for cores with greater cache distance (NUMA). If you
have to pass packets, then use cores which are hyper-thread siblings.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 18:07         ` Stephen Hemminger
  2018-09-11 18:39           ` Arvind Narayanan
@ 2018-09-11 19:36           ` Pierre Laurent
  2018-09-11 21:49             ` Arvind Narayanan
  1 sibling, 1 reply; 11+ messages in thread
From: Pierre Laurent @ 2018-09-11 19:36 UTC (permalink / raw)
  To: users, webguru2688


Can I suggest a few steps for investigating more?

First, verify that the L1 cache is really the suspect. This can be
done simply with the perf utility and the counter L1-dcache-load-misses.
The simplest tool is "perf", which is part of the linux-tools packages:

$ apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`

$ sudo perf stat -d -d -d ./build/rxtx
EAL: Detected 12 lcore(s)
....

^C

  Performance counter stats for './build/rxtx':

       1413.787490      task-clock (msec)         #    0.923 CPUs utilized
                 18      context-switches          #    0.013 K/sec
                  4      cpu-migrations            #    0.003 K/sec
                238      page-faults               #    0.168 K/sec
      4,436,904,124      cycles                    #    3.138 GHz                      (32.67%)
      3,888,094,815      stalled-cycles-frontend   #   87.63% frontend cycles idle     (32.94%)
        237,378,065      instructions              #    0.05  insn per cycle
                                                   #   16.38  stalled cycles per insn  (39.73%)
         76,863,834      branches                  #   54.367 M/sec                    (40.01%)
            101,550      branch-misses             #    0.13% of all branches          (40.30%)
         94,805,298      L1-dcache-loads           #   67.058 M/sec                    (39.77%)
        263,530,291      L1-dcache-load-misses     #  277.97% of all L1-dcache hits    (13.77%)
            425,934      LLC-loads                 #    0.301 M/sec                    (13.60%)
            181,295      LLC-load-misses           #   42.56% of all LL-cache hits     (20.21%)
    <not supported>      L1-icache-loads
            775,365      L1-icache-load-misses                                         (26.71%)
         70,580,827      dTLB-loads                #   49.923 M/sec                    (25.46%)
              2,474      dTLB-load-misses          #    0.00% of all dTLB cache hits   (13.01%)
                277      iTLB-loads                #    0.196 K/sec                    (13.01%)
                994      iTLB-load-misses          #  358.84% of all iTLB cache hits   (19.52%)
    <not supported>      L1-dcache-prefetches
              7,204      L1-dcache-prefetch-misses #    0.005 M/sec                    (26.03%)

        1.531809863 seconds time elapsed


One of the common mistakes is to have excessively large tx and rx
queues, which in turn helps trigger excessively large bursts. Your L1
cache is 32K, that is, 512 cache lines. L1 cache is not elastic; 512
cache lines is not much. If the bursts you are processing happen to
be more than approx 128 buffers, then you will be thrashing the cache
when running your loop. I would note that you use a pool of 8192 of
your buffers, and if you use them round-robin, then you have a perfect
recipe for cache thrashing. If so, then prefetch would help.

rte_hash_lookup looks into cache lines too (at least 3 per successful
invocation). If you use the same key, then rte_hash_lookup will look into
the same cache lines. If your keys are randomly distributed, then it is
another recipe for cache thrashing.


It is not clear from your descriptions if the core which reads the
bursts from the dpdk PMD is the same as the core which does the
processing. If a core touches your buffers (e.g. tag1), and then you pass
the buffer to another core, then you get LLC coherency overheads, which
would also trigger LLC-load-misses (which you can detect through the perf
output above).


It seems you have this type of processor (codename Sandy Bridge, 6 cores,
hyperthreading enabled):

https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI

Can you double check that your application runs with the right core
assignment? Since hyperthreading is enabled, you should not use core 0
(plenty of functions for the linux kernel run on core 0) nor core 6 (which
is the same hardware as core 0), and make sure the hyperthread
corresponding to the core you are running on is not used either. You can
get the CPU<-->Core assignment with the lscpu tool:

$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0
6,0,0,0,,0,0,0,0
7,1,0,0,,1,1,1,0
8,2,0,0,,2,2,2,0
9,3,0,0,,3,3,3,0
10,4,0,0,,4,4,4,0
11,5,0,0,,5,5,5,0

If you do not need hyperthreading, and if L1 cache is your bottleneck, 
you might need to disable hyperthreading and get 64K bytes L1 cache per 
core. If you really need hyperthreading, then use less cache in your 
code by better tuning the buffer pool sizes.


SW prefetch is quite difficult to use efficiently. There are 4 different
hardware prefetchers with different algorithms (adjacent cache lines,
stride access ...) where the use of the prefetch instruction is unnecessary,
and there is a hw limit of about 8 pending L1 data cache misses
(sometimes documented as 5, sometimes documented as 10 ..). This creates a
serious burden of software complexity to abide by the hw rules.

https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
Just verify the hardware prefetchers are all enabled through MSR 0x1A4.
Some BIOS setups might have configured this differently.
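
One way to check is a small reader like the sketch below (it assumes the
msr kernel module is loaded and root privileges; bits 0-3 of MSR 0x1A4,
when set, disable the four prefetchers):

```
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    uint64_t v;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);

    if (fd < 0 || pread(fd, &v, sizeof(v), 0x1a4) != (ssize_t)sizeof(v)) {
        perror("msr");
        return 1;
    }
    /* 0x0 means all four hardware prefetchers are enabled. */
    printf("MSR 0x1a4 = 0x%" PRIx64 "\n", v);
    close(fd);
    return 0;
}
```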




On 11/09/18 19:07, Stephen Hemminger wrote:
> On Tue, 11 Sep 2018 12:18:42 -0500
> Arvind Narayanan <webguru2688@gmail.com> wrote:
>
>> If I don't do any processing, I easily get 10G. It is only when I access
>> the tag that the throughput drops.
>> What confuses me is if I use the following snippet, it works at line rate.
>>
>> ```
>> int temp_key = 1; // declared outside of the for loop
>>
>> for (i = 0; i < pkt_count; i++) {
>>      if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
>>      }
>> }
>> ```
>>
>> But as soon as I replace `temp_key` with `my_packet->tag1`, I experience
>> fall in throughput (which in a way confirms the issue is due to cache
>> misses).
> Your packet data is not in cache.
> Doing prefetch can help but it is very timing sensitive. If prefetch is done
> before data is available it won't help. And if prefetch is done just before
> data is used then there aren't enough cycles to get it from memory to the cache.
>
>




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 19:36           ` Pierre Laurent
@ 2018-09-11 21:49             ` Arvind Narayanan
  0 siblings, 0 replies; 11+ messages in thread
From: Arvind Narayanan @ 2018-09-11 21:49 UTC (permalink / raw)
  To: pierre; +Cc: users

Stephen and Pierre, thanks!
Pierre, all points noted.

As per Pierre's suggestions, I ran perf stat on the application. Here
are the results.

Using pktgen's default configuration, I send 100M packets on a 10G line.

This is when I use my_packet->tag1 for the lookup, where the throughput
drops to 8.4G/10G:

 Performance counter stats for './build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3':

      47453.031698      task-clock (msec)         #    1.830 CPUs utilized
                77      context-switches          #    0.002 K/sec
                 6      cpu-migrations            #    0.000 K/sec
               868      page-faults               #    0.018 K/sec
   113,357,285,372      cycles                    #    2.389 GHz                      (49.95%)
    53,324,793,523      stalled-cycles-frontend   #   47.04% frontend cycles idle     (49.95%)
    27,161,539,189      stalled-cycles-backend    #   23.96% backend cycles idle      (49.96%)
   191,560,395,309      instructions              #    1.69  insn per cycle
                                                  #    0.28  stalled cycles per insn  (56.22%)
    36,872,293,868      branches                  #  777.027 M/sec                    (56.23%)
        13,801,124      branch-misses             #    0.04% of all branches          (56.24%)
    67,524,214,383      L1-dcache-loads           # 1422.969 M/sec                    (56.24%)
     1,015,922,260      L1-dcache-load-misses     #    1.50% of all L1-dcache hits    (56.26%)
       619,670,574      LLC-loads                 #   13.059 M/sec                    (56.29%)
            82,917      LLC-load-misses           #    0.01% of all LL-cache hits     (56.31%)
   <not supported>      L1-icache-loads
         2,059,915      L1-icache-load-misses                                         (56.30%)
    67,641,851,208      dTLB-loads                # 1425.448 M/sec                    (56.29%)
           151,760      dTLB-load-misses          #    0.00% of all dTLB cache hits   (50.01%)
               904      iTLB-loads                #    0.019 K/sec                    (50.01%)
            10,309      iTLB-load-misses          # 1140.38% of all iTLB cache hits   (50.00%)
   <not supported>      L1-dcache-prefetches
       528,633,571      L1-dcache-prefetch-misses #   11.140 M/sec                    (49.97%)

      25.929843368 seconds time elapsed




This is when I use a temp_key approach:

 Performance counter stats for './build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3':

      42614.775381      task-clock (msec)         #    1.729 CPUs utilized
                71      context-switches          #    0.002 K/sec
                 6      cpu-migrations            #    0.000 K/sec
               869      page-faults               #    0.020 K/sec
    99,422,031,536      cycles                    #    2.333 GHz                      (49.89%)
    43,615,501,744      stalled-cycles-frontend   #   43.87% frontend cycles idle     (49.91%)
    21,325,495,955      stalled-cycles-backend    #   21.45% backend cycles idle      (49.95%)
   170,398,414,529      instructions              #    1.71  insn per cycle
                                                  #    0.26  stalled cycles per insn  (56.22%)
    32,543,342,205      branches                  #  763.663 M/sec                    (56.26%)
        52,276,245      branch-misses             #    0.16% of all branches          (56.30%)
    58,855,845,003      L1-dcache-loads           # 1381.114 M/sec                    (56.33%)
     1,046,059,603      L1-dcache-load-misses     #    1.78% of all L1-dcache hits    (56.34%)
       598,557,493      LLC-loads                 #   14.046 M/sec                    (56.35%)
            84,048      LLC-load-misses           #    0.01% of all LL-cache hits     (56.35%)
   <not supported>      L1-icache-loads
         2,150,306      L1-icache-load-misses                                         (56.33%)
    58,942,694,476      dTLB-loads                # 1383.152 M/sec                    (56.29%)
           147,013      dTLB-load-misses          #    0.00% of all dTLB cache hits   (49.97%)
            22,392      iTLB-loads                #    0.525 K/sec                    (49.93%)
             5,839      iTLB-load-misses          #   26.08% of all iTLB cache hits   (49.90%)
   <not supported>      L1-dcache-prefetches
       533,602,543      L1-dcache-prefetch-misses #   12.522 M/sec                    (49.89%)

      24.647230934 seconds time elapsed


Not sure if I am understanding it correctly, but there are a lot of
iTLB-load-misses in the lower-throughput perf stat output.

> One of the common mistakes is to have excessively large tx and rx queues,
> which in turn helps trigger excessively large bursts. Your L1 cache is 32K,
> that is, 512 cache lines. L1 cache is not elastic, 512 cache lines is
> not much .....  If the bursts you are processing happen to be more than
> approx 128 buffers, then you will be thrashing the cache when running your
> loop. I would notice that you use a pool of 8192 of your buffers, and if
> you use them round-robin, then you have a perfect recipe for cache
> thrashing. If so, then prefetch would help.
>

You raised a very good point here, and I think DPDK's writing efficient code
page <https://doc.dpdk.org/guides/prog_guide/writing_efficient_code.html>
could maybe have a section on this topic to help with understanding how this
sizing helps (or maybe I missed it, if DPDK already has details about how
to choose RX and TX ring sizes). Without knowing the compute load of each
part of the data path, people like me just assign random 2^n values (I
blame myself here though).

rte_mbuf pool size is 4096
rx_ring and tx_ring sizes are 1024
rings used to communicate between cores are 8192
my_packet mempool is 8192
MAX_BURST_SIZE for all the loops in the DPDK application is set to 32

> It is not clear from your descriptions if the core which reads the bursts
> from the dpdk PMD is the same as the core which does the processing. If a
> core touches your buffers (e.g. tag1), and then you pass the buffer to
> another core, then you get LLC coherency overheads, which would also
> trigger LLC-load-misses (which you can detect through the perf output above)
>

I isolate CPUs 1,2,3,4,5 from the kernel, thus leaving 0 for kernel
operations. Core 2 (which runs an infinite RX/TX loop) reads the packets
from the DPDK PMD and sets tag1 values, while Core 4 looks up the rte_hash
table using tag1 as the key and proceeds further.

>
> It seems you have this type of processor (codename sandybridge, 6 cores,
> hyperthread is enabled)
>
>
> https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
>
> Can you double check that your application runs with the right core
> assignment? Since hyperthreading is enabled, you should not use core 0 (plenty
> of functions for the linux kernel run on core 0) nor core 6 (which is the same
> hardware as core 0), and make sure the hyperthread corresponding to the
> core you are running on is not used either. You can get the CPU<-->Core
> assignment with the lscpu tool
>
I had had HT disabled for all the experiments.

Here is the output of lscpu -p

# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0

Thanks,
Arvind

On Tue, Sep 11, 2018 at 2:36 PM Pierre Laurent <pierre@emutex.com> wrote:

>
> Can I suggest a few steps for investigating more ?
>
> First, verify that the L1 cache is really the suspect. This can be
> done simply with the perf utility and the counter L1-dcache-load-misses.
> The simplest tool is "perf", which is part of the linux-tools packages:
>
> $ apt-get install linux-tools-common linux-tools-generic
> linux-tools-`uname -r`
>
> $ sudo perf stat -d -d -d ./build/rxtx
> EAL: Detected 12 lcore(s)
> ....
>
> ^C
>
>  Performance counter stats for './build/rxtx':
>
>        1413.787490      task-clock (msec)         #    0.923 CPUs utilized
>                 18      context-switches          #    0.013 K/sec
>                  4      cpu-migrations            #    0.003 K/sec
>                238      page-faults               #    0.168 K/sec
>      4,436,904,124      cycles                    #    3.138 GHz                      (32.67%)
>      3,888,094,815      stalled-cycles-frontend   #   87.63% frontend cycles idle     (32.94%)
>        237,378,065      instructions              #    0.05  insn per cycle
>                                                   #   16.38  stalled cycles per insn  (39.73%)
>         76,863,834      branches                  #   54.367 M/sec                    (40.01%)
>            101,550      branch-misses             #    0.13% of all branches          (40.30%)
>         94,805,298      L1-dcache-loads           #   67.058 M/sec                    (39.77%)
>        263,530,291      L1-dcache-load-misses     #  277.97% of all L1-dcache hits    (13.77%)
>            425,934      LLC-loads                 #    0.301 M/sec                    (13.60%)
>            181,295      LLC-load-misses           #   42.56% of all LL-cache hits     (20.21%)
>    <not supported>      L1-icache-loads
>            775,365      L1-icache-load-misses                                         (26.71%)
>         70,580,827      dTLB-loads                #   49.923 M/sec                    (25.46%)
>              2,474      dTLB-load-misses          #    0.00% of all dTLB cache hits   (13.01%)
>                277      iTLB-loads                #    0.196 K/sec                    (13.01%)
>                994      iTLB-load-misses          #  358.84% of all iTLB cache hits   (19.52%)
>    <not supported>      L1-dcache-prefetches
>              7,204      L1-dcache-prefetch-misses #    0.005 M/sec                    (26.03%)
>
>        1.531809863 seconds time elapsed
>
>
> One of the common mistakes is to have excessively large tx and rx queues,
> which in turn helps trigger excessively large bursts. Your L1 cache is 32K,
> that is, 512 cache lines. L1 cache is not elastic, 512 cache lines is
> not much .....  If the bursts you are processing happen to be more than
> approx 128 buffers, then you will be thrashing the cache when running your
> loop. I would notice that you use a pool of 8192 of your buffers, and if
> you use them round-robin, then you have a perfect recipe for cache
> thrashing. If so, then prefetch would help.
>
> rte_hash_lookup looks into cache lines too (at least 3 per successful
> invocation). If you use the same key, then rte_hash_lookup will look into the
> same cache lines. If your keys are randomly distributed, then it is another
> recipe for cache thrashing.
>
>
> It is not clear from your descriptions if the core which reads the bursts
> from the dpdk PMD is the same as the core which does the processing. If a
> core touches your buffers (e.g. tag1), and then you pass the buffer to
> another core, then you get LLC coherency overheads, which would also
> trigger LLC-load-misses (which you can detect through the perf output above)
>
>
> It seems you have this type of processor (codename sandybridge, 6 cores,
> hyperthread is enabled)
>
>
> https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
>
> Can you double check that your application runs with the right core
> assignment? Since hyperthreading is enabled, you should not use core 0 (plenty
> of functions for the linux kernel run on core 0) nor core 6 (which is the same
> hardware as core 0), and make sure the hyperthread corresponding to the
> core you are running on is not used either. You can get the CPU<-->Core
> assignment with the lscpu tool
>
> $ lscpu -p
> # The following is the parsable format, which can be fed to other
> # programs. Each different item in every column has an unique ID
> # starting from zero.
> # CPU,Core,Socket,Node,,L1d,L1i,L2,L3
> 0,0,0,0,,0,0,0,0
> 1,1,0,0,,1,1,1,0
> 2,2,0,0,,2,2,2,0
> 3,3,0,0,,3,3,3,0
> 4,4,0,0,,4,4,4,0
> 5,5,0,0,,5,5,5,0
> 6,0,0,0,,0,0,0,0
> 7,1,0,0,,1,1,1,0
> 8,2,0,0,,2,2,2,0
> 9,3,0,0,,3,3,3,0
> 10,4,0,0,,4,4,4,0
> 11,5,0,0,,5,5,5,0
> If you do not need hyperthreading, and if L1 cache is your bottleneck, you
> might need to disable hyperthreading and get 64K bytes L1 cache per core.
> If you really need hyperthreading, then use less cache in your code by
> better tuning the buffer pool sizes.
>
>
> SW prefetch is quite difficult to use efficiently. There are 4 different
> hardware prefetchers with different algorithms (adjacent cache lines, stride
> access ...) where the use of the prefetch instruction is unnecessary, and there
> is a hw limit of about 8 pending L1 data cache misses (sometimes documented
> as 5, sometimes documented as 10 ..). This creates a serious burden of
> software complexity to abide by the hw rules.
>
>
> https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
> Just verify the hardware prefetchers are all enabled through MSR 0x1A4. Some
> BIOS setups might have configured this differently.
>
>
>
> On 11/09/18 19:07, Stephen Hemminger wrote:
>
> On Tue, 11 Sep 2018 12:18:42 -0500
> Arvind Narayanan <webguru2688@gmail.com> <webguru2688@gmail.com> wrote:
>
>
> If I don't do any processing, I easily get 10G. It is only when I access
> the tag that the throughput drops.
> What confuses me is if I use the following snippet, it works at line rate.
>
> ```
> int temp_key = 1; // declared outside of the for loop
>
> for (i = 0; i < pkt_count; i++) {
>     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
>     }
> }
> ```
>
> But as soon as I replace `temp_key` with `my_packet->tag1`, I experience
> fall in throughput (which in a way confirms the issue is due to cache
> misses).
>
>
> Your packet data is not in cache.
> Doing prefetch can help but it is very timing sensitive. If prefetch is done
> before data is available it won't help. And if prefetch is done just before
> data is used then there aren't enough cycles to get it from memory to the cache.
>
>
>
>
>
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
  2018-09-11 18:39           ` Arvind Narayanan
  2018-09-11 19:12             ` Stephen Hemminger
@ 2018-09-12  8:22             ` Van Haaren, Harry
  1 sibling, 0 replies; 11+ messages in thread
From: Van Haaren, Harry @ 2018-09-12  8:22 UTC (permalink / raw)
  To: Arvind Narayanan, stephen; +Cc: Wiles, Keith, users

> -----Original Message-----
> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Arvind Narayanan
> Sent: Tuesday, September 11, 2018 7:39 PM
> To: stephen@networkplumber.org
> Cc: Wiles, Keith <keith.wiles@intel.com>; users@dpdk.org
> Subject: Re: [dpdk-users] How to use software prefetching for custom
> structures to increase throughput on the fast path

<snip>

> So, essentially, what I am doing is -- core 0 gets a burst of my_packet(s)
> from its pre-allocated mempool, and then (bulk) enqueues it into a
> rte_ring. Core 1 then (bulk) dequeues from this ring, and when it accesses the
> data pointed to by the ring's element (i.e. my_packet->tag1)

You say "Bulk" here. Are you using "bulk" or "burst"?

Burst: http://doc.dpdk.org/api/rte__ring_8h.html#aff58e6a47ea3dca494dd0391d11b38ea
Bulk:  http://doc.dpdk.org/api/rte__ring_8h.html#ab8debfb458e927d559e7ce750048502d

Try using "burst" dequeue which will return the max number of packets available,
even if it is less than the size of the array you provided.  

Bulk will fail to dequeue anything unless your threshold of MAX was reached,
which means that likely you'll stall the consumer core waiting until MAX, and
then play catch-up again.
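
Something like this (a sketch; "ring", "objs", and MAX_BURST are assumed
names, not code from your application):

```
#include <rte_ring.h>

void *objs[MAX_BURST];
unsigned int n, m;

/* Burst: returns however many entries were available, up to MAX_BURST. */
n = rte_ring_dequeue_burst(ring, objs, MAX_BURST, NULL);

/* Bulk: all-or-nothing -- returns MAX_BURST on success, otherwise 0 and
 * dequeues nothing, which can leave the consumer spinning and then
 * playing catch-up. */
m = rte_ring_dequeue_bulk(ring, objs, MAX_BURST, NULL);
```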

<snip>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-09-12  8:22 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-11  8:15 [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path Arvind Narayanan
2018-09-11 14:20 ` Wiles, Keith
2018-09-11 15:42   ` Arvind Narayanan
2018-09-11 16:52     ` Wiles, Keith
2018-09-11 17:18       ` Arvind Narayanan
2018-09-11 18:07         ` Stephen Hemminger
2018-09-11 18:39           ` Arvind Narayanan
2018-09-11 19:12             ` Stephen Hemminger
2018-09-12  8:22             ` Van Haaren, Harry
2018-09-11 19:36           ` Pierre Laurent
2018-09-11 21:49             ` Arvind Narayanan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).