DPDK usage discussions
From: "Wiles, Keith" <keith.wiles@intel.com>
To: Arvind Narayanan <webguru2688@gmail.com>
Cc: "users@dpdk.org" <users@dpdk.org>
Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
Date: Tue, 11 Sep 2018 14:20:00 +0000
Message-ID: <E77998C2-0D7A-42CB-9F2C-49BD64EF4B0C@intel.com>
In-Reply-To: <CAHJJQSVG+ogufRTeCTBQPwkQaQ88DyM0DoD_FHckXaBi7f2dRg@mail.gmail.com>



> On Sep 11, 2018, at 3:15 AM, Arvind Narayanan <webguru2688@gmail.com> wrote:
> 
> Hi,
> 
> I am trying to write a DPDK application and am finding it difficult to
> achieve line rate on a 10G NIC. I feel this has something to do with CPU
> caches and related optimizations, and would be grateful if someone could
> point me in the right direction.
> 
> I wrap every rte_mbuf into my own structure say, my_packet. Here is
> my_packet's structure declaration:
> 
> ```
> struct my_packet {
>     struct rte_mbuf *m;
>     uint16_t tag1;
>     uint16_t tag2;
> };
> ```

The only problem you have created is having to pull in another cache line by accessing the my_packet structure. The mbuf is highly optimized to limit the number of cache lines that have to be pulled into cache. The mbuf structure is split between RX and TX: when doing TX you touch one of the two cache lines the mbuf occupies, and on RX you touch the other; at least that is the reason for the ordering of the members in the mbuf.
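
One way to avoid that extra cache line, if nothing else in your pipeline needs the field, is to carry the tags inside the mbuf itself, e.g. in udata64 (still present in 18.05), instead of in a separate wrapper structure. A minimal sketch, assuming nothing else uses udata64 (the helper names here are mine, not DPDK APIs):

```
#include <rte_mbuf.h>

/* Pack both tags into the mbuf user data field. Assumes nothing else
 * in the pipeline uses udata64. */
static inline void
pkt_set_tags(struct rte_mbuf *m, uint16_t tag1, uint16_t tag2)
{
	m->udata64 = ((uint64_t)tag2 << 16) | tag1;
}

static inline uint16_t
pkt_tag1(const struct rte_mbuf *m)
{
	return (uint16_t)(m->udata64 & 0xffff);
}
```

Note that udata64 lives in the mbuf's second cache line, so whether this wins over your wrapper depends on which cache lines your path already touches; it does at least avoid a completely separate per-packet allocation.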

For the most part, accessing a packet of data takes about 2-3 cache line loads. Getting the prefetches issued far enough in advance for the cache lines to be in the top-level cache when you need them is hard to do. In one case, removing the prefetches actually increased performance rather than decreasing it. :-(

Sounds like you are hitting this problem of now loading 4 cache lines, which causes the CPU to stall. One method is to collect the packets into a list, prefetch a number of cache lines in advance, and then start processing the first packet of data. In some cases I have seen that prefetching 3 packets' worth of cache lines helps. YMMV
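
Something like the following is the shape of that method (a rough sketch of the pattern, not code lifted from any example; handle_packet() is a stand-in for your per-packet work):

```
#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define PREFETCH_OFFSET 3	/* packets to stay ahead; tune for your CPU */

static void handle_packet(struct rte_mbuf *m);	/* your per-packet work */

static void
process_burst(struct rte_mbuf **pkts, int nb_pkts)
{
	int i;

	/* Prime the pipeline with the first few packets. */
	for (i = 0; i < PREFETCH_OFFSET && i < nb_pkts; i++)
		rte_prefetch0(rte_pktmbuf_mtod(pkts[i], void *));

	/* Prefetch packet i + PREFETCH_OFFSET while processing packet i. */
	for (i = 0; i < nb_pkts - PREFETCH_OFFSET; i++) {
		rte_prefetch0(rte_pktmbuf_mtod(pkts[i + PREFETCH_OFFSET],
				void *));
		handle_packet(pkts[i]);
	}

	/* Tail: nothing left to prefetch. */
	for (; i < nb_pkts; i++)
		handle_packet(pkts[i]);
}
```

The ip_fragmentation example mentioned below uses this same shape, prefetching 3 packets ahead.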

You did not list the processor you are using, but Intel Xeon processors have a limit on the number of outstanding prefetches you can have at a time; I think 8 is the number. Also, VPP at fd.io uses this method to prefetch the data and keep the CPU from stalling.

Look at examples/ip_fragmentation/main.c and the code there that prefetches mbufs and data structures. I hope that helps.

> 
> During initialization, I reserve a mempool of type struct my_packet with
> 8192 elements. Whenever I form a my_packet, I get them in bursts; similarly,
> when freeing, I put them back into the pool in bursts.
> 
> So there is a loop in the datapath which touches each my_packet's tag to
> make a decision.
> 
> ```
> for (i = 0; i < pkt_count; i++) {
>     if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
>                              (void **)&val[i]) < 0) {
>         /* lookup miss */
>     }
> }
> ```
> 
> Based on my tests, &(my_packet->tag1) is what keeps me from achieving line
> rate in the fast path. I say this because if I hardcode tag1's value, I am
> able to achieve line rate. As a workaround, I tried using rte_prefetch0()
> and rte_prefetch_non_temporal() to prefetch 2 to 8 my_packet(s) from the
> my_packet[] array, but nothing seems to boost the throughput.
> 
> I tried to play with the flags in the rte_mempool_create() function call:
> -- MEMPOOL_F_NO_SPREAD gives me 8.4 Gbps throughput out of 10 Gbps
> -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4 Gbps but then gradually
> settles to ~8.5 Gbps after 20 or 30 seconds.
> -- NO FLAG gives 7.7 Gbps
> 
> I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
> 
> Any help or pointers are highly appreciated.
> 
> Thanks,
> Arvind

Regards,
Keith
