From mboxrd@z Thu Jan 1 00:00:00 1970
From: Arvind Narayanan
Date: Tue, 11 Sep 2018 12:18:42 -0500
To: keith.wiles@intel.com
Cc: users@dpdk.org
Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

If I don't do any processing, I easily get 10G. It is only when I access the
tag that the throughput drops. What confuses me is that if I use the following
snippet, it works at line rate.

```
int temp_key = 1; // declared outside of the for loop

for (i = 0; i < pkt_count; i++) {
    if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
        /* lookup miss -- nothing to do in this test */
    }
}
```

But as soon as I replace `temp_key` with `my_packet->tag1`, I see a drop in
throughput (which in a way confirms the issue is due to cache misses).

__rte_cache_aligned may not be required, as the mempool from which I pull the
pre-allocated structs already cache-aligns them. But let me try adding it to
the struct as well to make sure.

Yes, I did come across VTune, and there is a free trial period which I guess
would help me confirm that it is due to the cache misses.

l2fwd and l3fwd easily achieve 10G. :(
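To make the __rte_cache_aligned change above concrete, this is roughly the
shape of what I have in mind, combined with the prefetch-ahead pattern from
examples/ip_fragmentation that I described. It is an untested, trimmed-down
sketch: the PREFETCH_OFFSET value, the helper name, and the parameter types
are placeholders; the real code pulls my_packet[] from the mempool in bursts
as described earlier.

```
#include <stdint.h>

#include <rte_hash.h>       /* rte_hash_lookup_data() */
#include <rte_mbuf.h>
#include <rte_memory.h>     /* __rte_cache_aligned */
#include <rte_prefetch.h>   /* rte_prefetch0() */

/* Give each wrapper its own cache line so two my_packet structs never
 * share one (the mempool may already guarantee this). */
struct my_packet {
    struct rte_mbuf *m;
    uint16_t tag1;
    uint16_t tag2;
} __rte_cache_aligned;

/* Arbitrary look-ahead; I have tried values from 3 to 16. */
#define PREFETCH_OFFSET 4

static inline void
lookup_tags(const struct rte_hash *rx_table, struct my_packet *my_packet[],
            void *val[], int pkt_count)
{
    int i;

    /* Issue prefetches for the first few tags before touching any of them. */
    for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
        rte_prefetch0(&my_packet[i]->tag1);

    /* Prefetch PREFETCH_OFFSET packets ahead, then do the current lookup. */
    for (i = 0; i < pkt_count; i++) {
        if (i + PREFETCH_OFFSET < pkt_count)
            rte_prefetch0(&my_packet[i + PREFETCH_OFFSET]->tag1);

        if (rte_hash_lookup_data(rx_table, &my_packet[i]->tag1,
                                 (void **)&val[i]) < 0) {
            /* lookup miss -- nothing to do here */
        }
    }
}
```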
Thanks,
Arvind

On Tue, Sep 11, 2018 at 11:52 AM Wiles, Keith wrote:
>
> > On Sep 11, 2018, at 10:42 AM, Arvind Narayanan wrote:
> >
> > Keith, thanks!
> >
> > My structure's size is 24 bytes, and for that particular for loop I do
> > not dereference the rte_mbuf pointer, hence my understanding is that it
> > wouldn't require loading 4 cache lines, correct?
> >
> > I am only looking at the tags to make a decision and then simply move
> > ahead on the fast path.
>
> The mbufs do get accessed by the Rx path, so a cacheline is pulled. If
> you are not accessing the mbuf structure or data, then I am not sure what
> the problem is. Is the my_packet structure starting on a cacheline, and
> have you tried putting each structure on a cacheline using
> __rte_cache_aligned?
>
> Have you used VTune or some of the other tools on the Intel site?
> https://software.intel.com/en-us/intel-vtune-amplifier-xe
>
> Not sure about cost or anything. VTune is a great tool, but for me it
> does have some learning curve to understand the output.
>
> A Xeon core of this type should be able to forward 64-byte frames nicely
> at 10G. Maybe just do the normal Rx and then send it back out like a dumb
> forwarder, without doing all of the processing. Are the NIC(s) and cores
> on the same socket, if you have a multi-socket system? Just shooting in
> the dark here.
>
> Also, did you try the l2fwd or l3fwd example and see if that app can get
> to 10G?
>
> > I tried the method suggested in the ip_fragmentation example. I tried
> > several values of PREFETCH_OFFSET -- 3 to 16 -- but none helped boost
> > throughput.
> >
> > Here is my CPU info:
> >
> > Model name:    Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
> > Architecture:  x86_64
> > L1d cache:     32K
> > L1i cache:     32K
> > L2 cache:      256K
> > L3 cache:      15360K
> > Flags:         fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> > cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
> > nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
> > xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
> > ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2
> > x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi
> > flexpriority ept vpid xsaveopt dtherm ida arat pln pts
> >
> > Just to provide some more context, I isolate the CPU core used for the
> > fast path from the kernel, hence this core is fully dedicated to the
> > fast-path pipeline.
> >
> > The only time the performance bumps from 7.7G to ~8.4G (still not
> > close to 10G/100%) is when I add flags such as MEMPOOL_F_NO_SPREAD or
> > MEMPOOL_F_NO_CACHE_ALIGN.
> >
> > Thanks,
> > Arvind
> >
> > ---------- Forwarded message ---------
> > From: Wiles, Keith
> > Date: Tue, Sep 11, 2018 at 9:20 AM
> > Subject: Re: [dpdk-users] How to use software prefetching for custom
> > structures to increase throughput on the fast path
> > To: Arvind Narayanan
> > Cc: users@dpdk.org
> >
> > > On Sep 11, 2018, at 3:15 AM, Arvind Narayanan wrote:
> > >
> > > Hi,
> > >
> > > I am trying to write a DPDK application and am finding it difficult
> > > to achieve line rate on a 10G NIC. I feel this has something to do
> > > with CPU caches and related optimizations, and I would be grateful if
> > > someone could point me in the right direction.
> > >
> > > I wrap every rte_mbuf into my own structure, say my_packet. Here is
> > > my_packet's structure declaration:
> > >
> > > ```
> > > struct my_packet {
> > >     struct rte_mbuf *m;
> > >     uint16_t tag1;
> > >     uint16_t tag2;
> > > };
> > > ```
> >
> > The only problem you have created is having to pull in another cache
> > line by having to access the my_packet structure.
> > The mbuf is highly optimized to limit the number of cache lines
> > required to be pulled into cache for an mbuf. The mbuf structure is
> > split between RX and TX: when doing TX you touch one of the two cache
> > lines the mbuf is contained in, and on RX you touch the other cache
> > line; at least that is the reason for the order of the members in the
> > mbuf.
> >
> > For the most part, accessing a packet of data takes about 2-3 cache
> > lines to load into memory. Getting the prefetches issued far enough in
> > advance to get the cache lines into top-level cache is hard to do. In
> > one case, when I removed the prefetches the performance increased, not
> > decreased. :-(
> >
> > Sounds like you are hitting this problem of now loading 4 cache lines,
> > and this causes the CPU to stall. One method is to prefetch the packets
> > into a list, then prefetch a number of cache lines in advance, then
> > start processing the first packet of data. In some cases I have seen
> > that prefetching 3 packets' worth of cache lines helps. YMMV.
> >
> > You did not list the processor you are using, but Intel Xeon processors
> > have a limit on the number of outstanding prefetches you can have at a
> > time; I think 8 is the number. VPP at fd.io also uses this method to
> > prefetch the data and not allow the CPU to stall.
> >
> > Look at the code in examples/ip_fragmentation/main.c that prefetches
> > mbufs and data structures. I hope that one helps.
> >
> > > During initialization, I reserve a mempool of type struct my_packet
> > > with 8192 elements. Whenever I form my_packet structures, I get them
> > > in bursts; similarly, for freeing, I put them back into the pool in
> > > bursts.
> > >
> > > So there is a loop in the datapath which touches each of these
> > > my_packet structures' tags to make a decision.
> > >
> > > ```
> > > for (i = 0; i < pkt_count; i++) {
> > >     if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
> > >                              (void **)&val[i]) < 0) {
> > >         /* lookup miss -- nothing to do in this minimal example */
> > >     }
> > > }
> > > ```
> > >
> > > Based on my tests, &(my_packet->tag1) is the cause of not letting me
> > > achieve line rate in the fast path. I say this because if I hardcode
> > > tag1's value, I am able to achieve line rate. As a workaround, I
> > > tried to use rte_prefetch0() and rte_prefetch_non_temporal() to
> > > prefetch 2 to 8 my_packet(s) from the my_packet[] array, but nothing
> > > seems to boost the throughput.
> > >
> > > I tried to play with the flags in the rte_mempool_create() function
> > > call:
> > > -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G.
> > > -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
> > >    settles to ~8.5G after 20 or 30 seconds.
> > > -- NO FLAG gives 7.7G.
> > >
> > > I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
> > >
> > > Any help or pointers are highly appreciated.
> > >
> > > Thanks,
> > > Arvind
> >
> > Regards,
> > Keith
>
> Regards,
> Keith
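P.S. In case it is useful for comparison, the "no processing" case I mentioned
at the top (which does reach 10G) is essentially the dumb forwarder you
suggest -- roughly the loop below. This is a trimmed, untested sketch: port
and queue setup is omitted, and the port IDs, queue 0, and BURST_SIZE are
placeholders, not my actual configuration.

```
#include <stdint.h>

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Minimal Rx -> Tx loop: no tag lookup, no my_packet wrapper. */
static void
dumb_forward(uint16_t rx_port, uint16_t tx_port)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        const uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, BURST_SIZE);

        if (nb_rx == 0)
            continue;

        const uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);

        /* Free any packets the TX queue did not accept. */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```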