From: "Wiles, Keith"
To: Arvind Narayanan
CC: "users@dpdk.org"
Date: Tue, 11 Sep 2018 16:52:37 +0000
Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

> On Sep 11, 2018, at 10:42 AM, Arvind Narayanan wrote:
>
> Keith, thanks!
>
> My structure's size is 24 bytes, and for that particular for-loop I do not
> dereference the rte_mbuf pointer, hence my understanding is it wouldn't
> require loading 4 cache lines, correct?
> I am only looking at the tags to make a decision and then simply move ahead
> on the fast path.

The mbufs do get accessed by the Rx path, so a cache line is pulled. If you are not accessing the mbuf structure or data, then I am not sure what the problem is. Is the my_packet structure starting on a cache line, and have you tried putting each structure on its own cache line using __rte_cache_aligned?

Have you used VTune or some of the other tools on the Intel site?

https://software.intel.com/en-us/intel-vtune-amplifier-xe

Not sure about cost or anything. VTune is a great tool, but for me it does have some learning curve to understand the output.

A Xeon core of this type should be able to forward 64-byte frames nicely at 10G. Maybe just do the normal Rx, skip all of the processing, and send the packets back out like a dumb forwarder to see what rate that gives you. Are the NIC(s) and cores on the same socket, if you have a multi-socket system? Just shooting in the dark here.

Also, did you try the l2fwd or l3fwd example and see if that app can get to 10G?

>
> I tried the method suggested in the ip_fragmentation example. I tried
> several values of PREFETCH_OFFSET -- 3 to 16, but none helped boost
> throughput.
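If changing the prefetch distance does not move the needle, the elements themselves may be sharing cache lines, which is why I keep coming back to alignment. To be concrete about the __rte_cache_aligned suggestion, the sketch below (untested, written against the struct you posted) is what I had in mind: the 24-byte structure gets padded out to a full line, so walking the array touches exactly one new cache line per element.

```c
#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_memory.h>   /* RTE_CACHE_LINE_SIZE, __rte_cache_aligned */

/*
 * Align each element to a cache line so two my_packet entries never
 * share a line and a prefetch of one element never drags in a
 * neighbor's data.
 */
struct my_packet {
	struct rte_mbuf *m;
	uint16_t tag1;
	uint16_t tag2;
} __rte_cache_aligned;
```

It does cost memory (64 bytes per element instead of 24), so check that your 8192-element pool still sits comfortably in cache on your box.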
>
> Here is my CPU info:
>
> Model name:    Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
> Architecture:  x86_64
> L1d cache:     32K
> L1i cache:     32K
> L2 cache:      256K
> L3 cache:      15360K
> Flags:         fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx
> est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
> tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept
> vpid xsaveopt dtherm ida arat pln pts
>
> Just to provide some more context, I isolate the CPU core used for the
> fast path from the kernel, hence this core is fully dedicated to the
> fast-path pipeline.
>
> The only time the performance bumps from 7.7G to ~8.4G (still not close to
> 10G/100%) is when I add flags such as MEMPOOL_F_NO_SPREAD or
> MEMPOOL_F_NO_CACHE_ALIGN.
>
> Thanks,
> Arvind
>
> ---------- Forwarded message ---------
> From: Wiles, Keith
> Date: Tue, Sep 11, 2018 at 9:20 AM
> Subject: Re: [dpdk-users] How to use software prefetching for custom
> structures to increase throughput on the fast path
> To: Arvind Narayanan
> Cc: users@dpdk.org
>
> > On Sep 11, 2018, at 3:15 AM, Arvind Narayanan wrote:
> >
> > Hi,
> >
> > I am trying to write a DPDK application and finding it difficult to
> > achieve line rate on a 10G NIC. I feel this has something to do with CPU
> > caches and related optimizations, and would be grateful if someone can
> > point me in the right direction.
> >
> > I wrap every rte_mbuf into my own structure, say, my_packet. Here is
> > my_packet's structure declaration:
> >
> > ```
> > struct my_packet {
> >         struct rte_mbuf *m;
> >         uint16_t tag1;
> >         uint16_t tag2;
> > };
> > ```
>
> The only problem you have created is having to pull in another cache line
> by having to access the my_packet structure. The mbuf is highly optimized
> to limit the number of cache lines required to be pulled into cache for an
> mbuf. The mbuf structure is split between RX and TX: when doing TX you
> touch one of the two cache lines the mbuf is contained in, and on RX you
> touch the other cache line; at least that is the reason for the order of
> the members in the mbuf.
>
> For the most part, accessing a packet of data takes about 2-3 cache lines
> to load into memory. Getting the prefetches issued far enough in advance to
> get the cache lines into top-level cache is hard to do. In one case, when I
> removed the prefetches the performance increased, not decreased. :-(
>
> Sounds like you are hitting this problem of now loading 4 cache lines, and
> this causes the CPU to stall. One method is to prefetch the packets in a
> list, then prefetch a number of cache lines in advance, then start
> processing the first packet of data. In some cases I have seen that
> prefetching 3 packets' worth of cache lines helps. YMMV
>
> You did not list the processor you are using, but Intel Xeon processors
> have a limit to the number of outstanding prefetches you can have at a
> time; I think 8 is the number. VPP at fd.io also uses this method to
> prefetch the data and not allow the CPU to stall.
>
> Look in examples/ip_fragmentation/main.c at the code that prefetches mbufs
> and data structures. I hope that one helps.
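To save you digging through the example again, the pattern in examples/ip_fragmentation/main.c boils down to the loop below, adapted from memory to your my_packet array rather than mbufs, so treat it as an untested sketch. handle_burst() and process_tag() are placeholder names for your own code (the hash lookup and the decision), not DPDK APIs, and the struct is repeated just so the snippet stands alone.

```c
#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define PREFETCH_OFFSET 3   /* elements prefetched ahead; 3-4 is a typical start */

struct my_packet {          /* your structure from the thread; add
	                     * __rte_cache_aligned per the note above if you try that */
	struct rte_mbuf *m;
	uint16_t tag1;
	uint16_t tag2;
};

/* Placeholder for your per-element work (hash lookup on tag1, decision). */
static void
process_tag(struct my_packet *pkt)
{
	(void)pkt;          /* your rte_hash_lookup_data() call would go here */
}

static void
handle_burst(struct my_packet **pkts, int n)
{
	int i;

	/* Stage 1: start prefetching the first few elements. */
	for (i = 0; i < PREFETCH_OFFSET && i < n; i++)
		rte_prefetch0(&pkts[i]->tag1);

	/* Stage 2: process element i while prefetching element i + OFFSET,
	 * so the tag is (hopefully) in cache by the time it is read. */
	for (i = 0; i < n - PREFETCH_OFFSET; i++) {
		rte_prefetch0(&pkts[i + PREFETCH_OFFSET]->tag1);
		process_tag(pkts[i]);
	}

	/* Stage 3: the last PREFETCH_OFFSET elements were prefetched above. */
	for (; i < n; i++)
		process_tag(pkts[i]);
}
```

The point is only to keep a few prefetches in flight ahead of where you are reading; past the hardware limit on outstanding prefetches, larger offsets stop helping, which matches what you saw going from 3 to 16.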
>
> >
> > During initialization, I reserve a mempool of type struct my_packet with
> > 8192 elements. Whenever I form my_packet, I get them in bursts; similarly,
> > for freeing I put them back into the pool as bursts.
> >
> > So there is a loop in the datapath which touches each of these my_packet's
> > tags to make a decision.
> >
> > ```
> > for (i = 0; i < pkt_count; i++) {
> >         if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
> >                                  (void **)&val[i]) < 0) {
> >         }
> > }
> > ```
> >
> > Based on my tests, &(my_packet->tag1) is the cause for not letting me
> > achieve line rate in the fast path. I say this because if I hardcode
> > tag1's value, I am able to achieve line rate. As a workaround, I tried to
> > use rte_prefetch0() and rte_prefetch_non_temporal() to prefetch 2 to 8
> > my_packet(s) from the my_packet[] array, but nothing seems to boost the
> > throughput.
> >
> > I tried to play with the flags in the rte_mempool_create() function call:
> > -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G
> > -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
> > settles to ~8.5G after 20 or 30 seconds
> > -- NO FLAG gives 7.7G
> >
> > I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
> >
> > Any help or pointers are highly appreciated.
> >
> > Thanks,
> > Arvind
>
> Regards,
> Keith

Regards,
Keith