From: "Wiles, Keith"
To: Arvind Narayanan
CC: users@dpdk.org
Date: Tue, 11 Sep 2018 14:20:00 +0000
Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

> On Sep 11, 2018, at 3:15 AM, Arvind Narayanan wrote:
>
> Hi,
>
> I am trying to write a DPDK application and finding it difficult to
> achieve line rate on a 10G NIC. I feel this has something to do with
> CPU caches and related optimizations, and would be grateful if someone
> can point me in the right direction.
>
> I wrap every rte_mbuf into my own structure, say my_packet. Here is
> my_packet's structure declaration:
>
> ```
> struct my_packet {
>     struct rte_mbuf *m;
>     uint16_t tag1;
>     uint16_t tag2;
> };
> ```

The only problem you have created is having to pull in another cache
line by accessing the my_packet structure. The mbuf is highly optimized
to limit the number of cache lines that must be pulled into cache for an
mbuf. The mbuf structure is split between RX and TX: when doing TX you
touch one of the two cache lines the mbuf occupies, and on RX you touch
the other cache line; at least, that is the reason for the ordering of
the members in the mbuf.

For the most part, accessing a packet of data takes about 2-3 cache
lines loaded into memory. Getting the prefetches issued far enough in
advance to get the cache lines into the top-level cache is hard to do.
In one case, when I removed the prefetches the performance increased,
not decreased. :-(

It sounds like you are hitting this problem of now loading 4 cache
lines, which causes the CPU to stall. One method is to gather the
packets into a list, prefetch a number of cache lines in advance, and
then start processing the first packet of data. In some cases I have
seen that prefetching 3 packets' worth of cache lines helps.
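Something along these lines (a rough, untested sketch using the
my_packet struct and rx_table lookup from your mail; the lookup_burst
name and the prefetch distance of 3 are placeholders you will need to
tune for your workload):

```
/*
 * Rough, untested sketch -- assumes the struct my_packet and the
 * rx_table hash from the mail above. Prefetch a few entries ahead so
 * the cache line holding tag1 is already in cache by the time
 * rte_hash_lookup_data() touches it.
 */
#include <stdint.h>
#include <rte_prefetch.h>
#include <rte_hash.h>

#define PREFETCH_OFFSET 3	/* how far ahead to prefetch; tune this */

static void
lookup_burst(const struct rte_hash *rx_table,
	     struct my_packet **my_packet, void **val, uint16_t pkt_count)
{
	uint16_t i;

	/* Prime the pipeline with the first few prefetches. */
	for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
		rte_prefetch0(my_packet[i]);

	for (i = 0; i < pkt_count; i++) {
		/* Issue the prefetch PREFETCH_OFFSET packets ahead of use. */
		if (i + PREFETCH_OFFSET < pkt_count)
			rte_prefetch0(my_packet[i + PREFETCH_OFFSET]);

		if (rte_hash_lookup_data(rx_table, &my_packet[i]->tag1,
					 &val[i]) < 0) {
			/* lookup miss -- handle as before */
		}
	}
}
```

You will have to experiment with the offset: too small and you still
stall on the load, too large and you run past the limit on outstanding
prefetches mentioned below.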
YMMV.

You did not list the processor you are using, but Intel Xeon processors
have a limit to the number of outstanding prefetches you can have in
flight at a time; I think 8 is the number. VPP at fd.io also uses this
method to prefetch the data and not allow the CPU to stall.

Look in examples/ip_fragmentation/main.c at the code that prefetches
mbufs and data structures. I hope that one helps.

>
> During initialization, I reserve a mempool of type struct my_packet
> with 8192 elements. Whenever I form my_packet, I get them in bursts;
> similarly, for freeing I put them back into the pool in bursts.
>
> So there is a loop in the datapath which touches each of these
> my_packet's tags to make a decision.
>
> ```
> for (i = 0; i < pkt_count; i++) {
>     if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
>             (void **)&val[i]) < 0) {
>     }
> }
> ```
>
> Based on my tests, &(my_packet->tag1) is the cause for not letting me
> achieve line rate in the fast path. I say this because if I hardcode
> tag1's value, I am able to achieve line rate. As a workaround, I tried
> to use rte_prefetch0() and rte_prefetch_non_temporal() to prefetch 2
> to 8 my_packet(s) from the my_packet[] array, but nothing seems to
> boost the throughput.
>
> I tried to play with the flags in the rte_mempool_create() function call:
> -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G
> -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
>    settles to ~8.5G after 20 or 30 seconds.
> -- NO FLAG gives 7.7G
>
> I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
>
> Any help or pointers are highly appreciated.
>
> Thanks,
> Arvind

Regards,
Keith