From mboxrd@z Thu Jan 1 00:00:00 1970
From: Arvind Narayanan
Date: Tue, 11 Sep 2018 12:18:42 -0500
To: keith.wiles@intel.com
Cc: users@dpdk.org
Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

If I don't do any processing, I easily get 10G. It is only when I access the
tag that the throughput drops. What confuses me is that if I use the following
snippet, it works at line rate.

```
int temp_key = 1; // declared outside of the for loop

for (i = 0; i < pkt_count; i++) {
    if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
        /* lookup miss -- nothing to do in this test */
    }
}
```

But as soon as I replace `temp_key` with `my_packet->tag1`, I see a drop in
throughput (which in a way confirms the issue is due to cache misses).

__rte_cache_aligned may not be required, as the mempool from which I pull the
pre-allocated structs already cache-aligns them. But let me try adding it to
the struct as well to make sure.

Yes, I did come across VTune, and there is a free trial period which I guess
would help me confirm that it is due to the cache misses.

l2fwd and l3fwd easily achieve 10G. :(
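To make the __rte_cache_aligned change above concrete, this is roughly the
shape of what I have in mind, combined with the prefetch-ahead pattern from
examples/ip_fragmentation that I described. It is an untested, trimmed-down
sketch: the PREFETCH_OFFSET value, the helper name, and the parameter types
are placeholders; the real code pulls my_packet[] from the mempool in bursts
as described earlier.

```
#include <stdint.h>

#include <rte_hash.h>       /* rte_hash_lookup_data() */
#include <rte_mbuf.h>
#include <rte_memory.h>     /* __rte_cache_aligned */
#include <rte_prefetch.h>   /* rte_prefetch0() */

/* Give each wrapper its own cache line so two my_packet structs never
 * share one (the mempool may already guarantee this). */
struct my_packet {
    struct rte_mbuf *m;
    uint16_t tag1;
    uint16_t tag2;
} __rte_cache_aligned;

/* Arbitrary look-ahead; I have tried values from 3 to 16. */
#define PREFETCH_OFFSET 4

static inline void
lookup_tags(const struct rte_hash *rx_table, struct my_packet *my_packet[],
            void *val[], int pkt_count)
{
    int i;

    /* Issue prefetches for the first few tags before touching any of them. */
    for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
        rte_prefetch0(&my_packet[i]->tag1);

    /* Prefetch PREFETCH_OFFSET packets ahead, then do the current lookup. */
    for (i = 0; i < pkt_count; i++) {
        if (i + PREFETCH_OFFSET < pkt_count)
            rte_prefetch0(&my_packet[i + PREFETCH_OFFSET]->tag1);

        if (rte_hash_lookup_data(rx_table, &my_packet[i]->tag1,
                                 (void **)&val[i]) < 0) {
            /* lookup miss -- nothing to do here */
        }
    }
}
```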
Thanks,
Arvind

On Tue, Sep 11, 2018 at 11:52 AM Wiles, Keith wrote:
>
> > On Sep 11, 2018, at 10:42 AM, Arvind Narayanan wrote:
> >
> > Keith, thanks!
> >
> > My structure's size is 24 bytes, and for that particular for loop I do
> > not dereference the rte_mbuf pointer, hence my understanding is that it
> > wouldn't require loading 4 cache lines, correct?
> >
> > I am only looking at the tags to make a decision and then simply move
> > ahead on the fast path.
>
> The mbufs do get accessed by the Rx path, so a cacheline is pulled. If
> you are not accessing the mbuf structure or data, then I am not sure what
> the problem is. Is the my_packet structure starting on a cacheline, and
> have you tried putting each structure on a cacheline using
> __rte_cache_aligned?
>
> Have you used VTune or some of the other tools on the Intel site?
> https://software.intel.com/en-us/intel-vtune-amplifier-xe
>
> Not sure about cost or anything. VTune is a great tool, but for me it
> does have some learning curve to understand the output.
>
> A Xeon core of this type should be able to forward 64-byte frames nicely
> at 10G. Maybe just do the normal Rx and then send it back out like a dumb
> forwarder, without doing all of the processing. Are the NIC(s) and cores
> on the same socket, if you have a multi-socket system? Just shooting in
> the dark here.
>
> Also, did you try the l2fwd or l3fwd example and see if that app can get
> to 10G?
>
> > I tried the method suggested in the ip_fragmentation example. I tried
> > several values of PREFETCH_OFFSET -- 3 to 16 -- but none helped boost
> > throughput.
> >
> > Here is my CPU info:
> >
> > Model name:    Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
> > Architecture:  x86_64
> > L1d cache:     32K
> > L1i cache:     32K
> > L2 cache:      256K
> > L3 cache:      15360K
> > Flags:         fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> > cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
> > nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
> > xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
> > ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2
> > x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi
> > flexpriority ept vpid xsaveopt dtherm ida arat pln pts
> >
> > Just to provide some more context, I isolate the CPU core used for the
> > fast path from the kernel, hence this core is fully dedicated to the
> > fast-path pipeline.
> >
> > The only time the performance bumps from 7.7G to ~8.4G (still not
> > close to 10G/100%) is when I add flags such as MEMPOOL_F_NO_SPREAD or
> > MEMPOOL_F_NO_CACHE_ALIGN.
> >
> > Thanks,
> > Arvind
> >
> > ---------- Forwarded message ---------
> > From: Wiles, Keith
> > Date: Tue, Sep 11, 2018 at 9:20 AM
> > Subject: Re: [dpdk-users] How to use software prefetching for custom
> > structures to increase throughput on the fast path
> > To: Arvind Narayanan
> > Cc: users@dpdk.org
> >
> > > On Sep 11, 2018, at 3:15 AM, Arvind Narayanan wrote:
> > >
> > > Hi,
> > >
> > > I am trying to write a DPDK application and am finding it difficult
> > > to achieve line rate on a 10G NIC. I feel this has something to do
> > > with CPU caches and related optimizations, and I would be grateful if
> > > someone could point me in the right direction.
> > >
> > > I wrap every rte_mbuf into my own structure, say my_packet. Here is
> > > my_packet's structure declaration:
> > >
> > > ```
> > > struct my_packet {
> > >     struct rte_mbuf *m;
> > >     uint16_t tag1;
> > >     uint16_t tag2;
> > > };
> > > ```
> >
> > The only problem you have created is having to pull in another cache
> > line by having to access the my_packet structure.
> > The mbuf is highly optimized to limit the number of cache lines
> > required to be pulled into cache for an mbuf. The mbuf structure is
> > split between RX and TX: when doing TX you touch one of the two cache
> > lines the mbuf is contained in, and on RX you touch the other cache
> > line; at least that is the reason for the order of the members in the
> > mbuf.
> >
> > For the most part, accessing a packet of data takes about 2-3 cache
> > lines to load into memory. Getting the prefetches issued far enough in
> > advance to get the cache lines into top-level cache is hard to do. In
> > one case, when I removed the prefetches the performance increased, not
> > decreased. :-(
> >
> > Sounds like you are hitting this problem of now loading 4 cache lines,
> > and this causes the CPU to stall. One method is to prefetch the packets
> > into a list, then prefetch a number of cache lines in advance, then
> > start processing the first packet of data. In some cases I have seen
> > that prefetching 3 packets' worth of cache lines helps. YMMV.
> >
> > You did not list the processor you are using, but Intel Xeon processors
> > have a limit on the number of outstanding prefetches you can have at a
> > time; I think 8 is the number. VPP at fd.io also uses this method to
> > prefetch the data and not allow the CPU to stall.
> >
> > Look at the code in examples/ip_fragmentation/main.c that prefetches
> > mbufs and data structures. I hope that one helps.
> >
> > > During initialization, I reserve a mempool of type struct my_packet
> > > with 8192 elements. Whenever I form my_packet structures, I get them
> > > in bursts; similarly, for freeing, I put them back into the pool in
> > > bursts.
> > >
> > > So there is a loop in the datapath which touches each of these
> > > my_packet structures' tags to make a decision.
> > >
> > > ```
> > > for (i = 0; i < pkt_count; i++) {
> > >     if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
> > >                              (void **)&val[i]) < 0) {
> > >         /* lookup miss -- nothing to do in this minimal example */
> > >     }
> > > }
> > > ```
> > >
> > > Based on my tests, &(my_packet->tag1) is the cause of not letting me
> > > achieve line rate in the fast path. I say this because if I hardcode
> > > tag1's value, I am able to achieve line rate. As a workaround, I
> > > tried to use rte_prefetch0() and rte_prefetch_non_temporal() to
> > > prefetch 2 to 8 my_packet(s) from the my_packet[] array, but nothing
> > > seems to boost the throughput.
> > >
> > > I tried to play with the flags in the rte_mempool_create() function
> > > call:
> > > -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G.
> > > -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
> > >    settles to ~8.5G after 20 or 30 seconds.
> > > -- NO FLAG gives 7.7G.
> > >
> > > I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
> > >
> > > Any help or pointers are highly appreciated.
> > >
> > > Thanks,
> > > Arvind
> >
> > Regards,
> > Keith
>
> Regards,
> Keith
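P.S. In case it is useful for comparison, the "no processing" case I mentioned
at the top (which does reach 10G) is essentially the dumb forwarder you
suggest -- roughly the loop below. This is a trimmed, untested sketch: port
and queue setup is omitted, and the port IDs, queue 0, and BURST_SIZE are
placeholders, not my actual configuration.

```
#include <stdint.h>

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Minimal Rx -> Tx loop: no tag lookup, no my_packet wrapper. */
static void
dumb_forward(uint16_t rx_port, uint16_t tx_port)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        const uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, BURST_SIZE);

        if (nb_rx == 0)
            continue;

        const uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);

        /* Free any packets the TX queue did not accept. */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```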