From: Arvind Narayanan
Date: Tue, 11 Sep 2018 10:42:53 -0500
To: keith.wiles@intel.com, users@dpdk.org
Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

Keith, thanks!

My structure's size is 24 bytes, and in that particular for-loop I do not
dereference the rte_mbuf pointer, so my understanding is that it should not
need to load 4 cache lines, correct? I am only looking at the tags to make a
decision and then simply moving ahead on the fast path.

I tried the method suggested in the ip_fragmentation example. I tried several
values of PREFETCH_OFFSET -- 3 to 16 -- but none of them helped boost
throughput.
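Roughly, the shape of what I tried looks like the sketch below. It is a
simplified illustration of the ip_fragmentation-style prefetch-offset loop
applied to my my_packet array, not my exact code; process_tags() and
handle_packet() are placeholder names for the burst loop and the tag lookup.

```
#include <rte_prefetch.h>

#define PREFETCH_OFFSET 3          /* values from 3 to 16 were tried */

struct my_packet;                          /* wrapper declared below in the thread */
void handle_packet(struct my_packet *p);   /* placeholder for the tag lookup */

static void
process_tags(struct my_packet **pkts, int pkt_count)
{
        int i;

        /* Prefetch the first few wrappers before touching any of them. */
        for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
                rte_prefetch0(pkts[i]);

        /* Prefetch wrapper i + PREFETCH_OFFSET while handling wrapper i. */
        for (i = 0; i < pkt_count - PREFETCH_OFFSET; i++) {
                rte_prefetch0(pkts[i + PREFETCH_OFFSET]);
                handle_packet(pkts[i]);
        }

        /* Handle the tail that has already been prefetched. */
        for (; i < pkt_count; i++)
                handle_packet(pkts[i]);
}
```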
Here is my CPU info:

Model name:    Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Architecture:  x86_64
L1d cache:     32K
L1i cache:     32K
L2 cache:      256K
L3 cache:      15360K
Flags:         fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
               pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
               syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs
               bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu
               pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16
               xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
               tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi
               flexpriority ept vpid xsaveopt dtherm ida arat pln pts

Just to provide some more context: I isolate the CPU core used for the fast
path from the kernel, so this core is fully dedicated to the fast-path
pipeline. The only time the throughput bumps from 7.7G to ~8.4G (still not
close to 10G / 100%) is when I add flags such as MEMPOOL_F_NO_SPREAD or
MEMPOOL_F_NO_CACHE_ALIGN.

Thanks,
Arvind

---------- Forwarded message ---------
From: Wiles, Keith
Date: Tue, Sep 11, 2018 at 9:20 AM
Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
To: Arvind Narayanan
Cc: users@dpdk.org

> On Sep 11, 2018, at 3:15 AM, Arvind Narayanan wrote:
>
> Hi,
>
> I am trying to write a DPDK application and finding it difficult to
> achieve line rate on a 10G NIC. I feel this has something to do with CPU
> caches and related optimizations, and would be grateful if someone could
> point me in the right direction.
>
> I wrap every rte_mbuf into my own structure, say, my_packet. Here is
> my_packet's structure declaration:
>
> ```
> struct my_packet {
>     struct rte_mbuf *m;
>     uint16_t tag1;
>     uint16_t tag2;
> };
> ```

The only problem you have created is having to pull in another cache line in
order to access the my_packet structure. The mbuf is highly optimized to
limit the number of cache lines that need to be pulled into cache. The mbuf
structure is split between RX and TX: when doing TX you touch one of the two
cache lines the mbuf is contained in, and on RX you touch the other; at
least, that is the reason for the order of the members in the mbuf.

For the most part, accessing a packet of data takes about 2-3 cache lines to
load into memory. Issuing the prefetches far enough in advance to get the
cache lines into the top-level cache is hard to do. In one case, when I
removed the prefetches the performance increased, not decreased. :-(

It sounds like you are hitting this problem of now loading 4 cache lines,
and this causes the CPU to stall. One method is to prefetch the packets in a
list, then prefetch a number of cache lines in advance, and then start
processing the first packet of data. In some cases I have seen that
prefetching 3 packets' worth of cache lines helps. YMMV.

You did not list the processor you are using, but Intel Xeon processors have
a limit on the number of outstanding prefetches you can have at a time; I
think 8 is the number. VPP at fd.io also uses this method to prefetch the
data and not let the CPU stall.

Look in examples/ip_fragmentation/main.c at the code that prefetches mbufs
and data structures. I hope that one helps.
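To make that concrete, below is a rough sketch of keeping a few packets'
worth of cache lines in flight ahead of the packet being worked on, using the
my_packet wrapper quoted above. It illustrates the idea rather than code from
the example or from the original application; PREFETCH_AHEAD, do_work() and
process_burst() are made-up names.

```
#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define PREFETCH_AHEAD 3    /* roughly "3 packets' worth of cache lines" */

struct my_packet {
        struct rte_mbuf *m;
        uint16_t tag1;
        uint16_t tag2;
};

void do_work(struct my_packet *p);   /* placeholder for the per-packet decision */

static void
process_burst(struct my_packet **pkts, int n)
{
        int i;

        /* Warm up: pull in the wrapper, the first mbuf cache line and the
         * start of the packet data for the first few packets. */
        for (i = 0; i < PREFETCH_AHEAD && i < n; i++) {
                rte_prefetch0(pkts[i]);                              /* 24-byte wrapper */
                rte_prefetch0(pkts[i]->m);                           /* mbuf header     */
                rte_prefetch0(rte_pktmbuf_mtod(pkts[i]->m, void *)); /* packet data     */
        }

        /* Keep PREFETCH_AHEAD packets ahead of the one being processed. */
        for (i = 0; i < n; i++) {
                if (i + PREFETCH_AHEAD < n) {
                        struct my_packet *nxt = pkts[i + PREFETCH_AHEAD];

                        rte_prefetch0(nxt);
                        rte_prefetch0(nxt->m);
                        rte_prefetch0(rte_pktmbuf_mtod(nxt->m, void *));
                }
                do_work(pkts[i]);
        }
}
```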
>
> During initialization, I reserve a mempool of type struct my_packet with
> 8192 elements. Whenever I form my_packet, I get them in bursts; similarly,
> for freeing, I put them back into the pool as bursts.
>
> So there is a loop in the datapath which touches each of these my_packet's
> tags to make a decision.
>
> ```
> for (i = 0; i < pkt_count; i++) {
>     if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
>                              (void **)&val[i]) < 0) {
>     }
> }
> ```
>
> Based on my tests, &(my_packet->tag1) is the cause for not letting me
> achieve line rate in the fast path. I say this because if I hardcode
> tag1's value, I am able to achieve line rate. As a workaround, I tried to
> use rte_prefetch0() and rte_prefetch_non_temporal() to prefetch 2 to 8
> my_packet(s) from the my_packet[] array, but nothing seems to boost the
> throughput.
>
> I tried to play with the flags in the rte_mempool_create() function call:
> -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G
> -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
>    settles to ~8.5G after 20 or 30 seconds
> -- NO FLAG gives 7.7G
>
> I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
>
> Any help or pointers are highly appreciated.
>
> Thanks,
> Arvind

Regards,
Keith
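For reference, the mempool creation being discussed might look like the
sketch below, with each of the flags compared in the thread. The
8192-element count and the my_packet layout come from the thread; the pool
name, per-lcore cache size and socket id are illustrative assumptions.

```
#include <stdint.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

struct my_packet {
        struct rte_mbuf *m;
        uint16_t tag1;
        uint16_t tag2;
};

/* flags: 0, MEMPOOL_F_NO_SPREAD or MEMPOOL_F_NO_CACHE_ALIGN, matching the
 * three throughput cases compared above. */
static struct rte_mempool *
create_my_packet_pool(unsigned int flags)
{
        return rte_mempool_create("my_packet_pool",          /* illustrative name         */
                                  8192,                      /* elements, per the thread  */
                                  sizeof(struct my_packet),
                                  256,                       /* per-lcore cache (assumed) */
                                  0,                         /* no private data           */
                                  NULL, NULL,                /* no pool constructor       */
                                  NULL, NULL,                /* no per-object init        */
                                  rte_socket_id(),
                                  flags);
}
```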