Subject: Re: [dpdk-dev] rte_prefetch0() performance info
From: Anuj Kalia
To: Parikshith Chowdaiah
Cc: dev@dpdk.org
Date: Thu, 5 Mar 2015 03:51:01 -0500

Hi Parikshith.

A CPU core can have only a limited number of prefetches in flight (around 10). So if you issue 64 (or nb_rx > 10) prefetches in quick succession, you'll stall on memory access anyway.

The main idea here is to overlap prefetches for some packets with computation on other packets (see also the sketch appended at the end of this message). This paper explains it in the context of hash tables, but the idea is similar:
https://www.cs.cmu.edu/~binfan/papers/conext13_cuckooswitch.pdf

--Anuj

On Thu, Mar 5, 2015 at 3:46 AM, Parikshith Chowdaiah wrote:
> Hi all,
> I have a question about the usage of the rte_prefetch0() function. In one of
> the sample files, we have an implementation like:
>
>         /* Prefetch first packets */
>         for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
>                 rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
>         }
>
>         /* Prefetch and forward already prefetched packets */
>         for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
>                 rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j + PREFETCH_OFFSET], void *));
>                 l3fwd_simple_forward(pkts_burst[j], portid, qconf);
>         }
>
>         /* Forward remaining prefetched packets */
>         for (; j < nb_rx; j++) {
>                 l3fwd_simple_forward(pkts_burst[j], portid, qconf);
>         }
>
> where the rte_prefetch0() calls are split across multiple loops. I would like
> some insight into whether this actually improves performance compared to
> something like:
>
>         for (j = 0; j < nb_rx; j++) {
>                 rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
>         }
>
> Also, how frequently does rte_prefetch0() need to be called for the same
> packet, and is there any mechanism to prefetch 64 packets in bulk at once?
>
> thanks
> Parikshith
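Appended: a minimal, standalone sketch of the pipelined-prefetch pattern discussed above. It is not taken from the l3fwd sample; handle_packet() is a hypothetical stand-in for the application's per-packet work, and PREFETCH_WINDOW is an illustrative tuning knob that should stay around the number of prefetches the core can keep in flight.

	/*
	 * Sketch of pipelined prefetching for a received burst, assuming DPDK
	 * mbufs. While packet j is being processed, the prefetch for packet
	 * j + PREFETCH_WINDOW is already in flight, so memory latency is
	 * hidden behind useful work instead of stalling the core.
	 */
	#include <stdint.h>
	#include <rte_mbuf.h>
	#include <rte_prefetch.h>

	#define PREFETCH_WINDOW 8   /* illustrative; keep near HW prefetch capacity (~10) */

	/* Hypothetical stand-in for per-packet work (parse, lookup, TX, ...). */
	static void
	handle_packet(struct rte_mbuf *m)
	{
		(void)m;
	}

	static void
	process_burst(struct rte_mbuf **pkts, uint16_t nb_rx)
	{
		uint16_t j;

		/* Warm up: issue the first few prefetches before touching any data. */
		for (j = 0; j < PREFETCH_WINDOW && j < nb_rx; j++)
			rte_prefetch0(rte_pktmbuf_mtod(pkts[j], void *));

		/* Steady state: one prefetch ahead, one packet processed, per iteration. */
		for (j = 0; j + PREFETCH_WINDOW < nb_rx; j++) {
			rte_prefetch0(rte_pktmbuf_mtod(pkts[j + PREFETCH_WINDOW], void *));
			handle_packet(pkts[j]);
		}

		/* Drain: the last PREFETCH_WINDOW packets were already prefetched. */
		for (; j < nb_rx; j++)
			handle_packet(pkts[j]);
	}

The window size is the knob to experiment with: too small and the prefetch has not completed by the time the packet is processed; too large and the core's outstanding-prefetch limit is exceeded, which is exactly the problem with issuing all 64 prefetches up front.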