From: Shailja Pandey
To: "Wiles, Keith"
Cc: dev@dpdk.org
Date: Mon, 23 Apr 2018 19:42:10 +0530
Subject: Re: [dpdk-dev] Why is packet replication more efficient when done using memcpy() as compared to the rte_mbuf_refcnt_update() function?

On Thursday 19 April 2018 09:38 PM, Wiles, Keith wrote:
>
>> On Apr 19, 2018, at 9:30 AM, Shailja Pandey wrote:
>>
>>> The two code fragments are doing this in two different ways: the first uses a loop to create possibly more than one replica and the second one does not, correct? The loop can cause performance hits, but they should be small.
>> Sorry for the confusion; for the memcpy version we are also using a loop outside of this function. Essentially, we are making the same number of copies in both cases.
>>> The first one is using the hdr->next pointer, which is in the second cacheline of the mbuf header; this can and will cause a cacheline miss and degrade your performance. The second code does not touch hdr->next and will not cause a cacheline miss. When the packet goes beyond 64 bytes you hit the second cacheline, so are you starting to see the problem here?
>> We also performed the same experiment for different packet sizes (64B, 128B, 256B, 512B, 1024B, 1518B); the sharp drop in throughput is observed only when the packet size increases from 64B to 128B, not after that. So a cacheline miss should happen for the other packet sizes as well, and I am not sure why this is the case. Why is the drop not equally sharp beyond 128B packets when replicating using rte_pktmbuf_refcnt_update()?
>>
>>> Every time you touch a new cache line performance will drop unless the cacheline is prefetched into memory first, but in this case it really can not be done easily. Count the cachelines you are touching and make sure they are the same number in each case.
>> I don't understand the complexity here, could you please explain it in detail?
> In this case you can not do a prefetch on the other cache lines far enough in advance to avoid a CPU stall on a cacheline.
>
>>> Why did you use memcpy and not rte_memcpy here, as rte_memcpy should be faster?
> Still did not answer this question.
>
>>> I believe DPDK now has an rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know up front the number of packets you need to replicate.
>> We are already using both of these functions; just to simplify the pseudo-code I used memcpy and rte_pktmbuf_alloc().
> Then please show the real code fragment, as your example was confusing.
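
To make the comparison concrete, here is a minimal sketch of the two replication paths being compared. It is not the exact code from our application (the real version uses rte_pktmbuf_alloc_bulk() and rte_memcpy() and loops over the number of replicas); the helper names replicate_deep() and replicate_refcnt() are purely illustrative.

#include <rte_mbuf.h>
#include <rte_memcpy.h>

/* Deep copy: allocate a fresh mbuf and copy the packet contents into it
 * (single-segment case only, for brevity). */
static struct rte_mbuf *
replicate_deep(struct rte_mbuf *pkt, struct rte_mempool *mp)
{
	struct rte_mbuf *copy = rte_pktmbuf_alloc(mp);

	if (copy == NULL)
		return NULL;
	copy->data_len = pkt->data_len;
	copy->pkt_len = pkt->pkt_len;
	rte_memcpy(rte_pktmbuf_mtod(copy, void *),
		   rte_pktmbuf_mtod(pkt, void *),
		   pkt->data_len);
	return copy;
}

/* Refcount-based replication: bump the reference count and transmit the
 * same mbuf again instead of copying it. Every "copy" shares the same
 * data buffer and mbuf metadata. rte_pktmbuf_refcnt_update() would do
 * the same for every segment of a chained packet. */
static struct rte_mbuf *
replicate_refcnt(struct rte_mbuf *pkt)
{
	rte_mbuf_refcnt_update(pkt, 1);
	return pkt;
}

The refcnt path never touches the packet data, which is why it looked attractive; but because every transmitted replica shares the original's buffer, any header fix-up (such as the ether_type issue described below) has to be applied to that shared buffer.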
In our experiments with packet replication using rte_pktmbuf_refcnt_update(), we observed a sharp drop in throughput when the packet size was changed from 64B to 128B because the replicated packets were not actually being sent. Only the original packets were transmitted, so throughput roughly dropped to half of the 64B case, where both the replicated and the original packets went out. The root cause was that the ether_type field was not being set appropriately for the replicated packets, so they were dropped at the hardware level (a minimal sketch of the fix is in the postscript at the end of this mail).

We did not realize this earlier because for 64B packets it was not a problem: the NIC was able to transmit both the original and the replicated packets despite the ether_type field not being set appropriately. For 128B and larger packets, the replicated packets were handed to the NIC by the driver but never transmitted on the wire, hence the drop in throughput.

After setting the ether_type field appropriately for the 128B and larger packet sizes, the throughput is similar for all packet sizes.

>
>> #  pktsz 1 (64B)    | pktsz 2 (128B)    | pktsz 3 (256B)    | pktsz 4 (512B)    | pktsz 5 (1024B)   |
>> #  memcpy   refcnt  | memcpy   refcnt   | memcpy   refcnt   | memcpy   refcnt   | memcpy   refcnt   |
>>    5949888  5806720 | 5831360  2890816  | 5640379  2886016  | 5107840  2863264  | 4510121  2692876  |
>>
> Refcnt also needs to adjust the value using an atomic update, and you still have not told me the type of system you are on, x86 or ???
>
> Please describe your total system: host OS, DPDK version, NICs used, ... A number of people have performed similar tests and do not see the problem you are describing. Maybe modify, say, L3fwd (which does something similar to your example code) and see if you still see the difference. Then you can post the patch to that example app and we can try to figure it out.
>
>> Throughput is in MPPS.
>>
>> --
>>
>> Thanks,
>> Shailja
>>
> Regards,
> Keith
>

Thanks again!

--

Thanks,
Shailja
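
P.S. For completeness, a minimal sketch of the fix mentioned above: writing a valid ether_type into the frame before transmission. It assumes an untagged Ethernet header with an IPv4 payload (struct ether_hdr and ETHER_TYPE_IPv4 come from rte_ether.h); the helper name set_ether_type() is illustrative, not the exact code from our application. With refcnt-based replication the replicas share the original's data buffer, so setting the field once on the original is enough; deep copies need it set on each copy.

#include <rte_ether.h>
#include <rte_byteorder.h>
#include <rte_mbuf.h>

/* Make sure the frame carries a valid ether_type before TX. Assumes an
 * untagged Ethernet header at the start of the frame and an IPv4
 * payload; use the ether type that matches your payload. */
static void
set_ether_type(struct rte_mbuf *pkt)
{
	struct ether_hdr *eth = rte_pktmbuf_mtod(pkt, struct ether_hdr *);

	eth->ether_type = rte_cpu_to_be_16(ETHER_TYPE_IPv4);
}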