From: Shailja Pandey
To: "Wiles, Keith"
Cc: dev@dpdk.org
Date: Mon, 23 Apr 2018 19:42:10 +0530
Subject: Re: [dpdk-dev] Why is packet replication more efficient when done using memcpy() as compared to the rte_mbuf_refcnt_update() function?

On Thursday 19 April 2018 09:38 PM, Wiles, Keith wrote:
>
>> On Apr 19, 2018, at 9:30 AM, Shailja Pandey wrote:
>>
>>> The two code fragments are doing this in two different ways: the first uses a loop to create possibly more than one replica and the second one does not, correct? The loop can cause performance hits, but they should be small.
>> Sorry for the confusion; for the memcpy version we are also using a loop outside of this function. Essentially, we are making the same number of copies in both cases.
>>> The first one is using the hdr->next pointer, which is in the second cacheline of the mbuf header; this can and will cause a cacheline miss and degrade your performance. The second code does not touch hdr->next and will not cause a cacheline miss. When the packet goes beyond 64 bytes you hit the second cacheline, so are you starting to see the problem here?
>> We also performed the same experiment for different packet sizes (64B, 128B, 256B, 512B, 1024B, 1518B); the sharp drop in throughput is observed only when the packet size increases from 64B to 128B, not after that. So a cacheline miss should happen for the other packet sizes as well, and I am not sure why this is the case. Why is the drop not equally sharp beyond 128B packets when replicating using rte_pktmbuf_refcnt_update()?
>>
>>> Every time you touch a new cache line performance will drop unless the cacheline is prefetched into memory first, but in this case it really can not be done easily. Count the cachelines you are touching and make sure they are the same number in each case.
>> I don't understand the complexity here, could you please explain it in detail?
> In this case you can not do a prefetch on the other cache lines far enough in advance to avoid a CPU stall on a cacheline.
>
>>> Why did you use memcpy and not rte_memcpy here, as rte_memcpy should be faster?
> Still did not answer this question.
>
>>> I believe DPDK now has an rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know up front the number of packets you need to replicate.
>> We are already using both of these functions; just to simplify the pseudo-code I used memcpy and rte_pktmbuf_alloc().
> Then please show the real code fragment, as your example was confusing.
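
To make the comparison concrete, here is a minimal sketch of the two replication paths being compared. It is not the exact code from our application (the real version uses rte_pktmbuf_alloc_bulk() and rte_memcpy() and loops over the number of replicas); the helper names replicate_deep() and replicate_refcnt() are purely illustrative.

#include <rte_mbuf.h>
#include <rte_memcpy.h>

/* Deep copy: allocate a fresh mbuf and copy the packet contents into it
 * (single-segment case only, for brevity). */
static struct rte_mbuf *
replicate_deep(struct rte_mbuf *pkt, struct rte_mempool *mp)
{
	struct rte_mbuf *copy = rte_pktmbuf_alloc(mp);

	if (copy == NULL)
		return NULL;
	copy->data_len = pkt->data_len;
	copy->pkt_len = pkt->pkt_len;
	rte_memcpy(rte_pktmbuf_mtod(copy, void *),
		   rte_pktmbuf_mtod(pkt, void *),
		   pkt->data_len);
	return copy;
}

/* Refcount-based replication: bump the reference count and transmit the
 * same mbuf again instead of copying it. Every "copy" shares the same
 * data buffer and mbuf metadata. rte_pktmbuf_refcnt_update() would do
 * the same for every segment of a chained packet. */
static struct rte_mbuf *
replicate_refcnt(struct rte_mbuf *pkt)
{
	rte_mbuf_refcnt_update(pkt, 1);
	return pkt;
}

The refcnt path never touches the packet data, which is why it looked attractive; but because every transmitted replica shares the original's buffer, any header fix-up (such as the ether_type issue described below) has to be applied to that shared buffer.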
In our experiments with packet replication using rte_pktmbuf_refcnt_update(), we observed a sharp drop in throughput when the packet size was changed from 64B to 128B because the replicated packets were not actually being sent. Only the original packets were transmitted, so throughput roughly dropped to half of the 64B case, where both the replicated and the original packets went out. The root cause was that the ether_type field was not being set appropriately for the replicated packets, so they were dropped at the hardware level (a minimal sketch of the fix is in the postscript at the end of this mail).

We did not realize this earlier because for 64B packets it was not a problem: the NIC was able to transmit both the original and the replicated packets despite the ether_type field not being set appropriately. For 128B and larger packets, the replicated packets were handed to the NIC by the driver but never transmitted on the wire, hence the drop in throughput.

After setting the ether_type field appropriately for the 128B and larger packet sizes, the throughput is similar for all packet sizes.

>
>> #  pktsz 1 (64B)    | pktsz 2 (128B)    | pktsz 3 (256B)    | pktsz 4 (512B)    | pktsz 5 (1024B)   |
>> #  memcpy   refcnt  | memcpy   refcnt   | memcpy   refcnt   | memcpy   refcnt   | memcpy   refcnt   |
>>    5949888  5806720 | 5831360  2890816  | 5640379  2886016  | 5107840  2863264  | 4510121  2692876  |
>>
> Refcnt also needs to adjust the value using an atomic update, and you still have not told me the type of system you are on, x86 or ???
>
> Please describe your total system: host OS, DPDK version, NICs used, ... A number of people have performed similar tests and do not see the problem you are describing. Maybe modify, say, L3fwd (which does something similar to your example code) and see if you still see the difference. Then you can post the patch to that example app and we can try to figure it out.
>
>> Throughput is in MPPS.
>>
>> --
>>
>> Thanks,
>> Shailja
>>
> Regards,
> Keith
>

Thanks again!

--

Thanks,
Shailja
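
P.S. For completeness, a minimal sketch of the fix mentioned above: writing a valid ether_type into the frame before transmission. It assumes an untagged Ethernet header with an IPv4 payload (struct ether_hdr and ETHER_TYPE_IPv4 come from rte_ether.h); the helper name set_ether_type() is illustrative, not the exact code from our application. With refcnt-based replication the replicas share the original's data buffer, so setting the field once on the original is enough; deep copies need it set on each copy.

#include <rte_ether.h>
#include <rte_byteorder.h>
#include <rte_mbuf.h>

/* Make sure the frame carries a valid ether_type before TX. Assumes an
 * untagged Ethernet header at the start of the frame and an IPv4
 * payload; use the ether type that matches your payload. */
static void
set_ether_type(struct rte_mbuf *pkt)
{
	struct ether_hdr *eth = rte_pktmbuf_mtod(pkt, struct ether_hdr *);

	eth->ether_type = rte_cpu_to_be_16(ETHER_TYPE_IPv4);
}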