From: "Wiles, Keith"
To: Shailja Pandey
Cc: "dev@dpdk.org"
Date: Wed, 18 Apr 2018 18:36:34 +0000
Subject: Re: [dpdk-dev] Why packet replication is more efficient when done using memcpy() as compared to rte_mbuf_refcnt_update() function?

> On Apr 18, 2018, at 11:43 AM, Shailja Pandey wrote:
> 
> Hello,
> 
> I am doing packet replication and I need to change the Ethernet and IP header fields for each replicated packet. I did it in two different ways:
> 
> 1. Share the payload from the original packet using rte_mbuf_refcnt_update()
>    and allocate a new mbuf for the L2-L4 headers.
> 2. memcpy() the payload from the original packet into a newly created mbuf
>    and prepend the L2-L4 headers to it.
> 
> I performed experiments with varying replication factors as well as varying packet sizes and found that memcpy() performs far better than rte_mbuf_refcnt_update(). But I am not sure why this happens and what makes rte_mbuf_refcnt_update() even worse than memcpy().
> 
> Here is the sample code for both implementations:

The two code fragments do things in two different ways: the first uses a loop to create possibly more than one replica and the second does not, correct? The loop can cause a performance hit, but it should be small.

The first one uses the hdr->next pointer, which sits in the second cacheline of the mbuf header; this can and will cause a cacheline miss and degrade your performance. The second fragment does not touch hdr->next and will not cause that miss. Once the packet data goes beyond 64 bytes you hit a second cacheline as well; are you starting to see the problem here? Every time you touch a new cacheline, performance will drop unless the cacheline is prefetched into memory first, but in this case that really cannot be done easily. Count the cachelines you are touching and make sure they are the same number in each case.
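One quick way to count them is to print the field offsets; a minimal sketch, assuming rte_mbuf.h from your DPDK build is on the include path (expect next to report an offset of 64 or beyond):

    #include <stdio.h>
    #include <stddef.h>
    #include <rte_mbuf.h>

    int main(void)
    {
        /* Fields touched by the refcnt version of the code and where
         * they fall in struct rte_mbuf (64-byte cachelines on x86). */
        printf("data_len at offset %zu\n", offsetof(struct rte_mbuf, data_len));
        printf("pkt_len  at offset %zu\n", offsetof(struct rte_mbuf, pkt_len));
        printf("nb_segs  at offset %zu\n", offsetof(struct rte_mbuf, nb_segs));
        /* next lives in the second cacheline of the mbuf header */
        printf("next     at offset %zu\n", offsetof(struct rte_mbuf, next));
        return 0;
    }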
On Intel x86 systems the cacheline size is 64 bytes; other architectures have different sizes.

> 
> 1. Using rte_mbuf_refcnt_update:
> 
>    struct rte_mbuf *pkt = original_packet;  /* the packet to replicate */
> 
>    rte_pktmbuf_adj(pkt, (uint16_t)(sizeof(struct ether_hdr) + sizeof(struct ipv4_hdr)));
>    rte_pktmbuf_refcnt_update(pkt, replication_factor);
>    for (int i = 0; i < replication_factor; i++) {
>        struct rte_mbuf *hdr;
>        if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>            printf("Failed while cloning $$$\n");
>            return NULL;
>        }
>        hdr->next = pkt;
>        hdr->pkt_len = (uint16_t)(hdr->data_len + pkt->pkt_len);
>        hdr->nb_segs = (uint8_t)(pkt->nb_segs + 1);
>        /* update more metadata fields */
> 
>        /* prepend the IPv4 header first, then the Ethernet header, so the
>         * headers end up in Ethernet-then-IP order in the mbuf */
>        rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ipv4_hdr));
>        /* modify L3 fields */
> 
>        rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ether_hdr));
>        /* modify L2 fields */
>        ...
>    }
> 
> 2. Using memcpy():
> 
>    struct rte_mbuf *pkt = original_packet;  /* the packet to replicate */
>    struct rte_mbuf *hdr;
>    if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>        printf("Failed while cloning $$$\n");
>        return NULL;
>    }
> 
>    /* prepend new header */
>    char *eth_hdr = (char *)rte_pktmbuf_prepend(hdr, pkt->pkt_len);
>    if (eth_hdr == NULL) {
>        printf("panic\n");
>    }
>    char *b = rte_pktmbuf_mtod((struct rte_mbuf *)pkt, char *);
>    memcpy(eth_hdr, b, pkt->pkt_len);
>    /* change L2-L4 header fields in the new packet */
> 
> The throughput roughly halves when the packet size is increased from 64 bytes to 128 bytes and replication is done using rte_mbuf_refcnt_update(). The throughput remains more or less the same as packet size increases when replication is done using memcpy().

Why did you use memcpy() and not rte_memcpy() here, as rte_memcpy() should be faster?

I believe DPDK now has an rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know the number of packets you need to replicate up front.

> 
> Any help would be appreciated.
> 
> --
> 
> Thanks,
> Shailja

Regards,
Keith
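P.S. A rough sketch of how the bulk path might look (my assumptions: replication_factor is known up front, MAX_REPLICAS is a hypothetical upper bound, and the mbufs in header_pool have enough tailroom for the full copy; the error paths should also free the bulk allocation, omitted here for brevity):

    #include <rte_mbuf.h>
    #include <rte_memcpy.h>

    #define MAX_REPLICAS 64  /* hypothetical upper bound on replication_factor */

    struct rte_mbuf *replicas[MAX_REPLICAS];

    /* One pool operation instead of replication_factor separate allocs;
     * rte_pktmbuf_alloc_bulk() returns 0 on success. */
    if (rte_pktmbuf_alloc_bulk(header_pool, replicas, replication_factor) != 0)
        return NULL;

    for (int i = 0; i < replication_factor; i++) {
        /* a fresh mbuf holds no data yet, so append reserves room for the copy */
        char *dst = rte_pktmbuf_append(replicas[i], (uint16_t)pkt->pkt_len);
        if (dst == NULL)
            return NULL;  /* not enough tailroom in this mbuf */
        rte_memcpy(dst, rte_pktmbuf_mtod(pkt, char *), pkt->pkt_len);
        /* then rewrite the L2-L4 header fields in replicas[i] */
    }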