From: "Wiles, Keith"
To: Shailja Pandey
Cc: "dev@dpdk.org"
Date: Wed, 18 Apr 2018 18:36:34 +0000
Subject: Re: [dpdk-dev] Why packet replication is more efficient when done using memcpy() as compared to rte_mbuf_refcnt_update() function?

> On Apr 18, 2018, at 11:43 AM, Shailja Pandey wrote:
> 
> Hello,
> 
> I am doing packet replication and I need to change the Ethernet and IP header fields for each replicated packet. I did it in two different ways:
> 
> 1. Share the payload from the original packet using rte_mbuf_refcnt_update()
>    and allocate a new mbuf for the L2-L4 headers.
> 2. memcpy() the payload from the original packet into a newly created mbuf
>    and prepend the L2-L4 headers to it.
> 
> I performed experiments with varying replication factors as well as varying packet sizes and found that memcpy() performs far better than rte_mbuf_refcnt_update(). But I am not sure why this happens and what makes rte_mbuf_refcnt_update() even worse than memcpy().
> 
> Here is the sample code for both implementations:

The two code fragments do things in two different ways: the first uses a loop to create possibly more than one replica and the second does not, correct? The loop can cause a performance hit, but it should be small.

The first one uses the hdr->next pointer, which sits in the second cacheline of the mbuf header; this can and will cause a cacheline miss and degrade your performance. The second fragment does not touch hdr->next and will not cause that miss. Once the packet data goes beyond 64 bytes you hit a second cacheline as well; are you starting to see the problem here? Every time you touch a new cacheline, performance will drop unless the cacheline is prefetched into memory first, but in this case that really cannot be done easily. Count the cachelines you are touching and make sure they are the same number in each case.
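One quick way to count them is to print the field offsets; a minimal sketch, assuming rte_mbuf.h from your DPDK build is on the include path (expect next to report an offset of 64 or beyond):

    #include <stdio.h>
    #include <stddef.h>
    #include <rte_mbuf.h>

    int main(void)
    {
        /* Fields touched by the refcnt version of the code and where
         * they fall in struct rte_mbuf (64-byte cachelines on x86). */
        printf("data_len at offset %zu\n", offsetof(struct rte_mbuf, data_len));
        printf("pkt_len  at offset %zu\n", offsetof(struct rte_mbuf, pkt_len));
        printf("nb_segs  at offset %zu\n", offsetof(struct rte_mbuf, nb_segs));
        /* next lives in the second cacheline of the mbuf header */
        printf("next     at offset %zu\n", offsetof(struct rte_mbuf, next));
        return 0;
    }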
On Intel x86 systems the cacheline size is 64 bytes; other architectures have different sizes.

> 
> 1. Using rte_mbuf_refcnt_update:
> 
>    struct rte_mbuf *pkt = original_packet;  /* the packet to replicate */
> 
>    rte_pktmbuf_adj(pkt, (uint16_t)(sizeof(struct ether_hdr) + sizeof(struct ipv4_hdr)));
>    rte_pktmbuf_refcnt_update(pkt, replication_factor);
>    for (int i = 0; i < replication_factor; i++) {
>        struct rte_mbuf *hdr;
>        if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>            printf("Failed while cloning $$$\n");
>            return NULL;
>        }
>        hdr->next = pkt;
>        hdr->pkt_len = (uint16_t)(hdr->data_len + pkt->pkt_len);
>        hdr->nb_segs = (uint8_t)(pkt->nb_segs + 1);
>        /* update more metadata fields */
> 
>        /* prepend the IPv4 header first, then the Ethernet header, so the
>         * headers end up in Ethernet-then-IP order in the mbuf */
>        rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ipv4_hdr));
>        /* modify L3 fields */
> 
>        rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ether_hdr));
>        /* modify L2 fields */
>        ...
>    }
> 
> 2. Using memcpy():
> 
>    struct rte_mbuf *pkt = original_packet;  /* the packet to replicate */
>    struct rte_mbuf *hdr;
>    if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>        printf("Failed while cloning $$$\n");
>        return NULL;
>    }
> 
>    /* prepend new header */
>    char *eth_hdr = (char *)rte_pktmbuf_prepend(hdr, pkt->pkt_len);
>    if (eth_hdr == NULL) {
>        printf("panic\n");
>    }
>    char *b = rte_pktmbuf_mtod((struct rte_mbuf *)pkt, char *);
>    memcpy(eth_hdr, b, pkt->pkt_len);
>    /* change L2-L4 header fields in the new packet */
> 
> The throughput roughly halves when the packet size is increased from 64 bytes to 128 bytes and replication is done using rte_mbuf_refcnt_update(). The throughput remains more or less the same as packet size increases when replication is done using memcpy().

Why did you use memcpy() and not rte_memcpy() here, as rte_memcpy() should be faster?

I believe DPDK now has an rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know the number of packets you need to replicate up front.

> 
> Any help would be appreciated.
> 
> --
> 
> Thanks,
> Shailja

Regards,
Keith
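P.S. A rough sketch of how the bulk path might look (my assumptions: replication_factor is known up front, MAX_REPLICAS is a hypothetical upper bound, and the mbufs in header_pool have enough tailroom for the full copy; the error paths should also free the bulk allocation, omitted here for brevity):

    #include <rte_mbuf.h>
    #include <rte_memcpy.h>

    #define MAX_REPLICAS 64  /* hypothetical upper bound on replication_factor */

    struct rte_mbuf *replicas[MAX_REPLICAS];

    /* One pool operation instead of replication_factor separate allocs;
     * rte_pktmbuf_alloc_bulk() returns 0 on success. */
    if (rte_pktmbuf_alloc_bulk(header_pool, replicas, replication_factor) != 0)
        return NULL;

    for (int i = 0; i < replication_factor; i++) {
        /* a fresh mbuf holds no data yet, so append reserves room for the copy */
        char *dst = rte_pktmbuf_append(replicas[i], (uint16_t)pkt->pkt_len);
        if (dst == NULL)
            return NULL;  /* not enough tailroom in this mbuf */
        rte_memcpy(dst, rte_pktmbuf_mtod(pkt, char *), pkt->pkt_len);
        /* then rewrite the L2-L4 header fields in replicas[i] */
    }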