DPDK patches and discussions
From: "Wiles, Keith" <keith.wiles@intel.com>
To: Shailja Pandey <csz168117@iitd.ac.in>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update() function?
Date: Wed, 18 Apr 2018 18:36:34 +0000	[thread overview]
Message-ID: <F7D6AC79-958C-4C08-81FD-4926DA54A1B9@intel.com> (raw)
In-Reply-To: <598ada8c-194d-e07e-6121-5dc74cf208a1@iitd.ac.in>



> On Apr 18, 2018, at 11:43 AM, Shailja Pandey <csz168117@iitd.ac.in> wrote:
> 
> Hello,
> 
> I am doing packet replication and I need to change the Ethernet and IP header fields for each replicated packet. I did it in two different ways:
> 
> 1. Share payload from the original packet using rte_mbuf_refcnt_update
>   and allocate new mbuf for L2-L4 headers.
> 2. memcpy() payload from the original packet to newly created mbuf and
>   prepend L2-L4 headers to the mbuf.
> 
> I performed experiments with varying replication factor as well as varying packet size and found that memcpy() is performing way better than using rte_mbuf_refcnt_update(). But I am not sure why it is happening and what is making rte_mbuf_refcnt_update() even worse than memcpy().
> 
> Here is the sample code for both implementations:


The two code fragments work in two different ways: the first uses a loop to create possibly more than one replica and the second one does not, correct? The loop can cause a performance hit, but it should be small.

The first one uses the hdr->next pointer, which lives in the second cacheline of the mbuf header; touching it can and will cause a cacheline miss and degrade your performance. The second code never touches hdr->next and so avoids that miss. In addition, once the packet grows beyond 64 bytes you hit a second cacheline of packet data as well. Are you starting to see the problem here? Every time you touch a new cacheline, performance drops unless that cacheline has been prefetched first, and in this case that cannot easily be done. Count the cachelines you are touching and make sure the number is the same in each case.

On Intel x86 systems the cacheline size is 64 bytes; other architectures have different sizes.

> 
> 1. Using rte_mbuf_refcnt_update():
> 
>         struct rte_mbuf *pkt = original packet;
> 
>         rte_pktmbuf_adj(pkt, (uint16_t)(sizeof(struct ether_hdr) +
>                         sizeof(struct ipv4_hdr)));
>         rte_pktmbuf_refcnt_update(pkt, replication_factor);
>         for (int i = 0; i < replication_factor; i++) {
>                 struct rte_mbuf *hdr;
>                 if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>                         printf("Failed while cloning $$$\n");
>                         return NULL;
>                 }
>                 hdr->next = pkt;
>                 hdr->pkt_len = (uint16_t)(hdr->data_len + pkt->pkt_len);
>                 hdr->nb_segs = (uint8_t)(pkt->nb_segs + 1);
>                 /* update more metadata fields */
> 
>                 rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ether_hdr));
>                 /* modify L2 fields */
> 
>                 rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ipv4_hdr));
>                 /* modify L3 fields */
>                 .
>                 .
>                 .
>         }
> 
> 2. Using memcpy():
> 
>         struct rte_mbuf *pkt = original packet;
>         struct rte_mbuf *hdr;
> 
>         if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>                 printf("Failed while cloning $$$\n");
>                 return NULL;
>         }
> 
>         /* prepend new header */
>         char *eth_hdr = (char *)rte_pktmbuf_prepend(hdr, pkt->pkt_len);
>         if (eth_hdr == NULL) {
>                 printf("panic\n");
>         }
>         char *b = rte_pktmbuf_mtod(pkt, char *);
>         memcpy(eth_hdr, b, pkt->pkt_len);
>         /* change L2-L4 header fields in new packet */
> 
> The throughput becomes roughly half when the packet size is increased from 64 bytes to 128 bytes and replication is done using rte_mbuf_refcnt_update(). The throughput remains more or less the same as packet size increases when replication is done using memcpy().

Why did you use memcpy() and not rte_memcpy() here? rte_memcpy() should be faster.

I believe DPDK now has a rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know the number of replicas you need up front.

> 
> Any help would be appreciated.
> 
> --
> 
> Thanks,
> Shailja
> 

Regards,
Keith


Thread overview: 6+ messages
2018-04-18 16:43 Shailja Pandey
2018-04-18 18:36 ` Wiles, Keith [this message]
2018-04-19 14:30   ` Shailja Pandey
2018-04-19 16:08     ` Wiles, Keith
2018-04-23 14:12       ` Shailja Pandey
2018-04-20 10:05     ` Ananyev, Konstantin
