DPDK patches and discussions
From: Bruce Richardson <bruce.richardson@intel.com>
To: Vipin Varghese <vipin.varghese@amd.com>
Cc: <dev@dpdk.org>, <ferruh.yigit@amd.com>, <konstantin.v.ananyev@yandex.ru>
Subject: Re: [PATCH] app/testpmd: improve sse based macswap
Date: Mon, 15 Jul 2024 16:07:45 +0100	[thread overview]
Message-ID: <ZpU7QZWv0LmRoMYs@bricha3-mobl1.ger.corp.intel.com> (raw)
In-Reply-To: <20240713151949.832-1-vipin.varghese@amd.com>

On Sat, Jul 13, 2024 at 08:49:49PM +0530, Vipin Varghese wrote:
> The goal of this patch is to improve SSE macswap on x86_64 by reducing
> stalls in the backend engine. The original implementation of the SSE
> macswap loops over separate load, shuffle & store calls. By interleaving
> the SIMD instructions we can reduce the stalls from
>  - load SSE token exhaustion
>  - the shuffle-on-load dependency
> 
> Other changes which improve packets per second are
>  - filling in the MBUF offload flags, which sit on a separate cacheline,
>  - using the register keyword
> 
> Test results:
> ------------
> Platform: AMD EPYC SIENA 8594P @2.3GHz, no boost
> DPDK: 24.03
> 
> ------------------------------------------------
> TEST IO 64B: baseline <NIC : MPPs>
>  - mellanox CX-7 2*200Gbps : 42.0
>  - intel E810 1*100Gbps : 82.0
>  - intel E810 2*200Gbps (2CQ-DA2): 83.0
> ------------------------------------------------
> TEST MACSWAP 64B: <NIC : Before : After>
>  - mellanox CX-7 2*200Gbps : 31.533 : 31.90
>  - intel E810 1*100Gbps : 50.380 : 47.0
>  - intel E810 2*200Gbps (2CQ-DA2): 48.840 : 49.827
> ------------------------------------------------
> TEST MACSWAP 128B: <NIC : Before: After>
>  - mellanox CX-7 2*200Gbps: 30.946 : 31.770
>  - intel E810 1*100Gbps: 49.386 : 46.366
>  - intel E810 2*200Gbps (2CQ-DA2): 47.979 : 49.503
> ------------------------------------------------
> TEST MACSWAP 256B: <NIC : Before : After>
>  - mellanox CX-7 2*200Gbps: 32.480 : 33.150
>  - intel E810 1*100Gbps: 45.29 : 44.571
>  - intel E810 2*200Gbps (2CQ-DA2): 45.033 : 45.117
> ------------------------------------------------
> 

Hi,

Interesting patch. Do you know why we see regressions in some of the
cases above? For 1x100G at 64B and 128B packet sizes we see perf drops
of around 3 mpps, versus smaller gains in the other two cases at each
size (much smaller in the 64B case).

A couple of other questions inline below too.

Thanks,
/Bruce

> Using multiple queues and lcores there is a linear increase in MPPs.
> 
> Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
> ---
>  app/test-pmd/macswap_sse.h | 40 ++++++++++++++++++--------------------
>  1 file changed, 19 insertions(+), 21 deletions(-)
> 
> diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
> index 223f87a539..a3d3a274e5 100644
> --- a/app/test-pmd/macswap_sse.h
> +++ b/app/test-pmd/macswap_sse.h
> @@ -11,21 +11,21 @@ static inline void
>  do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
>  		struct rte_port *txp)
>  {
> -	struct rte_ether_hdr *eth_hdr[4];
> -	struct rte_mbuf *mb[4];
> +	register struct rte_ether_hdr *eth_hdr[8];
> +	register struct rte_mbuf *mb[8];

Does using "register" actually make a difference to the generated code?

Also, why increase the array sizes from 4 to 8, when the code below only
uses 4 elements of each array anyway? Is it perhaps for cache-alignment
purposes? If so, please use explicit cache-alignment attributes to
specify this rather than leaving it implicit in the array sizes.

>  	uint64_t ol_flags;
>  	int i;
>  	int r;
> -	__m128i addr0, addr1, addr2, addr3;
> +	register __m128i addr0, addr1, addr2, addr3;
>  	/**
>  	 * shuffle mask be used to shuffle the 16 bytes.
>  	 * byte 0-5 wills be swapped with byte 6-11.
>  	 * byte 12-15 will keep unchanged.
>  	 */
> -	__m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
> -					5, 4, 3, 2,
> -					1, 0, 11, 10,
> -					9, 8, 7, 6);
> +	register const __m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
> +							5, 4, 3, 2,
> +							1, 0, 11, 10,
> +							9, 8, 7, 6);
>  
>  	ol_flags = ol_flags_init(txp->dev_conf.txmode.offloads);
>  	vlan_qinq_set(pkts, nb, ol_flags,
> @@ -44,23 +44,24 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
>  
>  		mb[0] = pkts[i++];
>  		eth_hdr[0] = rte_pktmbuf_mtod(mb[0], struct rte_ether_hdr *);
> -		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
> -
>  		mb[1] = pkts[i++];
>  		eth_hdr[1] = rte_pktmbuf_mtod(mb[1], struct rte_ether_hdr *);
> -		addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
> -
> -
>  		mb[2] = pkts[i++];
>  		eth_hdr[2] = rte_pktmbuf_mtod(mb[2], struct rte_ether_hdr *);
> -		addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
> -
>  		mb[3] = pkts[i++];
>  		eth_hdr[3] = rte_pktmbuf_mtod(mb[3], struct rte_ether_hdr *);
> -		addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
>  
> +		/* Interleave load, shuffle & set */
> +		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
> +		mbuf_field_set(mb[0], ol_flags);
> +		addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
> +		mbuf_field_set(mb[1], ol_flags);
>  		addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
> +		addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
> +		mbuf_field_set(mb[2], ol_flags);
>  		addr1 = _mm_shuffle_epi8(addr1, shfl_msk);
> +		addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
> +		mbuf_field_set(mb[3], ol_flags);
>  		addr2 = _mm_shuffle_epi8(addr2, shfl_msk);
>  		addr3 = _mm_shuffle_epi8(addr3, shfl_msk);
>  
> @@ -69,25 +70,22 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
>  		_mm_storeu_si128((__m128i *)eth_hdr[2], addr2);
>  		_mm_storeu_si128((__m128i *)eth_hdr[3], addr3);
>  
> -		mbuf_field_set(mb[0], ol_flags);
> -		mbuf_field_set(mb[1], ol_flags);
> -		mbuf_field_set(mb[2], ol_flags);
> -		mbuf_field_set(mb[3], ol_flags);
>  		r -= 4;
>  	}
>  
>  	for ( ; i < nb; i++) {
>  		if (i < nb - 1)
>  			rte_prefetch0(rte_pktmbuf_mtod(pkts[i+1], void *));
> +
>  		mb[0] = pkts[i];
>  		eth_hdr[0] = rte_pktmbuf_mtod(mb[0], struct rte_ether_hdr *);
>  
>  		/* Swap dest and src mac addresses. */
>  		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
> +		/* MBUF and Ethernet are 2 separate cacheline */
> +		mbuf_field_set(mb[0], ol_flags);
>  		addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
>  		_mm_storeu_si128((__m128i *)eth_hdr[0], addr0);
> -
> -		mbuf_field_set(mb[0], ol_flags);
>  	}

Since the final loop only handles the odd elements at the end of the
buffer array, does changing the instruction ordering here really make a
perf difference?


Thread overview: 7+ messages
2024-07-13 15:19 Vipin Varghese
2024-07-15 15:07 ` Bruce Richardson [this message]
2024-07-16  6:37 Vipin Varghese
2024-07-23 16:45 ` Ferruh Yigit
2024-07-23 17:12   ` Bruce Richardson
2024-07-25 12:47     ` Varghese, Vipin
2024-07-25 12:52       ` Bruce Richardson
