* [PATCH] app/testpmd: improve sse based macswap
@ 2024-07-16 6:37 Vipin Varghese
2024-07-23 16:45 ` Ferruh Yigit
2024-08-21 14:38 ` [PATCH v2 0/3] " Vipin Varghese
0 siblings, 2 replies; 18+ messages in thread
From: Vipin Varghese @ 2024-07-16 6:37 UTC (permalink / raw)
To: dev, ferruh.yigit, bruce.richardson, konstantin.v.ananyev,
vipin.varghese
Goal of the patch is to improve SSE macswap on x86_64 by reducing
the stalls in backend engine. Original implementation of the SSE
macswap makes loop call to multiple load, shuffle & store. Using
SIMD ISA interleaving we can reduce the stalls for
- load SSE token exhaustion
- Shuffle and Load dependency
Also other changes which improves packet per second are
- Filling access to MBUF for offload flags which is separate cacheline,
- using register keyword
Build test using meson script:
``````````````````````````````
build-gcc-static
buildtools
build-gcc-shared
build-mini
build-clang-static
build-clang-shared
build-x86-generic
Test Results:
`````````````
Platform-1: AMD EPYC SIENA 8594P @2.3GHz, no boost
------------------------------------------------
TEST IO 64B: baseline <NIC : MPPs>
- mellanox CX-7 2*200Gbps : 42.0
- intel E810 1*100Gbps : 82.0
- intel E810 2*200Gbps (2CQ-DA2): 82.45
------------------------------------------------
TEST MACSWAP 64B: <NIC : Before : After>
- mellanox CX-7 2*200Gbps : 31.533 : 31.90
- intel E810 1*100Gbps : 50.380 : 47.0
- intel E810 2*200Gbps (2CQ-DA2): 48.840 : 49.827
------------------------------------------------
TEST MACSWAP 128B: <NIC : Before: After>
- mellanox CX-7 2*200Gbps: 30.946 : 31.770
- intel E810 1*100Gbps: 49.386 : 46.366
- intel E810 2*200Gbps (2CQ-DA2): 47.979 : 49.503
------------------------------------------------
TEST MACSWAP 256B: <NIC: Before: After>
- mellanox CX-7 2*200Gbps: 32.480 : 33.150
- intel E810 1 * 100Gbps: 45.29 : 44.571
- intel E810 2 * 200Gbps (2CQ-DA2): 45.033 : 45.117
------------------------------------------------
Platform-2: AMD EPYC 9554 @3.1GHz, no boost
------------------------------------------------
TEST IO 64B: baseline <NIC : MPPs>
- intel E810 2*200Gbps (2CQ-DA2): 82.49
------------------------------------------------
<NIC intel E810 2*200Gbps (2CQ-DA2): Before : After>
TEST MACSWAP: 1Q 1C1T
64B: : 45.0 : 45.54
128B: : 44.48 : 44.43
256B: : 42.0 : 41.99
+++++++++++++++++++++++++
TEST MACSWAP: 2Q 2C2T
64B: : 59.5 : 60.55
128B: : 56.78 : 58.1
256B: : 41.85 : 41.99
------------------------------------------------
Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
---
app/test-pmd/macswap_sse.h | 37 +++++++++++++++++--------------------
1 file changed, 17 insertions(+), 20 deletions(-)
diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
index 223f87a539..6e4ed21924 100644
--- a/app/test-pmd/macswap_sse.h
+++ b/app/test-pmd/macswap_sse.h
@@ -16,16 +16,16 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
uint64_t ol_flags;
int i;
int r;
- __m128i addr0, addr1, addr2, addr3;
+ register __m128i addr0, addr1, addr2, addr3;
/**
* shuffle mask be used to shuffle the 16 bytes.
* byte 0-5 wills be swapped with byte 6-11.
* byte 12-15 will keep unchanged.
*/
- __m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
- 5, 4, 3, 2,
- 1, 0, 11, 10,
- 9, 8, 7, 6);
+ register const __m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
+ 5, 4, 3, 2,
+ 1, 0, 11, 10,
+ 9, 8, 7, 6);
ol_flags = ol_flags_init(txp->dev_conf.txmode.offloads);
vlan_qinq_set(pkts, nb, ol_flags,
@@ -44,23 +44,24 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
mb[0] = pkts[i++];
eth_hdr[0] = rte_pktmbuf_mtod(mb[0], struct rte_ether_hdr *);
- addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
-
mb[1] = pkts[i++];
eth_hdr[1] = rte_pktmbuf_mtod(mb[1], struct rte_ether_hdr *);
- addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
-
-
mb[2] = pkts[i++];
eth_hdr[2] = rte_pktmbuf_mtod(mb[2], struct rte_ether_hdr *);
- addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
-
mb[3] = pkts[i++];
eth_hdr[3] = rte_pktmbuf_mtod(mb[3], struct rte_ether_hdr *);
- addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
+ /* Interleave loads and shuffle with field set */
+ addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
+ mbuf_field_set(mb[0], ol_flags);
+ addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
+ mbuf_field_set(mb[1], ol_flags);
addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
+ addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
+ mbuf_field_set(mb[2], ol_flags);
addr1 = _mm_shuffle_epi8(addr1, shfl_msk);
+ addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
+ mbuf_field_set(mb[3], ol_flags);
addr2 = _mm_shuffle_epi8(addr2, shfl_msk);
addr3 = _mm_shuffle_epi8(addr3, shfl_msk);
@@ -69,25 +70,21 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
_mm_storeu_si128((__m128i *)eth_hdr[2], addr2);
_mm_storeu_si128((__m128i *)eth_hdr[3], addr3);
- mbuf_field_set(mb[0], ol_flags);
- mbuf_field_set(mb[1], ol_flags);
- mbuf_field_set(mb[2], ol_flags);
- mbuf_field_set(mb[3], ol_flags);
r -= 4;
}
for ( ; i < nb; i++) {
- if (i < nb - 1)
+ if (i < (nb - 1))
rte_prefetch0(rte_pktmbuf_mtod(pkts[i+1], void *));
mb[0] = pkts[i];
eth_hdr[0] = rte_pktmbuf_mtod(mb[0], struct rte_ether_hdr *);
/* Swap dest and src mac addresses. */
addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
+ /* invoke field_set as it is on separate cacheline */
+ mbuf_field_set(mb[0], ol_flags);
addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
_mm_storeu_si128((__m128i *)eth_hdr[0], addr0);
-
- mbuf_field_set(mb[0], ol_flags);
}
}
--
2.34.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] app/testpmd: improve sse based macswap
2024-07-16 6:37 [PATCH] app/testpmd: improve sse based macswap Vipin Varghese
@ 2024-07-23 16:45 ` Ferruh Yigit
2024-07-23 17:12 ` Bruce Richardson
2024-08-21 14:38 ` [PATCH v2 0/3] " Vipin Varghese
1 sibling, 1 reply; 18+ messages in thread
From: Ferruh Yigit @ 2024-07-23 16:45 UTC (permalink / raw)
To: Vipin Varghese, dev, bruce.richardson, Mcnamara, John, Xu, HailinX
Cc: konstantin.v.ananyev
On 7/16/2024 7:37 AM, Vipin Varghese wrote:
> Goal of the patch is to improve SSE macswap on x86_64 by reducing
> the stalls in backend engine. Original implementation of the SSE
> macswap makes loop call to multiple load, shuffle & store. Using
> SIMD ISA interleaving we can reduce the stalls for
> - load SSE token exhaustion
> - Shuffle and Load dependency
>
> Also other changes which improves packet per second are
> - Filling access to MBUF for offload flags which is separate cacheline,
> - using register keyword
>
> Build test using meson script:
> ``````````````````````````````
>
> build-gcc-static
> buildtools
> build-gcc-shared
> build-mini
> build-clang-static
> build-clang-shared
> build-x86-generic
>
> Test Results:
> `````````````
>
> Platform-1: AMD EPYC SIENA 8594P @2.3GHz, no boost
>
> ------------------------------------------------
> TEST IO 64B: baseline <NIC : MPPs>
> - mellanox CX-7 2*200Gbps : 42.0
> - intel E810 1*100Gbps : 82.0
> - intel E810 2*200Gbps (2CQ-DA2): 82.45
> ------------------------------------------------
> TEST MACSWAP 64B: <NIC : Before : After>
> - mellanox CX-7 2*200Gbps : 31.533 : 31.90
> - intel E810 1*100Gbps : 50.380 : 47.0
> - intel E810 2*200Gbps (2CQ-DA2): 48.840 : 49.827
> ------------------------------------------------
> TEST MACSWAP 128B: <NIC : Before: After>
> - mellanox CX-7 2*200Gbps: 30.946 : 31.770
> - intel E810 1*100Gbps: 49.386 : 46.366
> - intel E810 2*200Gbps (2CQ-DA2): 47.979 : 49.503
> ------------------------------------------------
> TEST MACSWAP 256B: <NIC: Before: After>
> - mellanox CX-7 2*200Gbps: 32.480 : 33.150
> - intel E810 1 * 100Gbps: 45.29 : 44.571
> - intel E810 2 * 200Gbps (2CQ-DA2): 45.033 : 45.117
> ------------------------------------------------
>
> Platform-2: AMD EPYC 9554 @3.1GHz, no boost
>
> ------------------------------------------------
> TEST IO 64B: baseline <NIC : MPPs>
> - intel E810 2*200Gbps (2CQ-DA2): 82.49
> ------------------------------------------------
> <NIC intel E810 2*200Gbps (2CQ-DA2): Before : After>
> TEST MACSWAP: 1Q 1C1T
> 64B: : 45.0 : 45.54
> 128B: : 44.48 : 44.43
> 256B: : 42.0 : 41.99
> +++++++++++++++++++++++++
> TEST MACSWAP: 2Q 2C2T
> 64B: : 59.5 : 60.55
> 128B: : 56.78 : 58.1
> 256B: : 41.85 : 41.99
> ------------------------------------------------
>
> Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
>
Hi Bruce, John,
Can you please help testing macswap performance with this patch on Intel
platforms, to be sure it is not causing regression?
Other option is to get this patch for -rc3 and tested there, with the
condition to remove it in any regression, if this help testing the patch?
Thanks,
ferruh
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] app/testpmd: improve sse based macswap
2024-07-23 16:45 ` Ferruh Yigit
@ 2024-07-23 17:12 ` Bruce Richardson
2024-07-25 12:47 ` Varghese, Vipin
0 siblings, 1 reply; 18+ messages in thread
From: Bruce Richardson @ 2024-07-23 17:12 UTC (permalink / raw)
To: Ferruh Yigit
Cc: Vipin Varghese, dev, Mcnamara, John, Xu, HailinX, konstantin.v.ananyev
On Tue, Jul 23, 2024 at 05:45:57PM +0100, Ferruh Yigit wrote:
> On 7/16/2024 7:37 AM, Vipin Varghese wrote:
> > Goal of the patch is to improve SSE macswap on x86_64 by reducing
> > the stalls in backend engine. Original implementation of the SSE
> > macswap makes loop call to multiple load, shuffle & store. Using
> > SIMD ISA interleaving we can reduce the stalls for
> > - load SSE token exhaustion
> > - Shuffle and Load dependency
> >
> > Also other changes which improves packet per second are
> > - Filling access to MBUF for offload flags which is separate cacheline,
> > - using register keyword
> >
> > Build test using meson script:
> > ``````````````````````````````
> >
> > build-gcc-static
> > buildtools
> > build-gcc-shared
> > build-mini
> > build-clang-static
> > build-clang-shared
> > build-x86-generic
> >
> > Test Results:
> > `````````````
> >
> > Platform-1: AMD EPYC SIENA 8594P @2.3GHz, no boost
> >
> > ------------------------------------------------
> > TEST IO 64B: baseline <NIC : MPPs>
> > - mellanox CX-7 2*200Gbps : 42.0
> > - intel E810 1*100Gbps : 82.0
> > - intel E810 2*200Gbps (2CQ-DA2): 82.45
> > ------------------------------------------------
> > TEST MACSWAP 64B: <NIC : Before : After>
> > - mellanox CX-7 2*200Gbps : 31.533 : 31.90
> > - intel E810 1*100Gbps : 50.380 : 47.0
> > - intel E810 2*200Gbps (2CQ-DA2): 48.840 : 49.827
> > ------------------------------------------------
> > TEST MACSWAP 128B: <NIC : Before: After>
> > - mellanox CX-7 2*200Gbps: 30.946 : 31.770
> > - intel E810 1*100Gbps: 49.386 : 46.366
> > - intel E810 2*200Gbps (2CQ-DA2): 47.979 : 49.503
> > ------------------------------------------------
> > TEST MACSWAP 256B: <NIC: Before: After>
> > - mellanox CX-7 2*200Gbps: 32.480 : 33.150
> > - intel E810 1 * 100Gbps: 45.29 : 44.571
> > - intel E810 2 * 200Gbps (2CQ-DA2): 45.033 : 45.117
> > ------------------------------------------------
> >
> > Platform-2: AMD EPYC 9554 @3.1GHz, no boost
> >
> > ------------------------------------------------
> > TEST IO 64B: baseline <NIC : MPPs>
> > - intel E810 2*200Gbps (2CQ-DA2): 82.49
> > ------------------------------------------------
> > <NIC intel E810 2*200Gbps (2CQ-DA2): Before : After>
> > TEST MACSWAP: 1Q 1C1T
> > 64B: : 45.0 : 45.54
> > 128B: : 44.48 : 44.43
> > 256B: : 42.0 : 41.99
> > +++++++++++++++++++++++++
> > TEST MACSWAP: 2Q 2C2T
> > 64B: : 59.5 : 60.55
> > 128B: : 56.78 : 58.1
> > 256B: : 41.85 : 41.99
> > ------------------------------------------------
> >
> > Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
> >
>
> Hi Bruce, John,
>
> Can you please help testing macswap performance with this patch on Intel
> platforms, to be sure it is not causing regression?
>
Hi Ferruh,
We can try and get some Intel numbers for you, but I think at this point it
is better deferred to 24.11 due to lack of discussion and analysis of the
numbers. This is because the numbers above already show that it is causing
regressions - in fact many of the regressions are larger than the benefits
shown. This may be acceptable, but it would imply that we shouldn't be too
hasty in applying the patch.
Regards,
/Bruce
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] app/testpmd: improve sse based macswap
2024-07-23 17:12 ` Bruce Richardson
@ 2024-07-25 12:47 ` Varghese, Vipin
2024-07-25 12:52 ` Bruce Richardson
0 siblings, 1 reply; 18+ messages in thread
From: Varghese, Vipin @ 2024-07-25 12:47 UTC (permalink / raw)
To: Bruce Richardson, Ferruh Yigit
Cc: dev, Mcnamara, John, Xu, HailinX, konstantin.v.ananyev
[-- Attachment #1: Type: text/plain, Size: 3591 bytes --]
Hi Bruce,
Thanks for highlighting the variance. We found this was an internal test
bed configuration issue. We are sharing the next version of the same
patch with updated numbers.
On 7/23/2024 10:42 PM, Bruce Richardson wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
> On Tue, Jul 23, 2024 at 05:45:57PM +0100, Ferruh Yigit wrote:
>> On 7/16/2024 7:37 AM, Vipin Varghese wrote:
>>> Goal of the patch is to improve SSE macswap on x86_64 by reducing
>>> the stalls in backend engine. Original implementation of the SSE
>>> macswap makes loop call to multiple load, shuffle & store. Using
>>> SIMD ISA interleaving we can reduce the stalls for
>>> - load SSE token exhaustion
>>> - Shuffle and Load dependency
>>>
>>> Also other changes which improves packet per second are
>>> - Filling access to MBUF for offload flags which is separate cacheline,
>>> - using register keyword
>>>
>>> Build test using meson script:
>>> ``````````````````````````````
>>>
>>> build-gcc-static
>>> buildtools
>>> build-gcc-shared
>>> build-mini
>>> build-clang-static
>>> build-clang-shared
>>> build-x86-generic
>>>
>>> Test Results:
>>> `````````````
>>>
>>> Platform-1: AMD EPYC SIENA 8594P @2.3GHz, no boost
>>>
>>> ------------------------------------------------
>>> TEST IO 64B: baseline <NIC : MPPs>
>>> - mellanox CX-7 2*200Gbps : 42.0
>>> - intel E810 1*100Gbps : 82.0
>>> - intel E810 2*200Gbps (2CQ-DA2): 82.45
>>> ------------------------------------------------
>>> TEST MACSWAP 64B: <NIC : Before : After>
>>> - mellanox CX-7 2*200Gbps : 31.533 : 31.90
>>> - intel E810 1*100Gbps : 50.380 : 47.0
>>> - intel E810 2*200Gbps (2CQ-DA2): 48.840 : 49.827
>>> ------------------------------------------------
>>> TEST MACSWAP 128B: <NIC : Before: After>
>>> - mellanox CX-7 2*200Gbps: 30.946 : 31.770
>>> - intel E810 1*100Gbps: 49.386 : 46.366
>>> - intel E810 2*200Gbps (2CQ-DA2): 47.979 : 49.503
>>> ------------------------------------------------
>>> TEST MACSWAP 256B: <NIC: Before: After>
>>> - mellanox CX-7 2*200Gbps: 32.480 : 33.150
>>> - intel E810 1 * 100Gbps: 45.29 : 44.571
>>> - intel E810 2 * 200Gbps (2CQ-DA2): 45.033 : 45.117
>>> ------------------------------------------------
>>>
>>> Platform-2: AMD EPYC 9554 @3.1GHz, no boost
>>>
>>> ------------------------------------------------
>>> TEST IO 64B: baseline <NIC : MPPs>
>>> - intel E810 2*200Gbps (2CQ-DA2): 82.49
>>> ------------------------------------------------
>>> <NIC intel E810 2*200Gbps (2CQ-DA2): Before : After>
>>> TEST MACSWAP: 1Q 1C1T
>>> 64B: : 45.0 : 45.54
>>> 128B: : 44.48 : 44.43
>>> 256B: : 42.0 : 41.99
>>> +++++++++++++++++++++++++
>>> TEST MACSWAP: 2Q 2C2T
>>> 64B: : 59.5 : 60.55
>>> 128B: : 56.78 : 58.1
>>> 256B: : 41.85 : 41.99
>>> ------------------------------------------------
>>>
>>> Signed-off-by: Vipin Varghese<vipin.varghese@amd.com>
>>>
>> Hi Bruce, John,
>>
>> Can you please help testing macswap performance with this patch on Intel
>> platforms, to be sure it is not causing regression?
>>
> Hi Ferruh,
>
> We can try and get some Intel numbers for you, but I think at this point it
> is better deferred to 24.11 due to lack of discussion and analysis of the
> numbers. This is because the numbers above already show that it is causing
> regressions - in fact many of the regressions are larger than the benefits
> shown. This may be acceptable, but it would imply that we shouldn't be too
> hasty in applying the patch.
>
> Regards,
> /Bruce
[-- Attachment #2: Type: text/html, Size: 4183 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] app/testpmd: improve sse based macswap
2024-07-25 12:47 ` Varghese, Vipin
@ 2024-07-25 12:52 ` Bruce Richardson
0 siblings, 0 replies; 18+ messages in thread
From: Bruce Richardson @ 2024-07-25 12:52 UTC (permalink / raw)
To: Varghese, Vipin
Cc: Ferruh Yigit, dev, Mcnamara, John, Xu, HailinX, konstantin.v.ananyev
On Thu, Jul 25, 2024 at 06:17:35PM +0530, Varghese, Vipin wrote:
> Hi Bruce,
> Thanks for highlighting the variance. We found this was an internal
> test bed configuration issue. We are sharing the next version of the
> same patch with updated numbers.
>
Great, thanks for the update.
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 0/3] app/testpmd: improve sse based macswap
2024-07-16 6:37 [PATCH] app/testpmd: improve sse based macswap Vipin Varghese
2024-07-23 16:45 ` Ferruh Yigit
@ 2024-08-21 14:38 ` Vipin Varghese
2024-08-21 14:38 ` [PATCH v2 1/3] app/testpmd: add register keyword Vipin Varghese
` (3 more replies)
1 sibling, 4 replies; 18+ messages in thread
From: Vipin Varghese @ 2024-08-21 14:38 UTC (permalink / raw)
To: ferruh.yigit, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
Goal of the patch series is to improve SSE macswap on x86_64 by
reducing the stalls in backend engine. Original implementation of
the SSE-mac-swap makes loop call to multiple load, shuffle & store.
Using SIMD ISA interleaving, register variable and reducing L1 & L2
cache eviction, we can reduce the stalls for
- load SSE token exhaustion
- Shuffle and Load dependency
Build test using meson script:
``````````````````````````````
build-gcc-static
buildtools
build-gcc-shared
build-mini
build-clang-static
build-clang-shared
build-x86-generic
Test Results:
`````````````
Platform-1: AMD EPYC SIENA 8594P @2.3GHz, no boost
Platform-2: AMD EPYC 9554 @3.1GHz, no boost
NIC:
1) mellanox CX-7 1*200Gbps
2) intel E810 1*100Gbps
3) intel E810 2*200Gbps (2CQ-DA2) - loopback
4) braodcom P2100 2*100Gbps - loopback
------------------------------------------------
TEST IO 64B: baseline <NIC : MPPs>
- NIC-1: 42.0
- NIC-2: 82.0
- NIC-3: 82.45
- NIC-3: 47.03
------------------------------------------------
TEST MACSWAP 64B: <NIC : Before : After>
- NIC-1: 31.533 : 31.90
- NIC-2: 48.0 : 48.9
- NIC-3: 48.840 : 49.827
- NIC-4: 44.3 : 45.5
------------------------------------------------
TEST MACSWAP 128B: <NIC : Before: After>
- NIC-1: 30.946 : 31.770
- NIC-2: 47.4 : 48.3
- NIC-3: 47.979 : 48.503
- NIC-4: 41.53 : 44.59
------------------------------------------------
TEST MACSWAP 256B: <NIC: Before: After>
- NIC-1: 32.480 : 33.150
- NIC-2: 45.29 : 45.571
- NIC-3: 45.033 : 45.117
- NIC-4: 36.49 : 37.5
------------------------------------------------
------------------------------------------------
TEST IO 64B: baseline <NIC : MPPs>
- intel E810 2*200Gbps (2CQ-DA2): 82.49
------------------------------------------------
<NIC intel E810 2*200Gbps (2CQ-DA2): Before : After>
TEST MACSWAP: 1Q 1C1T
64B: : 45.0 : 45.54
128B: : 44.48 : 44.43
256B: : 42.0 : 41.99
+++++++++++++++++++++++++
TEST MACSWAP: 2Q 2C2T
64B: : 59.5 : 60.55
128B: : 56.78 : 58.1
256B: : 41.85 : 41.99
------------------------------------------------
Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
Vipin Varghese (3):
app/testpmd: add register keyword
app/testpmd: move offload update
app/testpmd: interleave SSE SIMD
app/test-pmd/macswap_sse.h | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 1/3] app/testpmd: add register keyword
2024-08-21 14:38 ` [PATCH v2 0/3] " Vipin Varghese
@ 2024-08-21 14:38 ` Vipin Varghese
2024-08-21 14:55 ` Stephen Hemminger
2024-08-21 14:38 ` [PATCH v2 2/3] app/testpmd: move offload update Vipin Varghese
` (2 subsequent siblings)
3 siblings, 1 reply; 18+ messages in thread
From: Vipin Varghese @ 2024-08-21 14:38 UTC (permalink / raw)
To: ferruh.yigit, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
Currently SSE SIMD variables are declared as stack variables. Allowing
the use of keyword register for shuffle mask and address variables,
improves the mac-swap Mpps by 1 for single queue.
Test Result:
* Platform: AMD EPYC 9554 @3.1GHz, no boost
* Test scenarios: TEST-PMD 64B IO vs MAC-SWAP
* NIC: broadcom P2100: loopback 2*100Gbps
<mode : Mpps Ingress: Mpps Egress>
------------------------------------------------
- IO: 47.23 : 46.0
- MAC-SWAP original: 45.75 : 43.8
- MAC-SWAP register mod 45.73 : 44.83
Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
---
app/test-pmd/macswap_sse.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
index 223f87a539..29088843b7 100644
--- a/app/test-pmd/macswap_sse.h
+++ b/app/test-pmd/macswap_sse.h
@@ -16,13 +16,13 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
uint64_t ol_flags;
int i;
int r;
- __m128i addr0, addr1, addr2, addr3;
+ register __m128i addr0, addr1, addr2, addr3;
/**
* shuffle mask be used to shuffle the 16 bytes.
* byte 0-5 wills be swapped with byte 6-11.
* byte 12-15 will keep unchanged.
*/
- __m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
+ register const __m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
5, 4, 3, 2,
1, 0, 11, 10,
9, 8, 7, 6);
--
2.34.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 2/3] app/testpmd: move offload update
2024-08-21 14:38 ` [PATCH v2 0/3] " Vipin Varghese
2024-08-21 14:38 ` [PATCH v2 1/3] app/testpmd: add register keyword Vipin Varghese
@ 2024-08-21 14:38 ` Vipin Varghese
2024-08-21 14:38 ` [PATCH v2 3/3] app/testpmd: interleave SSE SIMD Vipin Varghese
2024-10-04 5:08 ` [PATCH v2 0/3] app/testpmd: improve sse based macswap Ferruh Yigit
3 siblings, 0 replies; 18+ messages in thread
From: Vipin Varghese @ 2024-08-21 14:38 UTC (permalink / raw)
To: ferruh.yigit, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
Moving the offload flag update from end to start of the loop,
helps to reduce L1 or L2 cache evictions and amortize shuffle.
This helps to improve RX packet in mac-swap processing.
Test Result:
* Platform: AMD EPYC 9554 @3.1GHz, no boost
* Test scenarios: TEST-PMD 64B IO vs MAC-SWAP
* NIC: broadcom P2100: loopback 2*100Gbps
<mode : Mpps Ingress: Mpps Egress>
------------------------------------------------
- MAC-SWAP original: 45.75 : 43.8
- MAC-SWAP register mod: 45.73 : 44.83
- MAC-SWAP register+ofl modified: 46.36 : 44.79
Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
---
app/test-pmd/macswap_sse.h | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
index 29088843b7..67ff7fdfbb 100644
--- a/app/test-pmd/macswap_sse.h
+++ b/app/test-pmd/macswap_sse.h
@@ -45,19 +45,22 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
mb[0] = pkts[i++];
eth_hdr[0] = rte_pktmbuf_mtod(mb[0], struct rte_ether_hdr *);
addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
+ mbuf_field_set(mb[0], ol_flags);
mb[1] = pkts[i++];
eth_hdr[1] = rte_pktmbuf_mtod(mb[1], struct rte_ether_hdr *);
addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
-
+ mbuf_field_set(mb[1], ol_flags);
mb[2] = pkts[i++];
eth_hdr[2] = rte_pktmbuf_mtod(mb[2], struct rte_ether_hdr *);
addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
+ mbuf_field_set(mb[2], ol_flags);
mb[3] = pkts[i++];
eth_hdr[3] = rte_pktmbuf_mtod(mb[3], struct rte_ether_hdr *);
addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
+ mbuf_field_set(mb[3], ol_flags);
addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
addr1 = _mm_shuffle_epi8(addr1, shfl_msk);
@@ -69,10 +72,6 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
_mm_storeu_si128((__m128i *)eth_hdr[2], addr2);
_mm_storeu_si128((__m128i *)eth_hdr[3], addr3);
- mbuf_field_set(mb[0], ol_flags);
- mbuf_field_set(mb[1], ol_flags);
- mbuf_field_set(mb[2], ol_flags);
- mbuf_field_set(mb[3], ol_flags);
r -= 4;
}
@@ -84,10 +83,10 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
/* Swap dest and src mac addresses. */
addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
+ mbuf_field_set(mb[0], ol_flags);
+
addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
_mm_storeu_si128((__m128i *)eth_hdr[0], addr0);
-
- mbuf_field_set(mb[0], ol_flags);
}
}
--
2.34.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 3/3] app/testpmd: interleave SSE SIMD
2024-08-21 14:38 ` [PATCH v2 0/3] " Vipin Varghese
2024-08-21 14:38 ` [PATCH v2 1/3] app/testpmd: add register keyword Vipin Varghese
2024-08-21 14:38 ` [PATCH v2 2/3] app/testpmd: move offload update Vipin Varghese
@ 2024-08-21 14:38 ` Vipin Varghese
2024-10-04 5:08 ` [PATCH v2 0/3] app/testpmd: improve sse based macswap Ferruh Yigit
3 siblings, 0 replies; 18+ messages in thread
From: Vipin Varghese @ 2024-08-21 14:38 UTC (permalink / raw)
To: ferruh.yigit, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
Interleaving SSE SIMD load, shuffle, and store, helps to
improve the overall mac-swapp Mpps for both RX and TX.
Test Result:
* Platform: AMD EPYC 9554 @3.1GHz, no boost
* Test scenarios: TEST-PMD 64B IO vs MAC-SWAP
* NIC: broadcom P2100: loopback 2*100Gbps
<mode : Mpps Ingress: Mpps Egress>
------------------------------------------------
- MAC-SWAP original: 45.75 : 43.8
- MAC-SWAP register mod: 45.73 : 44.83
- MAC-SWAP register+ofl mod: 46.36 : 44.79
- MAC-SWAP register+ofl+interleave mod: 46.0 : 45.1
Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
---
app/test-pmd/macswap_sse.h | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
index 67ff7fdfbb..1f547388b7 100644
--- a/app/test-pmd/macswap_sse.h
+++ b/app/test-pmd/macswap_sse.h
@@ -52,23 +52,25 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
mbuf_field_set(mb[1], ol_flags);
+ addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
+
mb[2] = pkts[i++];
eth_hdr[2] = rte_pktmbuf_mtod(mb[2], struct rte_ether_hdr *);
addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
mbuf_field_set(mb[2], ol_flags);
+ addr1 = _mm_shuffle_epi8(addr1, shfl_msk);
+ _mm_storeu_si128((__m128i *)eth_hdr[0], addr0);
+
mb[3] = pkts[i++];
eth_hdr[3] = rte_pktmbuf_mtod(mb[3], struct rte_ether_hdr *);
addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
mbuf_field_set(mb[3], ol_flags);
- addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
- addr1 = _mm_shuffle_epi8(addr1, shfl_msk);
addr2 = _mm_shuffle_epi8(addr2, shfl_msk);
- addr3 = _mm_shuffle_epi8(addr3, shfl_msk);
-
- _mm_storeu_si128((__m128i *)eth_hdr[0], addr0);
_mm_storeu_si128((__m128i *)eth_hdr[1], addr1);
+
+ addr3 = _mm_shuffle_epi8(addr3, shfl_msk);
_mm_storeu_si128((__m128i *)eth_hdr[2], addr2);
_mm_storeu_si128((__m128i *)eth_hdr[3], addr3);
--
2.34.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/3] app/testpmd: add register keyword
2024-08-21 14:38 ` [PATCH v2 1/3] app/testpmd: add register keyword Vipin Varghese
@ 2024-08-21 14:55 ` Stephen Hemminger
2024-08-27 15:32 ` Varghese, Vipin
0 siblings, 1 reply; 18+ messages in thread
From: Stephen Hemminger @ 2024-08-21 14:55 UTC (permalink / raw)
To: Vipin Varghese
Cc: ferruh.yigit, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
On Wed, 21 Aug 2024 20:08:55 +0530
Vipin Varghese <vipin.varghese@amd.com> wrote:
> diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
> index 223f87a539..29088843b7 100644
> --- a/app/test-pmd/macswap_sse.h
> +++ b/app/test-pmd/macswap_sse.h
> @@ -16,13 +16,13 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
> uint64_t ol_flags;
> int i;
> int r;
> - __m128i addr0, addr1, addr2, addr3;
> + register __m128i addr0, addr1, addr2, addr3;
Some compilers treat register as a no-op. Are you sure? Did you check with godbolt.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/3] app/testpmd: add register keyword
2024-08-21 14:55 ` Stephen Hemminger
@ 2024-08-27 15:32 ` Varghese, Vipin
2024-08-27 17:39 ` Stephen Hemminger
0 siblings, 1 reply; 18+ messages in thread
From: Varghese, Vipin @ 2024-08-27 15:32 UTC (permalink / raw)
To: Stephen Hemminger
Cc: ferruh.yigit, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
[-- Attachment #1: Type: text/plain, Size: 1228 bytes --]
On 8/21/2024 8:25 PM, Stephen Hemminger wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
> On Wed, 21 Aug 2024 20:08:55 +0530
> Vipin Varghese<vipin.varghese@amd.com> wrote:
>
>> diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
>> index 223f87a539..29088843b7 100644
>> --- a/app/test-pmd/macswap_sse.h
>> +++ b/app/test-pmd/macswap_sse.h
>> @@ -16,13 +16,13 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
>> uint64_t ol_flags;
>> int i;
>> int r;
>> - __m128i addr0, addr1, addr2, addr3;
>> + register __m128i addr0, addr1, addr2, addr3;
> Some compilers treat register as a no-op. Are you sure? Did you check with godbolt.
Thank you Stephen, I have tested the code changes on Linux using GCC and
Clang compiler.
In both cases in Linux environment, we have seen the the values loaded
onto register `xmm`.
```
registerconst__m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12, 5, 4, 3, 2,
1, 0, 11, 10, 9, 8, 7, 6);
vmovdqaxmm0, xmmwordptr[rip+ .LCPI0_0]
```
Both cases we have performance improvement.
Can you please help us understand if we have missed out something?
[-- Attachment #2: Type: text/html, Size: 4172 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/3] app/testpmd: add register keyword
2024-08-27 15:32 ` Varghese, Vipin
@ 2024-08-27 17:39 ` Stephen Hemminger
2024-08-29 8:14 ` Varghese, Vipin
2024-09-03 11:52 ` Konstantin Ananyev
0 siblings, 2 replies; 18+ messages in thread
From: Stephen Hemminger @ 2024-08-27 17:39 UTC (permalink / raw)
To: Varghese, Vipin
Cc: ferruh.yigit, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
On Tue, 27 Aug 2024 21:02:00 +0530
"Varghese, Vipin" <vipin.varghese@amd.com> wrote:
> On 8/21/2024 8:25 PM, Stephen Hemminger wrote:
> > Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
> >
> >
> > On Wed, 21 Aug 2024 20:08:55 +0530
> > Vipin Varghese<vipin.varghese@amd.com> wrote:
> >
> >> diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
> >> index 223f87a539..29088843b7 100644
> >> --- a/app/test-pmd/macswap_sse.h
> >> +++ b/app/test-pmd/macswap_sse.h
> >> @@ -16,13 +16,13 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
> >> uint64_t ol_flags;
> >> int i;
> >> int r;
> >> - __m128i addr0, addr1, addr2, addr3;
> >> + register __m128i addr0, addr1, addr2, addr3;
> > Some compilers treat register as a no-op. Are you sure? Did you check with godbolt.
>
> Thank you Stephen, I have tested the code changes on Linux using GCC and
> Clang compiler.
>
> In both cases in Linux environment, we have seen the the values loaded
> onto register `xmm`.
>
> ```
> registerconst__m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12, 5, 4, 3, 2,
> 1, 0, 11, 10, 9, 8, 7, 6);
> vmovdqaxmm0, xmmwordptr[rip+ .LCPI0_0]
>
> ```
>
> Both cases we have performance improvement.
>
>
> Can you please help us understand if we have missed out something?
Ok, not sure why compiler would not decide to already use a register here?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/3] app/testpmd: add register keyword
2024-08-27 17:39 ` Stephen Hemminger
@ 2024-08-29 8:14 ` Varghese, Vipin
2024-09-03 11:52 ` Konstantin Ananyev
1 sibling, 0 replies; 18+ messages in thread
From: Varghese, Vipin @ 2024-08-29 8:14 UTC (permalink / raw)
To: Stephen Hemminger
Cc: ferruh.yigit, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
[-- Attachment #1: Type: text/plain, Size: 232 bytes --]
On 8/27/2024 11:09 PM, Stephen Hemminger wrote:
> not sure why compiler would not decide to already use a register here?
Hi Stephen, I totally agree with your point, but in practice it did not
use registers for code generation.
[-- Attachment #2: Type: text/html, Size: 584 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH v2 1/3] app/testpmd: add register keyword
2024-08-27 17:39 ` Stephen Hemminger
2024-08-29 8:14 ` Varghese, Vipin
@ 2024-09-03 11:52 ` Konstantin Ananyev
2024-09-06 13:02 ` Varghese, Vipin
1 sibling, 1 reply; 18+ messages in thread
From: Konstantin Ananyev @ 2024-09-03 11:52 UTC (permalink / raw)
To: Stephen Hemminger, Varghese, Vipin
Cc: ferruh.yigit, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
> > >
> > >> diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
> > >> index 223f87a539..29088843b7 100644
> > >> --- a/app/test-pmd/macswap_sse.h
> > >> +++ b/app/test-pmd/macswap_sse.h
> > >> @@ -16,13 +16,13 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
> > >> uint64_t ol_flags;
> > >> int i;
> > >> int r;
> > >> - __m128i addr0, addr1, addr2, addr3;
> > >> + register __m128i addr0, addr1, addr2, addr3;
> > > Some compilers treat register as a no-op. Are you sure? Did you check with godbolt.
> >
> > Thank you Stephen, I have tested the code changes on Linux using GCC and
> > Clang compiler.
> >
> > In both cases in Linux environment, we have seen the the values loaded
> > onto register `xmm`.
> >
> > ```
> > registerconst__m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12, 5, 4, 3, 2,
> > 1, 0, 11, 10, 9, 8, 7, 6);
> > vmovdqaxmm0, xmmwordptr[rip+ .LCPI0_0]
Yep, that what I would probably expect: one time load before the loop starts, right?
Curious what exactly it would generate then if 'register' keyword is missed?
BTW, on my box, gcc-11 with '-O3 -msse4.2 ...' I am seeing expected behavior without 'register' keyword.
Is it some particular compiler version that misbehaves?
> >
> > ```
> >
> > Both cases we have performance improvement.
> >
> >
> > Can you please help us understand if we have missed out something?
>
> Ok, not sure why compiler would not decide to already use a register here?
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH v2 1/3] app/testpmd: add register keyword
2024-09-03 11:52 ` Konstantin Ananyev
@ 2024-09-06 13:02 ` Varghese, Vipin
2024-10-04 5:04 ` Ferruh Yigit
0 siblings, 1 reply; 18+ messages in thread
From: Varghese, Vipin @ 2024-09-06 13:02 UTC (permalink / raw)
To: Konstantin Ananyev, Stephen Hemminger
Cc: Yigit, Ferruh, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
[-- Attachment #1: Type: text/plain, Size: 2244 bytes --]
[AMD Official Use Only - AMD Internal Distribution Only]
<snipped>
> > > >> --- a/app/test-pmd/macswap_sse.h
> > > >> +++ b/app/test-pmd/macswap_sse.h
> > > >> @@ -16,13 +16,13 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t
> nb,
> > > >> uint64_t ol_flags;
> > > >> int i;
> > > >> int r;
> > > >> - __m128i addr0, addr1, addr2, addr3;
> > > >> + register __m128i addr0, addr1, addr2, addr3;
> > > > Some compilers treat register as a no-op. Are you sure? Did you check
> with godbolt.
> > >
> > > Thank you Stephen, I have tested the code changes on Linux using GCC
> > > and Clang compiler.
> > >
> > > In both cases in Linux environment, we have seen the the values
> > > loaded onto register `xmm`.
> > >
> > > ```
> > > registerconst__m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12, 5, 4,
> > > 3, 2, 1, 0, 11, 10, 9, 8, 7, 6); vmovdqaxmm0, xmmwordptr[rip+
> > > .LCPI0_0]
>
> Yep, that what I would probably expect: one time load before the loop starts,
> right?
> Curious what exactly it would generate then if 'register' keyword is missed?
> BTW, on my box, gcc-11 with '-O3 -msse4.2 ...' I am seeing expected
> behavior without 'register' keyword.
> Is it some particular compiler version that misbehaves?
Thank you, Konstantin, for this pointer. I have been trying this understand this a bit more internally. Here are my observations
1. shuf simd ISA works on XMM register only.
2. Any values from variables has to be loaded to `xmm` register before processing.
3. when compiled for `-march=native` with compiler not aware (SoC Arch gcc weights) without patch might have generating with ` movzx eax, BYTE PTR [rbp-48]`
4. when register keyword is applied for both shufl_mask and addr, the compiler generates trying to get the variables directly into xmm using ` vmovdqu (%rsi),%xmm1`
So, I think you are right, from gcc12.3 and gcc 13.1 which supports `-march=znver4` this problem will not come.
>
> > >
> > > ```
> > >
> > > Both cases we have performance improvement.
> > >
> > >
> > > Can you please help us understand if we have missed out something?
> >
> > Ok, not sure why compiler would not decide to already use a register here?
[-- Attachment #2: Type: text/html, Size: 7828 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/3] app/testpmd: add register keyword
2024-09-06 13:02 ` Varghese, Vipin
@ 2024-10-04 5:04 ` Ferruh Yigit
0 siblings, 0 replies; 18+ messages in thread
From: Ferruh Yigit @ 2024-10-04 5:04 UTC (permalink / raw)
To: Varghese, Vipin, Konstantin Ananyev, Stephen Hemminger
Cc: bruce.richardson, konstantin.v.ananyev, aman.deep.singh, dev
On 9/6/2024 2:02 PM, Varghese, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> <snipped>
>
>> > > >> --- a/app/test-pmd/macswap_sse.h
>> > > >> +++ b/app/test-pmd/macswap_sse.h
>> > > >> @@ -16,13 +16,13 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t
>> nb,
>> > > >> uint64_t ol_flags;
>> > > >> int i;
>> > > >> int r;
>> > > >> - __m128i addr0, addr1, addr2, addr3;
>> > > >> + register __m128i addr0, addr1, addr2, addr3;
>> > > > Some compilers treat register as a no-op. Are you sure? Did you check
>> with godbolt.
>> > >
>> > > Thank you Stephen, I have tested the code changes on Linux using GCC
>> > > and Clang compiler.
>> > >
>> > > In both cases in Linux environment, we have seen the the values
>> > > loaded onto register `xmm`.
>> > >
>> > > ```
>> > > registerconst__m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12, 5, 4,
>> > > 3, 2, 1, 0, 11, 10, 9, 8, 7, 6); vmovdqaxmm0, xmmwordptr[rip+
>> > > .LCPI0_0]
>>
>> Yep, that what I would probably expect: one time load before the loop starts,
>> right?
>> Curious what exactly it would generate then if 'register' keyword is missed?
>> BTW, on my box, gcc-11 with '-O3 -msse4.2 ...' I am seeing expected
>> behavior without 'register' keyword.
>> Is it some particular compiler version that misbehaves?
>
> Thank you, Konstantin, for this pointer. I have been trying this
> understand this a bit more internally. Here are my observations
>
> 1. shuf simd ISA works on XMM register only.
> 2. Any values from variables has to be loaded to `xmm` register before
> processing.
> 3. when compiled for `-march=native` with compiler not aware (SoC Arch
> gcc weights) without patch might have generating with `movzx eax, BYTE
> PTR [rbp-48]`
> 4. when register keyword is applied for both shufl_mask and addr, the
> compiler generates trying to get the variables directly into xmm using `
> vmovdqu (%rsi),%xmm1`
>
> So, I think you are right, from gcc12.3 and gcc 13.1 which supports `-
> march=znver4` this problem will not come.
>
Hi Konstantin, Stephen,
There is no negative impact of adding 'register' keyword, right? At
worst it is useless, but Vipin can demonstrate that it has benefit for
some cases, so I think OK to get it.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 0/3] app/testpmd: improve sse based macswap
2024-08-21 14:38 ` [PATCH v2 0/3] " Vipin Varghese
` (2 preceding siblings ...)
2024-08-21 14:38 ` [PATCH v2 3/3] app/testpmd: interleave SSE SIMD Vipin Varghese
@ 2024-10-04 5:08 ` Ferruh Yigit
2024-10-08 1:12 ` Ferruh Yigit
3 siblings, 1 reply; 18+ messages in thread
From: Ferruh Yigit @ 2024-10-04 5:08 UTC (permalink / raw)
To: Vipin Varghese, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
On 8/21/2024 3:38 PM, Vipin Varghese wrote:
> Goal of the patch series is to improve SSE macswap on x86_64 by
> reducing the stalls in backend engine. Original implementation of
> the SSE-mac-swap makes loop call to multiple load, shuffle & store.
>
> Using SIMD ISA interleaving, register variable and reducing L1 & L2
> cache eviction, we can reduce the stalls for
> - load SSE token exhaustion
> - Shuffle and Load dependency
>
> Build test using meson script:
> ``````````````````````````````
> build-gcc-static
> buildtools
> build-gcc-shared
> build-mini
> build-clang-static
> build-clang-shared
> build-x86-generic
>
> Test Results:
> `````````````
>
> Platform-1: AMD EPYC SIENA 8594P @2.3GHz, no boost
> Platform-2: AMD EPYC 9554 @3.1GHz, no boost
>
> NIC:
> 1) mellanox CX-7 1*200Gbps
> 2) intel E810 1*100Gbps
> 3) intel E810 2*200Gbps (2CQ-DA2) - loopback
> 4) braodcom P2100 2*100Gbps - loopback
>
> ------------------------------------------------
> TEST IO 64B: baseline <NIC : MPPs>
> - NIC-1: 42.0
> - NIC-2: 82.0
> - NIC-3: 82.45
> - NIC-3: 47.03
> ------------------------------------------------
> TEST MACSWAP 64B: <NIC : Before : After>
> - NIC-1: 31.533 : 31.90
> - NIC-2: 48.0 : 48.9
> - NIC-3: 48.840 : 49.827
> - NIC-4: 44.3 : 45.5
> ------------------------------------------------
> TEST MACSWAP 128B: <NIC : Before: After>
> - NIC-1: 30.946 : 31.770
> - NIC-2: 47.4 : 48.3
> - NIC-3: 47.979 : 48.503
> - NIC-4: 41.53 : 44.59
> ------------------------------------------------
> TEST MACSWAP 256B: <NIC: Before: After>
> - NIC-1: 32.480 : 33.150
> - NIC-2: 45.29 : 45.571
> - NIC-3: 45.033 : 45.117
> - NIC-4: 36.49 : 37.5
> ------------------------------------------------
>
>
> ------------------------------------------------
> TEST IO 64B: baseline <NIC : MPPs>
> - intel E810 2*200Gbps (2CQ-DA2): 82.49
> ------------------------------------------------
> <NIC intel E810 2*200Gbps (2CQ-DA2): Before : After>
> TEST MACSWAP: 1Q 1C1T
> 64B: : 45.0 : 45.54
> 128B: : 44.48 : 44.43
> 256B: : 42.0 : 41.99
> +++++++++++++++++++++++++
> TEST MACSWAP: 2Q 2C2T
> 64B: : 59.5 : 60.55
> 128B: : 56.78 : 58.1
> 256B: : 41.85 : 41.99
> ------------------------------------------------
>
> Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
>
> Vipin Varghese (3):
> app/testpmd: add register keyword
> app/testpmd: move offload update
> app/testpmd: interleave SSE SIMD
>
For series,
Acked-by: Ferruh Yigit <ferruh.yigit@amd.com>
Bruce, if your testing is not aligned with the presented results please
let us know, otherwise I am planning to get the patch for -rc1.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 0/3] app/testpmd: improve sse based macswap
2024-10-04 5:08 ` [PATCH v2 0/3] app/testpmd: improve sse based macswap Ferruh Yigit
@ 2024-10-08 1:12 ` Ferruh Yigit
0 siblings, 0 replies; 18+ messages in thread
From: Ferruh Yigit @ 2024-10-08 1:12 UTC (permalink / raw)
To: Vipin Varghese, bruce.richardson, konstantin.v.ananyev,
aman.deep.singh, dev
On 10/4/2024 6:08 AM, Ferruh Yigit wrote:
> On 8/21/2024 3:38 PM, Vipin Varghese wrote:
>> Goal of the patch series is to improve SSE macswap on x86_64 by
>> reducing the stalls in backend engine. Original implementation of
>> the SSE-mac-swap makes loop call to multiple load, shuffle & store.
>>
>> Using SIMD ISA interleaving, register variable and reducing L1 & L2
>> cache eviction, we can reduce the stalls for
>> - load SSE token exhaustion
>> - Shuffle and Load dependency
>>
>> Build test using meson script:
>> ``````````````````````````````
>> build-gcc-static
>> buildtools
>> build-gcc-shared
>> build-mini
>> build-clang-static
>> build-clang-shared
>> build-x86-generic
>>
>> Test Results:
>> `````````````
>>
>> Platform-1: AMD EPYC SIENA 8594P @2.3GHz, no boost
>> Platform-2: AMD EPYC 9554 @3.1GHz, no boost
>>
>> NIC:
>> 1) mellanox CX-7 1*200Gbps
>> 2) intel E810 1*100Gbps
>> 3) intel E810 2*200Gbps (2CQ-DA2) - loopback
>> 4) braodcom P2100 2*100Gbps - loopback
>>
>> ------------------------------------------------
>> TEST IO 64B: baseline <NIC : MPPs>
>> - NIC-1: 42.0
>> - NIC-2: 82.0
>> - NIC-3: 82.45
>> - NIC-3: 47.03
>> ------------------------------------------------
>> TEST MACSWAP 64B: <NIC : Before : After>
>> - NIC-1: 31.533 : 31.90
>> - NIC-2: 48.0 : 48.9
>> - NIC-3: 48.840 : 49.827
>> - NIC-4: 44.3 : 45.5
>> ------------------------------------------------
>> TEST MACSWAP 128B: <NIC : Before: After>
>> - NIC-1: 30.946 : 31.770
>> - NIC-2: 47.4 : 48.3
>> - NIC-3: 47.979 : 48.503
>> - NIC-4: 41.53 : 44.59
>> ------------------------------------------------
>> TEST MACSWAP 256B: <NIC: Before: After>
>> - NIC-1: 32.480 : 33.150
>> - NIC-2: 45.29 : 45.571
>> - NIC-3: 45.033 : 45.117
>> - NIC-4: 36.49 : 37.5
>> ------------------------------------------------
>>
>>
>> ------------------------------------------------
>> TEST IO 64B: baseline <NIC : MPPs>
>> - intel E810 2*200Gbps (2CQ-DA2): 82.49
>> ------------------------------------------------
>> <NIC intel E810 2*200Gbps (2CQ-DA2): Before : After>
>> TEST MACSWAP: 1Q 1C1T
>> 64B: : 45.0 : 45.54
>> 128B: : 44.48 : 44.43
>> 256B: : 42.0 : 41.99
>> +++++++++++++++++++++++++
>> TEST MACSWAP: 2Q 2C2T
>> 64B: : 59.5 : 60.55
>> 128B: : 56.78 : 58.1
>> 256B: : 41.85 : 41.99
>> ------------------------------------------------
>>
>> Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
>>
>> Vipin Varghese (3):
>> app/testpmd: add register keyword
>> app/testpmd: move offload update
>> app/testpmd: interleave SSE SIMD
>>
>
> For series,
> Acked-by: Ferruh Yigit <ferruh.yigit@amd.com>
>
>
> Bruce, if your testing is not aligned with the presented results please
> let us know, otherwise I am planning to get the patch for -rc1.
>
Series applied to dpdk-next-net/main, thanks.
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2024-10-08 1:12 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-16 6:37 [PATCH] app/testpmd: improve sse based macswap Vipin Varghese
2024-07-23 16:45 ` Ferruh Yigit
2024-07-23 17:12 ` Bruce Richardson
2024-07-25 12:47 ` Varghese, Vipin
2024-07-25 12:52 ` Bruce Richardson
2024-08-21 14:38 ` [PATCH v2 0/3] " Vipin Varghese
2024-08-21 14:38 ` [PATCH v2 1/3] app/testpmd: add register keyword Vipin Varghese
2024-08-21 14:55 ` Stephen Hemminger
2024-08-27 15:32 ` Varghese, Vipin
2024-08-27 17:39 ` Stephen Hemminger
2024-08-29 8:14 ` Varghese, Vipin
2024-09-03 11:52 ` Konstantin Ananyev
2024-09-06 13:02 ` Varghese, Vipin
2024-10-04 5:04 ` Ferruh Yigit
2024-08-21 14:38 ` [PATCH v2 2/3] app/testpmd: move offload update Vipin Varghese
2024-08-21 14:38 ` [PATCH v2 3/3] app/testpmd: interleave SSE SIMD Vipin Varghese
2024-10-04 5:08 ` [PATCH v2 0/3] app/testpmd: improve sse based macswap Ferruh Yigit
2024-10-08 1:12 ` Ferruh Yigit
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).