DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics
@ 2015-11-08 19:39 harish.patil
  2015-11-13 10:35 ` Ananyev, Konstantin
  0 siblings, 1 reply; 4+ messages in thread
From: harish.patil @ 2015-11-08 19:39 UTC (permalink / raw)
  To: dev

From: Harish Patil <harish.patil@qlogic.com>

l3fwd app expects PMDs to return packets whose L2 header is
16-byte aligned due to usage of _mm_load_si128()/_mm_store_si128()
intrinsics in the app. However, most of the protocol stacks expects
packets such that its IP/L3 header be aligned on a 16-byte boundary.

Based on the recommendations received on dpdk-dev, we are changing
the l3fwd app to use _mm_loadu_si128()/_mm_loadu_si128() so that the
address need not be 16-byte aligned and thereby preventing crash.
We have tested that there is no performance impact due to this
change.

Signed-off-by: Harish Patil <harish.patil@qlogic.com>
---
 examples/l3fwd/main.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
index 1f3e5c6..4b8b754 100644
--- a/examples/l3fwd/main.c
+++ b/examples/l3fwd/main.c
@@ -1220,14 +1220,14 @@ process_packet(struct lcore_conf *qconf, struct rte_mbuf *pkt,
 	dst_ipv4 = rte_be_to_cpu_32(dst_ipv4);
 	dp = get_dst_port(qconf, pkt, dst_ipv4, portid);
 
-	te = _mm_load_si128((__m128i *)eth_hdr);
+	te = _mm_loadu_si128((__m128i *)eth_hdr);
 	ve = val_eth[dp];
 
 	dst_port[0] = dp;
 	rfc1812_process(ipv4_hdr, dst_port, pkt->packet_type);
 
 	te =  _mm_blend_epi16(te, ve, MASK_ETH);
-	_mm_store_si128((__m128i *)eth_hdr, te);
+	_mm_storeu_si128((__m128i *)eth_hdr, te);
 }
 
 /*
@@ -1313,16 +1313,16 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
 	p[3] = rte_pktmbuf_mtod(pkt[3], __m128i *);
 
 	ve[0] = val_eth[dst_port[0]];
-	te[0] = _mm_load_si128(p[0]);
+	te[0] = _mm_loadu_si128(p[0]);
 
 	ve[1] = val_eth[dst_port[1]];
-	te[1] = _mm_load_si128(p[1]);
+	te[1] = _mm_loadu_si128(p[1]);
 
 	ve[2] = val_eth[dst_port[2]];
-	te[2] = _mm_load_si128(p[2]);
+	te[2] = _mm_loadu_si128(p[2]);
 
 	ve[3] = val_eth[dst_port[3]];
-	te[3] = _mm_load_si128(p[3]);
+	te[3] = _mm_loadu_si128(p[3]);
 
 	/* Update first 12 bytes, keep rest bytes intact. */
 	te[0] =  _mm_blend_epi16(te[0], ve[0], MASK_ETH);
@@ -1330,10 +1330,10 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
 	te[2] =  _mm_blend_epi16(te[2], ve[2], MASK_ETH);
 	te[3] =  _mm_blend_epi16(te[3], ve[3], MASK_ETH);
 
-	_mm_store_si128(p[0], te[0]);
-	_mm_store_si128(p[1], te[1]);
-	_mm_store_si128(p[2], te[2]);
-	_mm_store_si128(p[3], te[3]);
+	_mm_storeu_si128(p[0], te[0]);
+	_mm_storeu_si128(p[1], te[1]);
+	_mm_storeu_si128(p[2], te[2]);
+	_mm_storeu_si128(p[3], te[3]);
 
 	rfc1812_process((struct ipv4_hdr *)((struct ether_hdr *)p[0] + 1),
 		&dst_port[0], pkt[0]->packet_type);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics
  2015-11-08 19:39 [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics harish.patil
@ 2015-11-13 10:35 ` Ananyev, Konstantin
  2015-11-16 18:16   ` Harish Patil
  2015-12-07  2:16   ` Thomas Monjalon
  0 siblings, 2 replies; 4+ messages in thread
From: Ananyev, Konstantin @ 2015-11-13 10:35 UTC (permalink / raw)
  To: harish.patil, dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of harish.patil@qlogic.com
> Sent: Sunday, November 08, 2015 7:40 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics
> 
> From: Harish Patil <harish.patil@qlogic.com>
> 
> l3fwd app expects PMDs to return packets whose L2 header is
> 16-byte aligned due to usage of _mm_load_si128()/_mm_store_si128()
> intrinsics in the app. However, most of the protocol stacks expects
> packets such that its IP/L3 header be aligned on a 16-byte boundary.
> 
> Based on the recommendations received on dpdk-dev, we are changing
> the l3fwd app to use _mm_loadu_si128()/_mm_loadu_si128() so that the
> address need not be 16-byte aligned and thereby preventing crash.
> We have tested that there is no performance impact due to this
> change.
> 
> Signed-off-by: Harish Patil <harish.patil@qlogic.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

As a side notice:
In fact with gcc build I do see a slight regression: ~1%
for 4 ports over 1 core test-case.
Though I think the problem is not in the patch itself.
By some, unknown to me reason, gcc treats aligned and unaligned load/store
instrincts in a different way (at least for that particular case).
With aligned load/store in use it generates code that is pretty close to the source:
 4 loads first, then 4 BLENDs, then 4  stores  (with some interfering scalar instructions of course).
But with unaligned ones  gcc starts to mix loads and blends for the same register, so now it is:
load x0; blend x0; load x1; blend x1; .. 
As if the source code was:

te[0] = _mm_loadu_si128(p[0]);
te[0] =  _mm_blend_epi16(te[0], ve[0], MASK_ETH);
te[1] = _mm_loadu_si128(p[1]);
te[1] =  _mm_blend_epi16(te[1], ve[1], MASK_ETH);  
...

So load latency is not hidden any more.
I tried it with different versions of  - same story for all of them.
Clang doesn't have such issue and generates similar code for both
aligned and unaligned instrincts. 

The only way to fix it I can think about  -  put rte_compiler_barrier() just before the first blend instinct.
That helped, now there are no noticeable differences in generated code and results before and after the patch.
 So I suppose, I'll have to submit a patch after yours one to fix that problem.
Konstantin

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics
  2015-11-13 10:35 ` Ananyev, Konstantin
@ 2015-11-16 18:16   ` Harish Patil
  2015-12-07  2:16   ` Thomas Monjalon
  1 sibling, 0 replies; 4+ messages in thread
From: Harish Patil @ 2015-11-16 18:16 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev

>

>
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of
>>harish.patil@qlogic.com
>> Sent: Sunday, November 08, 2015 7:40 PM
>> To: dev@dpdk.org
>> Subject: [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned
>>load/store intrinsics
>>
>> From: Harish Patil <harish.patil@qlogic.com>
>>
>> l3fwd app expects PMDs to return packets whose L2 header is
>> 16-byte aligned due to usage of _mm_load_si128()/_mm_store_si128()
>> intrinsics in the app. However, most of the protocol stacks expects
>> packets such that its IP/L3 header be aligned on a 16-byte boundary.
>>
>> Based on the recommendations received on dpdk-dev, we are changing
>> the l3fwd app to use _mm_loadu_si128()/_mm_loadu_si128() so that the
>> address need not be 16-byte aligned and thereby preventing crash.
>> We have tested that there is no performance impact due to this
>> change.
>>
>> Signed-off-by: Harish Patil <harish.patil@qlogic.com>
>> ---
>
>Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>
>As a side notice:
>In fact with gcc build I do see a slight regression: ~1%
>for 4 ports over 1 core test-case.
>Though I think the problem is not in the patch itself.
>By some, unknown to me reason, gcc treats aligned and unaligned load/store
>instrincts in a different way (at least for that particular case).
>With aligned load/store in use it generates code that is pretty close to
>the source:
> 4 loads first, then 4 BLENDs, then 4  stores  (with some interfering
>scalar instructions of course).
>But with unaligned ones  gcc starts to mix loads and blends for the same
>register, so now it is:
>load x0; blend x0; load x1; blend x1; ..
>As if the source code was:
>
>te[0] = _mm_loadu_si128(p[0]);
>te[0] =  _mm_blend_epi16(te[0], ve[0], MASK_ETH);
>te[1] = _mm_loadu_si128(p[1]);
>te[1] =  _mm_blend_epi16(te[1], ve[1], MASK_ETH);
>...
>
>So load latency is not hidden any more.
>I tried it with different versions of  - same story for all of them.
>Clang doesn't have such issue and generates similar code for both
>aligned and unaligned instrincts.
>
>The only way to fix it I can think about  -  put rte_compiler_barrier()
>just before the first blend instinct.
>That helped, now there are no noticeable differences in generated code
>and results before and after the patch.
> So I suppose, I'll have to submit a patch after yours one to fix that
>problem.
>Konstantin
>
>

Sure, thanks for verifying and providing fix.

Harish


________________________________

This message and any attached documents contain information from the sending company or its parent company(s), subsidiaries, divisions or branch offices that may be confidential. If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics
  2015-11-13 10:35 ` Ananyev, Konstantin
  2015-11-16 18:16   ` Harish Patil
@ 2015-12-07  2:16   ` Thomas Monjalon
  1 sibling, 0 replies; 4+ messages in thread
From: Thomas Monjalon @ 2015-12-07  2:16 UTC (permalink / raw)
  To: harish.patil; +Cc: dev

> > l3fwd app expects PMDs to return packets whose L2 header is
> > 16-byte aligned due to usage of _mm_load_si128()/_mm_store_si128()
> > intrinsics in the app. However, most of the protocol stacks expects
> > packets such that its IP/L3 header be aligned on a 16-byte boundary.
> > 
> > Based on the recommendations received on dpdk-dev, we are changing
> > the l3fwd app to use _mm_loadu_si128()/_mm_loadu_si128() so that the
> > address need not be 16-byte aligned and thereby preventing crash.
> > We have tested that there is no performance impact due to this
> > change.
> > 
> > Signed-off-by: Harish Patil <harish.patil@qlogic.com>
> 
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

Applied, thanks

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2015-12-07  2:17 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-08 19:39 [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics harish.patil
2015-11-13 10:35 ` Ananyev, Konstantin
2015-11-16 18:16   ` Harish Patil
2015-12-07  2:16   ` Thomas Monjalon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).