DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
@ 2016-12-19  6:09 Jianbo Liu
  2016-12-19  6:09 ` [dpdk-dev] [PATCH 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Jianbo Liu @ 2016-12-19  6:09 UTC (permalink / raw)
  To: dev, helin.zhang, konstantin.ananyev, jerin.jacob; +Cc: Jianbo Liu

To get better performance, Rx bulk alloc recv function will scan 8 descriptors
in one time, but the statuses are not consistent on ARM platform because
the memory allocated for Rx descriptors is cacheable hugepages.
This patch is to calculate the number of received packets by scanning DD bit
sequentially, and stops when meeting the first packet with DD bit unset.

Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
---
 drivers/net/ixgbe/ixgbe_rxtx.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index b2d9f45..2866bdb 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1402,17 +1402,21 @@ ixgbe_rx_scan_hw_ring(struct ixgbe_rx_queue *rxq)
 	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
 	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
 		/* Read desc statuses backwards to avoid race condition */
-		for (j = LOOK_AHEAD-1; j >= 0; --j)
+		for (j = LOOK_AHEAD - 1; j >= 0; --j) {
 			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
-
-		for (j = LOOK_AHEAD - 1; j >= 0; --j)
 			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
 						       lo_dword.data);
+		}
+
+		rte_smp_rmb();
 
 		/* Compute how many status bits were set */
 		nb_dd = 0;
 		for (j = 0; j < LOOK_AHEAD; ++j)
-			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
+			if (s[j] & IXGBE_RXDADV_STAT_DD)
+				++nb_dd;
+			else
+				break;
 
 		nb_rx += nb_dd;
 
-- 
2.4.11

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [dpdk-dev] [PATCH 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD
  2016-12-19  6:09 [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Jianbo Liu
@ 2016-12-19  6:09 ` Jianbo Liu
  2016-12-21 10:08   ` Jerin Jacob
  2017-02-01 16:19 ` [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ananyev, Konstantin
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Jianbo Liu @ 2016-12-19  6:09 UTC (permalink / raw)
  To: dev, helin.zhang, konstantin.ananyev, jerin.jacob; +Cc: Jianbo Liu

vPMD will check 4 descriptors in one time, but the statuses are not consistent
because the memory allocated for RX descriptors is cacheable huagepage.
This patch is to calculate the number of received packets by scanning DD bit
sequentially, and stops when meeting the first packet with DD bit unset.

Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
---
 drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
index f96cc85..0b1338d 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
@@ -196,7 +196,6 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 	struct ixgbe_rx_entry *sw_ring;
 	uint16_t nb_pkts_recd;
 	int pos;
-	uint64_t var;
 	uint8x16_t shuf_msk = {
 		0xFF, 0xFF,
 		0xFF, 0xFF,  /* skip 32 bits pkt_type */
@@ -255,6 +254,7 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		uint64x2_t mbp1, mbp2;
 		uint8x16_t staterr;
 		uint16x8_t tmp;
+		uint32_t var = 0;
 		uint32_t stat;
 
 		/* B.1 load 1 mbuf point */
@@ -349,11 +349,19 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		vst1q_u8((uint8_t *)&rx_pkts[pos]->rx_descriptor_fields1,
 			 pkt_mb1);
 
+		stat &= IXGBE_VPMD_DESC_DD_MASK;
+
 		/* C.4 calc avaialbe number of desc */
-		var =  __builtin_popcount(stat & IXGBE_VPMD_DESC_DD_MASK);
-		nb_pkts_recd += var;
-		if (likely(var != RTE_IXGBE_DESCS_PER_LOOP))
+		if (likely(var != IXGBE_VPMD_DESC_DD_MASK)) {
+			while (stat & 0x01) {
+				++var;
+				stat = stat >> 8;
+			}
+			nb_pkts_recd += var;
 			break;
+		} else {
+			nb_pkts_recd += RTE_IXGBE_DESCS_PER_LOOP;
+		}
 	}
 
 	/* Update our internal tail pointer */
-- 
2.4.11

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD
  2016-12-19  6:09 ` [dpdk-dev] [PATCH 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
@ 2016-12-21 10:08   ` Jerin Jacob
  2016-12-21 11:03     ` Bruce Richardson
  2016-12-22  1:05     ` Jianbo Liu
  0 siblings, 2 replies; 25+ messages in thread
From: Jerin Jacob @ 2016-12-21 10:08 UTC (permalink / raw)
  To: Jianbo Liu; +Cc: dev, helin.zhang, konstantin.ananyev

On Mon, Dec 19, 2016 at 11:39:18AM +0530, Jianbo Liu wrote:

Hi Jianbo,

> vPMD will check 4 descriptors in one time, but the statuses are not consistent
> because the memory allocated for RX descriptors is cacheable huagepage.
Is it different in X86 case ?i.e Is x86 creating non cacheable hugepages?
I am just looking at what it takes to fix similar issues for all drivers wrt armv8.

Are you able to reproduce this issue any armv8 platform. If so, could
you please the platform detail and commands to reproduce this issue?

> This patch is to calculate the number of received packets by scanning DD bit
> sequentially, and stops when meeting the first packet with DD bit unset.
> 
> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> ---
>  drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
> index f96cc85..0b1338d 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
> @@ -196,7 +196,6 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>  	struct ixgbe_rx_entry *sw_ring;
>  	uint16_t nb_pkts_recd;
>  	int pos;
> -	uint64_t var;
>  	uint8x16_t shuf_msk = {
>  		0xFF, 0xFF,
>  		0xFF, 0xFF,  /* skip 32 bits pkt_type */
> @@ -255,6 +254,7 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>  		uint64x2_t mbp1, mbp2;
>  		uint8x16_t staterr;
>  		uint16x8_t tmp;
> +		uint32_t var = 0;
>  		uint32_t stat;
>  
>  		/* B.1 load 1 mbuf point */
> @@ -349,11 +349,19 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>  		vst1q_u8((uint8_t *)&rx_pkts[pos]->rx_descriptor_fields1,
>  			 pkt_mb1);
>  
> +		stat &= IXGBE_VPMD_DESC_DD_MASK;
> +
>  		/* C.4 calc avaialbe number of desc */
> -		var =  __builtin_popcount(stat & IXGBE_VPMD_DESC_DD_MASK);
> -		nb_pkts_recd += var;
> -		if (likely(var != RTE_IXGBE_DESCS_PER_LOOP))
> +		if (likely(var != IXGBE_VPMD_DESC_DD_MASK)) {
> +			while (stat & 0x01) {
> +				++var;
> +				stat = stat >> 8;
> +			}
> +			nb_pkts_recd += var;
>  			break;
> +		} else {
> +			nb_pkts_recd += RTE_IXGBE_DESCS_PER_LOOP;
> +		}
>  	}
>  
>  	/* Update our internal tail pointer */
> -- 
> 2.4.11
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD
  2016-12-21 10:08   ` Jerin Jacob
@ 2016-12-21 11:03     ` Bruce Richardson
  2016-12-22  1:18       ` Jianbo Liu
  2016-12-22  1:05     ` Jianbo Liu
  1 sibling, 1 reply; 25+ messages in thread
From: Bruce Richardson @ 2016-12-21 11:03 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Jianbo Liu, dev, helin.zhang, konstantin.ananyev

On Wed, Dec 21, 2016 at 03:38:51PM +0530, Jerin Jacob wrote:
> On Mon, Dec 19, 2016 at 11:39:18AM +0530, Jianbo Liu wrote:
> 
> Hi Jianbo,
> 
> > vPMD will check 4 descriptors in one time, but the statuses are not consistent
> > because the memory allocated for RX descriptors is cacheable huagepage.
> Is it different in X86 case ?i.e Is x86 creating non cacheable hugepages?

This is not a problem on IA, because the instruction ordering rules on
IA guarantee that the reads will be done in the correct program order,
and we never get stale cache data.

/Bruce

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD
  2016-12-21 10:08   ` Jerin Jacob
  2016-12-21 11:03     ` Bruce Richardson
@ 2016-12-22  1:05     ` Jianbo Liu
  1 sibling, 0 replies; 25+ messages in thread
From: Jianbo Liu @ 2016-12-22  1:05 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, Zhang, Helin, Ananyev, Konstantin

Hi Jerin,

On 21 December 2016 at 18:08, Jerin Jacob
<jerin.jacob@caviumnetworks.com> wrote:
> On Mon, Dec 19, 2016 at 11:39:18AM +0530, Jianbo Liu wrote:
>
> Hi Jianbo,
>
>> vPMD will check 4 descriptors in one time, but the statuses are not consistent
>> because the memory allocated for RX descriptors is cacheable huagepage.
> Is it different in X86 case ?i.e Is x86 creating non cacheable hugepages?
> I am just looking at what it takes to fix similar issues for all drivers wrt armv8.
>
> Are you able to reproduce this issue any armv8 platform. If so, could
> you please the platform detail and commands to reproduce this issue?
>

I have tested on Huawei D03 and Softiron with Intel X540, same issue
for both of them.
The setup is very simple: loopback 2 ports, then run testpmd.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD
  2016-12-21 11:03     ` Bruce Richardson
@ 2016-12-22  1:18       ` Jianbo Liu
  0 siblings, 0 replies; 25+ messages in thread
From: Jianbo Liu @ 2016-12-22  1:18 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: Jerin Jacob, dev, Zhang, Helin, Ananyev, Konstantin

On 21 December 2016 at 19:03, Bruce Richardson
<bruce.richardson@intel.com> wrote:
> On Wed, Dec 21, 2016 at 03:38:51PM +0530, Jerin Jacob wrote:
>> On Mon, Dec 19, 2016 at 11:39:18AM +0530, Jianbo Liu wrote:
>>
>> Hi Jianbo,
>>
>> > vPMD will check 4 descriptors in one time, but the statuses are not consistent
>> > because the memory allocated for RX descriptors is cacheable huagepage.
>> Is it different in X86 case ?i.e Is x86 creating non cacheable hugepages?
>
> This is not a problem on IA, because the instruction ordering rules on
> IA guarantee that the reads will be done in the correct program order,
> and we never get stale cache data.
>

Yes, I think it's an issue for ARM arch.
It's because more than one cacheline-sized data (4/8 descriptors can
be in two cachelines) will be read at one time in bulk alloc RX or
vPMD.
There is the same issue for i40e, I'll send the same patch later.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2016-12-19  6:09 [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Jianbo Liu
  2016-12-19  6:09 ` [dpdk-dev] [PATCH 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
@ 2017-02-01 16:19 ` Ananyev, Konstantin
  2017-02-03  6:22   ` Jianbo Liu
  2017-02-04  9:37 ` [dpdk-dev] [PATCH v2 " Jianbo Liu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Ananyev, Konstantin @ 2017-02-01 16:19 UTC (permalink / raw)
  To: Jianbo Liu, dev, Zhang, Helin, jerin.jacob

Hi,

> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Monday, December 19, 2016 6:09 AM
> To: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> jerin.jacob@caviumnetworks.com
> Cc: Jianbo Liu <jianbo.liu@linaro.org>
> Subject: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
> 
> To get better performance, Rx bulk alloc recv function will scan 8 descriptors
> in one time, but the statuses are not consistent on ARM platform because
> the memory allocated for Rx descriptors is cacheable hugepages.
> This patch is to calculate the number of received packets by scanning DD bit
> sequentially, and stops when meeting the first packet with DD bit unset.
> 
> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> ---
>  drivers/net/ixgbe/ixgbe_rxtx.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index b2d9f45..2866bdb 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -1402,17 +1402,21 @@ ixgbe_rx_scan_hw_ring(struct ixgbe_rx_queue *rxq)
>  	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
>  	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
>  		/* Read desc statuses backwards to avoid race condition */
> -		for (j = LOOK_AHEAD-1; j >= 0; --j)
> +		for (j = LOOK_AHEAD - 1; j >= 0; --j) {
>  			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
> -
> -		for (j = LOOK_AHEAD - 1; j >= 0; --j)
>  			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>  						       lo_dword.data);
> +		}
> +
> +		rte_smp_rmb();

If reads can be reordered, shouldn't we fill pkt_info[] after smp_rmb() here?
As another nit - with rmb() in and because you are looking the first gap in s[] now,
no need to read TXDs in backward order.
How it looks to me (as a suggestion):

for (j = 0; j != LOOK_AHEAD; j++)
	s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);

rte_smp_rmb();

for (j = 0; j < LOOK_AHEAD && (s[j] & IXGBE_RXDADV_STAT_DD) != 0; j++)
	;

for (j = 0; j < nb_dd; ++j) {
	pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.lo_dword.data);
               ....

Konstantin


> 
>  		/* Compute how many status bits were set */
>  		nb_dd = 0;
>  		for (j = 0; j < LOOK_AHEAD; ++j)
> -			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
> +			if (s[j] & IXGBE_RXDADV_STAT_DD)
> +				++nb_dd;
> +			else
> +				break;
> 
>  		nb_rx += nb_dd;
> 
> --
> 2.4.11

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2017-02-01 16:19 ` [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ananyev, Konstantin
@ 2017-02-03  6:22   ` Jianbo Liu
  2017-02-03 11:38     ` Ananyev, Konstantin
  0 siblings, 1 reply; 25+ messages in thread
From: Jianbo Liu @ 2017-02-03  6:22 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Zhang, Helin, jerin.jacob

On 2 February 2017 at 00:19, Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
> Hi,
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Monday, December 19, 2016 6:09 AM
>> To: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
>> jerin.jacob@caviumnetworks.com
>> Cc: Jianbo Liu <jianbo.liu@linaro.org>
>> Subject: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
>>
>> To get better performance, Rx bulk alloc recv function will scan 8 descriptors
>> in one time, but the statuses are not consistent on ARM platform because
>> the memory allocated for Rx descriptors is cacheable hugepages.
>> This patch is to calculate the number of received packets by scanning DD bit
>> sequentially, and stops when meeting the first packet with DD bit unset.
>>
>> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
>> ---
>>  drivers/net/ixgbe/ixgbe_rxtx.c | 12 ++++++++----
>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index b2d9f45..2866bdb 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -1402,17 +1402,21 @@ ixgbe_rx_scan_hw_ring(struct ixgbe_rx_queue *rxq)
>>       for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
>>            i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
>>               /* Read desc statuses backwards to avoid race condition */
>> -             for (j = LOOK_AHEAD-1; j >= 0; --j)
>> +             for (j = LOOK_AHEAD - 1; j >= 0; --j) {
>>                       s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
>> -
>> -             for (j = LOOK_AHEAD - 1; j >= 0; --j)
>>                       pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>>                                                      lo_dword.data);
>> +             }
>> +
>> +             rte_smp_rmb();
>
> If reads can be reordered, shouldn't we fill pkt_info[] after smp_rmb() here?

The barrier is to forbid the reordering from the following readings,
which will count the number of actual received packets.
And as wb.uper and wb.lower of one descriptor are in the same
cacheline, could it be better to read them at the same time?.

> As another nit - with rmb() in and because you are looking the first gap in s[] now,
> no need to read TXDs in backward order.

Reading backward is just to keep as it is for x86 platform.

> How it looks to me (as a suggestion):
>
> for (j = 0; j != LOOK_AHEAD; j++)
>         s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
>
> rte_smp_rmb();
>
> for (j = 0; j < LOOK_AHEAD && (s[j] & IXGBE_RXDADV_STAT_DD) != 0; j++)
>         ;
>
> for (j = 0; j < nb_dd; ++j) {
>         pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.lo_dword.data);
>                ....
>
> Konstantin
>
>
>>
>>               /* Compute how many status bits were set */
>>               nb_dd = 0;
>>               for (j = 0; j < LOOK_AHEAD; ++j)
>> -                     nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
>> +                     if (s[j] & IXGBE_RXDADV_STAT_DD)
>> +                             ++nb_dd;
>> +                     else
>> +                             break;
>>
>>               nb_rx += nb_dd;
>>
>> --
>> 2.4.11
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2017-02-03  6:22   ` Jianbo Liu
@ 2017-02-03 11:38     ` Ananyev, Konstantin
  2017-02-04  3:37       ` Jianbo Liu
  0 siblings, 1 reply; 25+ messages in thread
From: Ananyev, Konstantin @ 2017-02-03 11:38 UTC (permalink / raw)
  To: Jianbo Liu; +Cc: dev, Zhang, Helin, jerin.jacob



> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Friday, February 3, 2017 6:22 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; jerin.jacob@caviumnetworks.com
> Subject: Re: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
> 
> On 2 February 2017 at 00:19, Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> > Hi,
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> >> Sent: Monday, December 19, 2016 6:09 AM
> >> To: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> >> jerin.jacob@caviumnetworks.com
> >> Cc: Jianbo Liu <jianbo.liu@linaro.org>
> >> Subject: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
> >>
> >> To get better performance, Rx bulk alloc recv function will scan 8 descriptors
> >> in one time, but the statuses are not consistent on ARM platform because
> >> the memory allocated for Rx descriptors is cacheable hugepages.
> >> This patch is to calculate the number of received packets by scanning DD bit
> >> sequentially, and stops when meeting the first packet with DD bit unset.
> >>
> >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> >> ---
> >>  drivers/net/ixgbe/ixgbe_rxtx.c | 12 ++++++++----
> >>  1 file changed, 8 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> index b2d9f45..2866bdb 100644
> >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> @@ -1402,17 +1402,21 @@ ixgbe_rx_scan_hw_ring(struct ixgbe_rx_queue *rxq)
> >>       for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
> >>            i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
> >>               /* Read desc statuses backwards to avoid race condition */
> >> -             for (j = LOOK_AHEAD-1; j >= 0; --j)
> >> +             for (j = LOOK_AHEAD - 1; j >= 0; --j) {
> >>                       s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
> >> -
> >> -             for (j = LOOK_AHEAD - 1; j >= 0; --j)
> >>                       pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
> >>                                                      lo_dword.data);
> >> +             }
> >> +
> >> +             rte_smp_rmb();
> >
> > If reads can be reordered, shouldn't we fill pkt_info[] after smp_rmb() here?
> 
> The barrier is to forbid the reordering from the following readings,
> which will count the number of actual received packets.

What I meant is that if you'll keep reading from both rxdp[].wb.lower and rxdp[].wb.upper
before rmb, then nothing would prevent cpu from reorder these reads in any way it likes
(if we are talking about cpus with read reordering allowed), right?
So it can end up with the following order:

rxdp[N].wb.lower
rxdp[N].wb.upper

or even:

rxdp[N-1].wb.lower
rxdp[N].wb.lower
rxdp[N-1].wb.upper
rxdp[N].wb.upper

In such cases pkt_info[] may contain invalid data.

> And as wb.uper and wb.lower of one descriptor are in the same
> cacheline, could it be better to read them at the same time?.

It could be, but I think for the sake of data integrity we have to make sure that 
cpu would never read any other RXD field before wb.upper. status_error, see above.

BTW, the following code might re-read both wb.upper and wb.lower anyway.
So I don't think you'll save many cycles here anyway. 

> 
> > As another nit - with rmb() in and because you are looking the first gap in s[] now,
> > no need to read TXDs in backward order.
> 
> Reading backward is just to keep as it is for x86 platform.

With the change you introducing, I don't think it is necessary any more.

Konstantin

> 
> > How it looks to me (as a suggestion):
> >
> > for (j = 0; j != LOOK_AHEAD; j++)
> >         s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
> >
> > rte_smp_rmb();
> >
> > for (j = 0; j < LOOK_AHEAD && (s[j] & IXGBE_RXDADV_STAT_DD) != 0; j++)
> >         ;
> >
> > for (j = 0; j < nb_dd; ++j) {
> >         pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.lo_dword.data);
> >                ....
> >
> > Konstantin
> >
> >
> >>
> >>               /* Compute how many status bits were set */
> >>               nb_dd = 0;
> >>               for (j = 0; j < LOOK_AHEAD; ++j)
> >> -                     nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
> >> +                     if (s[j] & IXGBE_RXDADV_STAT_DD)
> >> +                             ++nb_dd;
> >> +                     else
> >> +                             break;
> >>
> >>               nb_rx += nb_dd;
> >>
> >> --
> >> 2.4.11
> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2017-02-03 11:38     ` Ananyev, Konstantin
@ 2017-02-04  3:37       ` Jianbo Liu
  0 siblings, 0 replies; 25+ messages in thread
From: Jianbo Liu @ 2017-02-04  3:37 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Zhang, Helin, jerin.jacob

On 3 February 2017 at 19:38, Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Friday, February 3, 2017 6:22 AM
>> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
>> Cc: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; jerin.jacob@caviumnetworks.com
>> Subject: Re: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
>>
>> On 2 February 2017 at 00:19, Ananyev, Konstantin
>> <konstantin.ananyev@intel.com> wrote:
>> > Hi,
>> >
>> >> -----Original Message-----
>> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> >> Sent: Monday, December 19, 2016 6:09 AM
>> >> To: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
>> >> jerin.jacob@caviumnetworks.com
>> >> Cc: Jianbo Liu <jianbo.liu@linaro.org>
>> >> Subject: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
>> >>
>> >> To get better performance, Rx bulk alloc recv function will scan 8 descriptors
>> >> in one time, but the statuses are not consistent on ARM platform because
>> >> the memory allocated for Rx descriptors is cacheable hugepages.
>> >> This patch is to calculate the number of received packets by scanning DD bit
>> >> sequentially, and stops when meeting the first packet with DD bit unset.
>> >>
>> >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
>> >> ---
>> >>  drivers/net/ixgbe/ixgbe_rxtx.c | 12 ++++++++----
>> >>  1 file changed, 8 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> >> index b2d9f45..2866bdb 100644
>> >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> >> @@ -1402,17 +1402,21 @@ ixgbe_rx_scan_hw_ring(struct ixgbe_rx_queue *rxq)
>> >>       for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
>> >>            i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
>> >>               /* Read desc statuses backwards to avoid race condition */
>> >> -             for (j = LOOK_AHEAD-1; j >= 0; --j)
>> >> +             for (j = LOOK_AHEAD - 1; j >= 0; --j) {
>> >>                       s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
>> >> -
>> >> -             for (j = LOOK_AHEAD - 1; j >= 0; --j)
>> >>                       pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>> >>                                                      lo_dword.data);
>> >> +             }
>> >> +
>> >> +             rte_smp_rmb();
>> >
>> > If reads can be reordered, shouldn't we fill pkt_info[] after smp_rmb() here?
>>
>> The barrier is to forbid the reordering from the following readings,
>> which will count the number of actual received packets.
>
> What I meant is that if you'll keep reading from both rxdp[].wb.lower and rxdp[].wb.upper
> before rmb, then nothing would prevent cpu from reorder these reads in any way it likes
> (if we are talking about cpus with read reordering allowed), right?
> So it can end up with the following order:
>
> rxdp[N].wb.lower
> rxdp[N].wb.upper
>
> or even:
>
> rxdp[N-1].wb.lower
> rxdp[N].wb.lower
> rxdp[N-1].wb.upper
> rxdp[N].wb.upper
>
> In such cases pkt_info[] may contain invalid data.

Yes, it's possible. I'll send v2.

Thanks!

>
>> And as wb.uper and wb.lower of one descriptor are in the same
>> cacheline, could it be better to read them at the same time?.
>
> It could be, but I think for the sake of data integrity we have to make sure that
> cpu would never read any other RXD field before wb.upper. status_error, see above.
>
> BTW, the following code might re-read both wb.upper and wb.lower anyway.
> So I don't think you'll save many cycles here anyway.
>
>>
>> > As another nit - with rmb() in and because you are looking the first gap in s[] now,
>> > no need to read TXDs in backward order.
>>
>> Reading backward is just to keep as it is for x86 platform.
>
> With the change you introducing, I don't think it is necessary any more.
>
> Konstantin
>
>>
>> > How it looks to me (as a suggestion):
>> >
>> > for (j = 0; j != LOOK_AHEAD; j++)
>> >         s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
>> >
>> > rte_smp_rmb();
>> >
>> > for (j = 0; j < LOOK_AHEAD && (s[j] & IXGBE_RXDADV_STAT_DD) != 0; j++)
>> >         ;
>> >
>> > for (j = 0; j < nb_dd; ++j) {
>> >         pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.lo_dword.data);
>> >                ....
>> >
>> > Konstantin
>> >
>> >
>> >>
>> >>               /* Compute how many status bits were set */
>> >>               nb_dd = 0;
>> >>               for (j = 0; j < LOOK_AHEAD; ++j)
>> >> -                     nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
>> >> +                     if (s[j] & IXGBE_RXDADV_STAT_DD)
>> >> +                             ++nb_dd;
>> >> +                     else
>> >> +                             break;
>> >>
>> >>               nb_rx += nb_dd;
>> >>
>> >> --
>> >> 2.4.11
>> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2016-12-19  6:09 [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Jianbo Liu
  2016-12-19  6:09 ` [dpdk-dev] [PATCH 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
  2017-02-01 16:19 ` [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ananyev, Konstantin
@ 2017-02-04  9:37 ` Jianbo Liu
  2017-02-04  9:37   ` [dpdk-dev] [PATCH v2 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
  2017-02-04 13:26   ` [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ananyev, Konstantin
  2017-02-04 16:37 ` [dpdk-dev] [PATCH v3 " Jianbo Liu
  2017-02-09  4:05 ` [dpdk-dev] [PATCH v4 " Jianbo Liu
  4 siblings, 2 replies; 25+ messages in thread
From: Jianbo Liu @ 2017-02-04  9:37 UTC (permalink / raw)
  To: dev, helin.zhang, konstantin.ananyev, jerin.jacob; +Cc: Jianbo Liu

To get better performance, Rx bulk alloc recv function will scan 8 descs
in one time, but the statuses are not consistent on ARM platform because
the memory allocated for Rx descriptors is cacheable hugepages.
This patch is to calculate the number of received packets by scan DD bit
sequentially, and stops when meeting the first packet with DD bit unset.

Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
---
 drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 36f1c02..613890e 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))
 	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
 	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
 		/* Read desc statuses backwards to avoid race condition */
-		for (j = LOOK_AHEAD-1; j >= 0; --j)
+		for (j = 0; j < LOOK_AHEAD; j++)
 			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
 
-		for (j = LOOK_AHEAD - 1; j >= 0; --j)
-			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
-						       lo_dword.data);
+		rte_smp_rmb();
 
 		/* Compute how many status bits were set */
-		nb_dd = 0;
-		for (j = 0; j < LOOK_AHEAD; ++j)
-			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
+		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&
+				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)
+			;
+
+		for (j = 0; j < nb_dd; j++)
+			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
+						       lo_dword.data);
 
 		nb_rx += nb_dd;
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [dpdk-dev] [PATCH v2 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD
  2017-02-04  9:37 ` [dpdk-dev] [PATCH v2 " Jianbo Liu
@ 2017-02-04  9:37   ` Jianbo Liu
  2017-02-04 13:26   ` [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ananyev, Konstantin
  1 sibling, 0 replies; 25+ messages in thread
From: Jianbo Liu @ 2017-02-04  9:37 UTC (permalink / raw)
  To: dev, helin.zhang, konstantin.ananyev, jerin.jacob; +Cc: Jianbo Liu

vPMD will check 4 descs in one time, but the statuses are not consistent
because the memory allocated for RX descriptors is cacheable huagepage.
This patch is to calculate the number of received packets by scann DD bit
sequentially, and stops when meeting the first packet with DD bit unset.

Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
---
 drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
index f96cc85..0b1338d 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
@@ -196,7 +196,6 @@
 	struct ixgbe_rx_entry *sw_ring;
 	uint16_t nb_pkts_recd;
 	int pos;
-	uint64_t var;
 	uint8x16_t shuf_msk = {
 		0xFF, 0xFF,
 		0xFF, 0xFF,  /* skip 32 bits pkt_type */
@@ -255,6 +254,7 @@
 		uint64x2_t mbp1, mbp2;
 		uint8x16_t staterr;
 		uint16x8_t tmp;
+		uint32_t var = 0;
 		uint32_t stat;
 
 		/* B.1 load 1 mbuf point */
@@ -349,11 +349,19 @@
 		vst1q_u8((uint8_t *)&rx_pkts[pos]->rx_descriptor_fields1,
 			 pkt_mb1);
 
+		stat &= IXGBE_VPMD_DESC_DD_MASK;
+
 		/* C.4 calc avaialbe number of desc */
-		var =  __builtin_popcount(stat & IXGBE_VPMD_DESC_DD_MASK);
-		nb_pkts_recd += var;
-		if (likely(var != RTE_IXGBE_DESCS_PER_LOOP))
+		if (likely(var != IXGBE_VPMD_DESC_DD_MASK)) {
+			while (stat & 0x01) {
+				++var;
+				stat = stat >> 8;
+			}
+			nb_pkts_recd += var;
 			break;
+		} else {
+			nb_pkts_recd += RTE_IXGBE_DESCS_PER_LOOP;
+		}
 	}
 
 	/* Update our internal tail pointer */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2017-02-04  9:37 ` [dpdk-dev] [PATCH v2 " Jianbo Liu
  2017-02-04  9:37   ` [dpdk-dev] [PATCH v2 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
@ 2017-02-04 13:26   ` Ananyev, Konstantin
  2017-02-08 18:02     ` Ferruh Yigit
  1 sibling, 1 reply; 25+ messages in thread
From: Ananyev, Konstantin @ 2017-02-04 13:26 UTC (permalink / raw)
  To: Jianbo Liu, dev, Zhang, Helin, jerin.jacob

> 
> To get better performance, Rx bulk alloc recv function will scan 8 descs
> in one time, but the statuses are not consistent on ARM platform because
> the memory allocated for Rx descriptors is cacheable hugepages.
> This patch is to calculate the number of received packets by scan DD bit
> sequentially, and stops when meeting the first packet with DD bit unset.
> 
> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> ---
>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index 36f1c02..613890e 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))
>  	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
>  	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
>  		/* Read desc statuses backwards to avoid race condition */
> -		for (j = LOOK_AHEAD-1; j >= 0; --j)
> +		for (j = 0; j < LOOK_AHEAD; j++)
>  			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
> 
> -		for (j = LOOK_AHEAD - 1; j >= 0; --j)
> -			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
> -						       lo_dword.data);
> +		rte_smp_rmb();
> 
>  		/* Compute how many status bits were set */
> -		nb_dd = 0;
> -		for (j = 0; j < LOOK_AHEAD; ++j)
> -			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
> +		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&
> +				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)
> +			;
> +
> +		for (j = 0; j < nb_dd; j++)
> +			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
> +						       lo_dword.data);
> 
>  		nb_rx += nb_dd;
> 
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 1.8.3.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [dpdk-dev] [PATCH v3 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2016-12-19  6:09 [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Jianbo Liu
                   ` (2 preceding siblings ...)
  2017-02-04  9:37 ` [dpdk-dev] [PATCH v2 " Jianbo Liu
@ 2017-02-04 16:37 ` Jianbo Liu
  2017-02-04 16:37   ` [dpdk-dev] [PATCH v3 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
  2017-02-04 16:39   ` [dpdk-dev] [PATCH v3 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Jianbo Liu
  2017-02-09  4:05 ` [dpdk-dev] [PATCH v4 " Jianbo Liu
  4 siblings, 2 replies; 25+ messages in thread
From: Jianbo Liu @ 2017-02-04 16:37 UTC (permalink / raw)
  To: dev, helin.zhang, konstantin.ananyev, jerin.jacob; +Cc: Jianbo Liu

To get better performance, Rx bulk alloc recv function will scan 8 descs
in one time, but the statuses are not consistent on ARM platform because
the memory allocated for Rx descriptors is cacheable hugepages.
This patch is to calculate the number of received packets by scan DD bit
sequentially, and stops when meeting the first packet with DD bit unset.

Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
---
 drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 36f1c02..613890e 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))
 	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
 	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
 		/* Read desc statuses backwards to avoid race condition */
-		for (j = LOOK_AHEAD-1; j >= 0; --j)
+		for (j = 0; j < LOOK_AHEAD; j++)
 			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
 
-		for (j = LOOK_AHEAD - 1; j >= 0; --j)
-			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
-						       lo_dword.data);
+		rte_smp_rmb();
 
 		/* Compute how many status bits were set */
-		nb_dd = 0;
-		for (j = 0; j < LOOK_AHEAD; ++j)
-			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
+		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&
+				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)
+			;
+
+		for (j = 0; j < nb_dd; j++)
+			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
+						       lo_dword.data);
 
 		nb_rx += nb_dd;
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [dpdk-dev] [PATCH v3 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD
  2017-02-04 16:37 ` [dpdk-dev] [PATCH v3 " Jianbo Liu
@ 2017-02-04 16:37   ` Jianbo Liu
  2017-02-04 16:39   ` [dpdk-dev] [PATCH v3 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Jianbo Liu
  1 sibling, 0 replies; 25+ messages in thread
From: Jianbo Liu @ 2017-02-04 16:37 UTC (permalink / raw)
  To: dev, helin.zhang, konstantin.ananyev, jerin.jacob; +Cc: Jianbo Liu

vPMD will check 4 descs in one time, but the statuses are not consistent
because the memory allocated for RX descriptors is cacheable huagepage.
This patch is to calculate the number of received packets by scann DD bit
sequentially, and stops when meeting the first packet with DD bit unset.

Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
---
 drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
index f96cc85..2a61322 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
@@ -196,7 +196,6 @@
 	struct ixgbe_rx_entry *sw_ring;
 	uint16_t nb_pkts_recd;
 	int pos;
-	uint64_t var;
 	uint8x16_t shuf_msk = {
 		0xFF, 0xFF,
 		0xFF, 0xFF,  /* skip 32 bits pkt_type */
@@ -255,15 +254,15 @@
 		uint64x2_t mbp1, mbp2;
 		uint8x16_t staterr;
 		uint16x8_t tmp;
+		uint32_t var = 0;
 		uint32_t stat;
 
 		/* B.1 load 1 mbuf point */
 		mbp1 = vld1q_u64((uint64_t *)&sw_ring[pos]);
 
-		/* Read desc statuses backwards to avoid race condition */
-		/* A.1 load 4 pkts desc */
-		descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
-		rte_rmb();
+		/* A.1 load 1 pkts desc */
+		descs[0] =  vld1q_u64((uint64_t *)(rxdp));
+		rte_smp_rmb();
 
 		/* B.2 copy 2 mbuf point into rx_pkts  */
 		vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
@@ -271,10 +270,11 @@
 		/* B.1 load 1 mbuf point */
 		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
 
-		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
-		/* B.1 load 2 mbuf point */
 		descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
-		descs[0] =  vld1q_u64((uint64_t *)(rxdp));
+
+		/* A.1 load 2 pkts descs */
+		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
+		descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
 
 		/* B.2 copy 2 mbuf point into rx_pkts  */
 		vst1q_u64((uint64_t *)&rx_pkts[pos + 2], mbp2);
@@ -349,11 +349,19 @@
 		vst1q_u8((uint8_t *)&rx_pkts[pos]->rx_descriptor_fields1,
 			 pkt_mb1);
 
+		stat &= IXGBE_VPMD_DESC_DD_MASK;
+
 		/* C.4 calc avaialbe number of desc */
-		var =  __builtin_popcount(stat & IXGBE_VPMD_DESC_DD_MASK);
-		nb_pkts_recd += var;
-		if (likely(var != RTE_IXGBE_DESCS_PER_LOOP))
+		if (likely(stat != IXGBE_VPMD_DESC_DD_MASK)) {
+			while (stat & 0x01) {
+				++var;
+				stat = stat >> 8;
+			}
+			nb_pkts_recd += var;
 			break;
+		} else {
+			nb_pkts_recd += RTE_IXGBE_DESCS_PER_LOOP;
+		}
 	}
 
 	/* Update our internal tail pointer */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2017-02-04 16:37 ` [dpdk-dev] [PATCH v3 " Jianbo Liu
  2017-02-04 16:37   ` [dpdk-dev] [PATCH v3 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
@ 2017-02-04 16:39   ` Jianbo Liu
  1 sibling, 0 replies; 25+ messages in thread
From: Jianbo Liu @ 2017-02-04 16:39 UTC (permalink / raw)
  To: dev, Zhang, Helin, Ananyev, Konstantin, Jerin Jacob; +Cc: Jianbo Liu

On 5 February 2017 at 00:37, Jianbo Liu <jianbo.liu@linaro.org> wrote:
> To get better performance, Rx bulk alloc recv function will scan 8 descs
> in one time, but the statuses are not consistent on ARM platform because
> the memory allocated for Rx descriptors is cacheable hugepages.
> This patch is to calculate the number of received packets by scan DD bit
> sequentially, and stops when meeting the first packet with DD bit unset.
>
> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> ---
>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index 36f1c02..613890e 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

There is no change for this patch from v2 to v3.
But the other in this patchset, reading desc statuses is changed to be
in order, not backward.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2017-02-04 13:26   ` [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ananyev, Konstantin
@ 2017-02-08 18:02     ` Ferruh Yigit
  2017-02-08 18:53       ` Ananyev, Konstantin
  0 siblings, 1 reply; 25+ messages in thread
From: Ferruh Yigit @ 2017-02-08 18:02 UTC (permalink / raw)
  To: Ananyev, Konstantin, Jianbo Liu, dev, Zhang, Helin, jerin.jacob

On 2/4/2017 1:26 PM, Ananyev, Konstantin wrote:
>>
>> To get better performance, Rx bulk alloc recv function will scan 8 descs
>> in one time, but the statuses are not consistent on ARM platform because
>> the memory allocated for Rx descriptors is cacheable hugepages.
>> This patch is to calculate the number of received packets by scan DD bit
>> sequentially, and stops when meeting the first packet with DD bit unset.
>>
>> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
>> ---
>>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------
>>  1 file changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index 36f1c02..613890e 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))
>>  	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
>>  	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
>>  		/* Read desc statuses backwards to avoid race condition */
>> -		for (j = LOOK_AHEAD-1; j >= 0; --j)
>> +		for (j = 0; j < LOOK_AHEAD; j++)
>>  			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
>>
>> -		for (j = LOOK_AHEAD - 1; j >= 0; --j)
>> -			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>> -						       lo_dword.data);
>> +		rte_smp_rmb();
>>
>>  		/* Compute how many status bits were set */
>> -		nb_dd = 0;
>> -		for (j = 0; j < LOOK_AHEAD; ++j)
>> -			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
>> +		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&
>> +				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)
>> +			;
>> +
>> +		for (j = 0; j < nb_dd; j++)
>> +			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>> +						       lo_dword.data);
>>
>>  		nb_rx += nb_dd;
>>
>> --
> 
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

Hi Konstantin,

Is the ack valid for v3 and both patches?

Thanks,
ferruh

> 
>> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2017-02-08 18:02     ` Ferruh Yigit
@ 2017-02-08 18:53       ` Ananyev, Konstantin
  2017-02-08 19:53         ` Ananyev, Konstantin
  0 siblings, 1 reply; 25+ messages in thread
From: Ananyev, Konstantin @ 2017-02-08 18:53 UTC (permalink / raw)
  To: Yigit, Ferruh, Jianbo Liu, dev, Zhang, Helin, jerin.jacob

Hi Ferruh,

> 
> On 2/4/2017 1:26 PM, Ananyev, Konstantin wrote:
> >>
> >> To get better performance, Rx bulk alloc recv function will scan 8 descs
> >> in one time, but the statuses are not consistent on ARM platform because
> >> the memory allocated for Rx descriptors is cacheable hugepages.
> >> This patch is to calculate the number of received packets by scan DD bit
> >> sequentially, and stops when meeting the first packet with DD bit unset.
> >>
> >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> >> ---
> >>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------
> >>  1 file changed, 9 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> index 36f1c02..613890e 100644
> >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> @@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))
> >>  	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
> >>  	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
> >>  		/* Read desc statuses backwards to avoid race condition */
> >> -		for (j = LOOK_AHEAD-1; j >= 0; --j)
> >> +		for (j = 0; j < LOOK_AHEAD; j++)
> >>  			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
> >>
> >> -		for (j = LOOK_AHEAD - 1; j >= 0; --j)
> >> -			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
> >> -						       lo_dword.data);
> >> +		rte_smp_rmb();
> >>
> >>  		/* Compute how many status bits were set */
> >> -		nb_dd = 0;
> >> -		for (j = 0; j < LOOK_AHEAD; ++j)
> >> -			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
> >> +		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&
> >> +				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)
> >> +			;
> >> +
> >> +		for (j = 0; j < nb_dd; j++)
> >> +			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
> >> +						       lo_dword.data);
> >>
> >>  		nb_rx += nb_dd;
> >>
> >> --
> >
> > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> 
> Hi Konstantin,
> 
> Is the ack valid for v3 and both patches?

No, I didn't look into the second one in details.
It is ARM specific, and I left it for people who are more familiar with ARM then me :)
Konstantin

> 
> Thanks,
> ferruh
> 
> >
> >> 1.8.3.1
> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2017-02-08 18:53       ` Ananyev, Konstantin
@ 2017-02-08 19:53         ` Ananyev, Konstantin
  2017-02-09  3:49           ` Jianbo Liu
  0 siblings, 1 reply; 25+ messages in thread
From: Ananyev, Konstantin @ 2017-02-08 19:53 UTC (permalink / raw)
  To: Ananyev, Konstantin, Yigit, Ferruh, Jianbo Liu, dev, Zhang,
	Helin, jerin.jacob



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin
> Sent: Wednesday, February 8, 2017 6:54 PM
> To: Yigit, Ferruh <ferruh.yigit@intel.com>; Jianbo Liu <jianbo.liu@linaro.org>; dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>;
> jerin.jacob@caviumnetworks.com
> Subject: Re: [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
> 
> Hi Ferruh,
> 
> >
> > On 2/4/2017 1:26 PM, Ananyev, Konstantin wrote:
> > >>
> > >> To get better performance, Rx bulk alloc recv function will scan 8 descs
> > >> in one time, but the statuses are not consistent on ARM platform because
> > >> the memory allocated for Rx descriptors is cacheable hugepages.
> > >> This patch is to calculate the number of received packets by scan DD bit
> > >> sequentially, and stops when meeting the first packet with DD bit unset.
> > >>
> > >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> > >> ---
> > >>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------
> > >>  1 file changed, 9 insertions(+), 7 deletions(-)
> > >>
> > >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> > >> index 36f1c02..613890e 100644
> > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > >> @@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))
> > >>  	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
> > >>  	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
> > >>  		/* Read desc statuses backwards to avoid race condition */
> > >> -		for (j = LOOK_AHEAD-1; j >= 0; --j)
> > >> +		for (j = 0; j < LOOK_AHEAD; j++)
> > >>  			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
> > >>
> > >> -		for (j = LOOK_AHEAD - 1; j >= 0; --j)
> > >> -			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
> > >> -						       lo_dword.data);
> > >> +		rte_smp_rmb();
> > >>
> > >>  		/* Compute how many status bits were set */
> > >> -		nb_dd = 0;
> > >> -		for (j = 0; j < LOOK_AHEAD; ++j)
> > >> -			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
> > >> +		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&
> > >> +				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)
> > >> +			;
> > >> +
> > >> +		for (j = 0; j < nb_dd; j++)
> > >> +			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
> > >> +						       lo_dword.data);
> > >>
> > >>  		nb_rx += nb_dd;
> > >>
> > >> --
> > >
> > > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> >
> > Hi Konstantin,
> >
> > Is the ack valid for v3 and both patches?
> 
> No, I didn't look into the second one in details.
> It is ARM specific, and I left it for people who are more familiar with ARM then me :)
> Konstantin

Actually, I had a quick look after your mail.

+		/* A.1 load 1 pkts desc */
+		descs[0] =  vld1q_u64((uint64_t *)(rxdp));
+		rte_smp_rmb();

 		/* B.2 copy 2 mbuf point into rx_pkts  */
 		vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
@@ -271,10 +270,11 @@ 
 		/* B.1 load 1 mbuf point */
 		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
 
-		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
-		/* B.1 load 2 mbuf point */
 		descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
-		descs[0] =  vld1q_u64((uint64_t *)(rxdp));
+
+		/* A.1 load 2 pkts descs */
+		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
+		descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));

Assuming that on all ARM-NEON platforms 16B reads are atomic,
I think there is no need for smp_rmb() after the desc[0] read.
What looks more appropriate to me:

descs[0] =  vld1q_u64((uint64_t *)(rxdp));
descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));

rte_smp_rmb();

...

But, as I said would be good if some ARM guys have a look here.
Konstantin


> 
> >
> > Thanks,
> > ferruh
> >
> > >
> > >> 1.8.3.1
> > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2017-02-08 19:53         ` Ananyev, Konstantin
@ 2017-02-09  3:49           ` Jianbo Liu
  0 siblings, 0 replies; 25+ messages in thread
From: Jianbo Liu @ 2017-02-09  3:49 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: Yigit, Ferruh, dev, Zhang, Helin, jerin.jacob

On 9 February 2017 at 03:53, Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin
>> Sent: Wednesday, February 8, 2017 6:54 PM
>> To: Yigit, Ferruh <ferruh.yigit@intel.com>; Jianbo Liu <jianbo.liu@linaro.org>; dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>;
>> jerin.jacob@caviumnetworks.com
>> Subject: Re: [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
>>
>> Hi Ferruh,
>>
>> >
>> > On 2/4/2017 1:26 PM, Ananyev, Konstantin wrote:
>> > >>
>> > >> To get better performance, Rx bulk alloc recv function will scan 8 descs
>> > >> in one time, but the statuses are not consistent on ARM platform because
>> > >> the memory allocated for Rx descriptors is cacheable hugepages.
>> > >> This patch is to calculate the number of received packets by scan DD bit
>> > >> sequentially, and stops when meeting the first packet with DD bit unset.
>> > >>
>> > >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
>> > >> ---
>> > >>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------
>> > >>  1 file changed, 9 insertions(+), 7 deletions(-)
>> > >>
>> > >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> > >> index 36f1c02..613890e 100644
>> > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> > >> @@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))
>> > >>          for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
>> > >>               i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
>> > >>                  /* Read desc statuses backwards to avoid race condition */
>> > >> -                for (j = LOOK_AHEAD-1; j >= 0; --j)
>> > >> +                for (j = 0; j < LOOK_AHEAD; j++)
>> > >>                          s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
>> > >>
>> > >> -                for (j = LOOK_AHEAD - 1; j >= 0; --j)
>> > >> -                        pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>> > >> -                                                       lo_dword.data);
>> > >> +                rte_smp_rmb();
>> > >>
>> > >>                  /* Compute how many status bits were set */
>> > >> -                nb_dd = 0;
>> > >> -                for (j = 0; j < LOOK_AHEAD; ++j)
>> > >> -                        nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
>> > >> +                for (nb_dd = 0; nb_dd < LOOK_AHEAD &&
>> > >> +                                (s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)
>> > >> +                        ;
>> > >> +
>> > >> +                for (j = 0; j < nb_dd; j++)
>> > >> +                        pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>> > >> +                                                       lo_dword.data);
>> > >>
>> > >>                  nb_rx += nb_dd;
>> > >>
>> > >> --
>> > >
>> > > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>> >
>> > Hi Konstantin,
>> >
>> > Is the ack valid for v3 and both patches?
>>
>> No, I didn't look into the second one in details.
>> It is ARM specific, and I left it for people who are more familiar with ARM then me :)
>> Konstantin
>
> Actually, I had a quick look after your mail.
>
> +               /* A.1 load 1 pkts desc */
> +               descs[0] =  vld1q_u64((uint64_t *)(rxdp));
> +               rte_smp_rmb();
>
>                 /* B.2 copy 2 mbuf point into rx_pkts  */
>                 vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
> @@ -271,10 +270,11 @@
>                 /* B.1 load 1 mbuf point */
>                 mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
>
> -               descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
> -               /* B.1 load 2 mbuf point */
>                 descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
> -               descs[0] =  vld1q_u64((uint64_t *)(rxdp));
> +
> +               /* A.1 load 2 pkts descs */
> +               descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
> +               descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
>
> Assuming that on all ARM-NEON platforms 16B reads are atomic,
> I think there is no need for smp_rmb() after the desc[0] read.
> What looks more appropriate to me:

With checking DDs in sequence, it doesn't matter much where the rmb is.
But there is a little performance improvement (0.02%) in my testing
with your suggestion.
So I'll send a new version. Thanks!

>
> descs[0] =  vld1q_u64((uint64_t *)(rxdp));
> descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
> descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
> descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
>
> rte_smp_rmb();
>
> ...
>
> But, as I said would be good if some ARM guys have a look here.
> Konstantin
>
>
>>
>> >
>> > Thanks,
>> > ferruh
>> >
>> > >
>> > >> 1.8.3.1
>> > >
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [dpdk-dev] [PATCH v4 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2016-12-19  6:09 [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Jianbo Liu
                   ` (3 preceding siblings ...)
  2017-02-04 16:37 ` [dpdk-dev] [PATCH v3 " Jianbo Liu
@ 2017-02-09  4:05 ` Jianbo Liu
  2017-02-09  4:05   ` [dpdk-dev] [PATCH v4 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
  2017-02-09 12:39   ` [dpdk-dev] [PATCH v4 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ferruh Yigit
  4 siblings, 2 replies; 25+ messages in thread
From: Jianbo Liu @ 2017-02-09  4:05 UTC (permalink / raw)
  To: dev, helin.zhang, konstantin.ananyev, jerin.jacob; +Cc: Jianbo Liu

To get better performance, Rx bulk alloc recv function will scan 8 descs
in one time, but the statuses are not consistent on ARM platform because
the memory allocated for Rx descriptors is cacheable hugepages.
This patch is to calculate the number of received packets by scan DD bit
sequentially, and stops when meeting the first packet with DD bit unset.

Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
---
 drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 36f1c02..613890e 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))
 	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
 	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
 		/* Read desc statuses backwards to avoid race condition */
-		for (j = LOOK_AHEAD-1; j >= 0; --j)
+		for (j = 0; j < LOOK_AHEAD; j++)
 			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
 
-		for (j = LOOK_AHEAD - 1; j >= 0; --j)
-			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
-						       lo_dword.data);
+		rte_smp_rmb();
 
 		/* Compute how many status bits were set */
-		nb_dd = 0;
-		for (j = 0; j < LOOK_AHEAD; ++j)
-			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
+		for (nb_dd = 0; nb_dd < LOOK_AHEAD &&
+				(s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)
+			;
+
+		for (j = 0; j < nb_dd; j++)
+			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
+						       lo_dword.data);
 
 		nb_rx += nb_dd;
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [dpdk-dev] [PATCH v4 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD
  2017-02-09  4:05 ` [dpdk-dev] [PATCH v4 " Jianbo Liu
@ 2017-02-09  4:05   ` Jianbo Liu
  2017-02-09 12:43     ` Ferruh Yigit
  2017-02-09 12:39   ` [dpdk-dev] [PATCH v4 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ferruh Yigit
  1 sibling, 1 reply; 25+ messages in thread
From: Jianbo Liu @ 2017-02-09  4:05 UTC (permalink / raw)
  To: dev, helin.zhang, konstantin.ananyev, jerin.jacob; +Cc: Jianbo Liu

vPMD will check 4 descs in one time, but the statuses are not consistent
because the memory allocated for RX descriptors is cacheable huagepage.
This patch is to calculate the number of received packets by scann DD bit
sequentially, and stops when meeting the first packet with DD bit unset.

Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
---
 drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
index f96cc85..e2715cb 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
@@ -196,7 +196,6 @@
 	struct ixgbe_rx_entry *sw_ring;
 	uint16_t nb_pkts_recd;
 	int pos;
-	uint64_t var;
 	uint8x16_t shuf_msk = {
 		0xFF, 0xFF,
 		0xFF, 0xFF,  /* skip 32 bits pkt_type */
@@ -255,26 +254,24 @@
 		uint64x2_t mbp1, mbp2;
 		uint8x16_t staterr;
 		uint16x8_t tmp;
+		uint32_t var = 0;
 		uint32_t stat;
 
 		/* B.1 load 1 mbuf point */
 		mbp1 = vld1q_u64((uint64_t *)&sw_ring[pos]);
 
-		/* Read desc statuses backwards to avoid race condition */
-		/* A.1 load 4 pkts desc */
-		descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
-		rte_rmb();
-
 		/* B.2 copy 2 mbuf point into rx_pkts  */
 		vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
 
 		/* B.1 load 1 mbuf point */
 		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
 
-		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
-		/* B.1 load 2 mbuf point */
-		descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
+		/* A. load 4 pkts descs */
 		descs[0] =  vld1q_u64((uint64_t *)(rxdp));
+		descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
+		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
+		descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
+		rte_smp_rmb();
 
 		/* B.2 copy 2 mbuf point into rx_pkts  */
 		vst1q_u64((uint64_t *)&rx_pkts[pos + 2], mbp2);
@@ -349,11 +346,19 @@
 		vst1q_u8((uint8_t *)&rx_pkts[pos]->rx_descriptor_fields1,
 			 pkt_mb1);
 
+		stat &= IXGBE_VPMD_DESC_DD_MASK;
+
 		/* C.4 calc avaialbe number of desc */
-		var =  __builtin_popcount(stat & IXGBE_VPMD_DESC_DD_MASK);
-		nb_pkts_recd += var;
-		if (likely(var != RTE_IXGBE_DESCS_PER_LOOP))
+		if (likely(stat != IXGBE_VPMD_DESC_DD_MASK)) {
+			while (stat & 0x01) {
+				++var;
+				stat = stat >> 8;
+			}
+			nb_pkts_recd += var;
 			break;
+		} else {
+			nb_pkts_recd += RTE_IXGBE_DESCS_PER_LOOP;
+		}
 	}
 
 	/* Update our internal tail pointer */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2017-02-09  4:05 ` [dpdk-dev] [PATCH v4 " Jianbo Liu
  2017-02-09  4:05   ` [dpdk-dev] [PATCH v4 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
@ 2017-02-09 12:39   ` Ferruh Yigit
  2017-02-09 12:42     ` Ferruh Yigit
  1 sibling, 1 reply; 25+ messages in thread
From: Ferruh Yigit @ 2017-02-09 12:39 UTC (permalink / raw)
  To: Jianbo Liu, dev, Zhang, Helin, konstantin.ananyev, jerin.jacob,
	dpdk stable

On 2/9/2017 4:05 AM, Jianbo Liu wrote:
> To get better performance, Rx bulk alloc recv function will scan 8 descs
> in one time, but the statuses are not consistent on ARM platform because
> the memory allocated for Rx descriptors is cacheable hugepages.
> This patch is to calculate the number of received packets by scan DD bit
> sequentially, and stops when meeting the first packet with DD bit unset.
> 
> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>

    net/ixgbe: fix received packets number for ARM NEON

    Fixes: b20971b6cca0 ("net/ixgbe: implement vector driver for ARM")
    Cc: stable@dpdk.org

    Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
    Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>


Applied to dpdk-next-net/master, thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
  2017-02-09 12:39   ` [dpdk-dev] [PATCH v4 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ferruh Yigit
@ 2017-02-09 12:42     ` Ferruh Yigit
  0 siblings, 0 replies; 25+ messages in thread
From: Ferruh Yigit @ 2017-02-09 12:42 UTC (permalink / raw)
  To: Jianbo Liu, dev, Zhang, Helin, konstantin.ananyev, jerin.jacob,
	dpdk stable

On 2/9/2017 12:39 PM, Ferruh Yigit wrote:
> On 2/9/2017 4:05 AM, Jianbo Liu wrote:
>> To get better performance, Rx bulk alloc recv function will scan 8 descs
>> in one time, but the statuses are not consistent on ARM platform because
>> the memory allocated for Rx descriptors is cacheable hugepages.
>> This patch is to calculate the number of received packets by scan DD bit
>> sequentially, and stops when meeting the first packet with DD bit unset.
>>
>> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> 
>     net/ixgbe: fix received packets number for ARM NEON
> 
>     Fixes: b20971b6cca0 ("net/ixgbe: implement vector driver for ARM")
>     Cc: stable@dpdk.org
> 
>     Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
>     Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

Correction:
    net/ixgbe: fix received packets number for ARM

    Fixes: 7431041062b9 ("ixgbe: allow rx bulk alloc")
    Cc: stable@dpdk.org

    Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
    Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 
> 
> Applied to dpdk-next-net/master, thanks.
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD
  2017-02-09  4:05   ` [dpdk-dev] [PATCH v4 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
@ 2017-02-09 12:43     ` Ferruh Yigit
  0 siblings, 0 replies; 25+ messages in thread
From: Ferruh Yigit @ 2017-02-09 12:43 UTC (permalink / raw)
  To: Jianbo Liu, dev, helin.zhang, konstantin.ananyev, jerin.jacob,
	dpdk stable

On 2/9/2017 4:05 AM, Jianbo Liu wrote:
> vPMD will check 4 descs in one time, but the statuses are not consistent
> because the memory allocated for RX descriptors is cacheable huagepage.
> This patch is to calculate the number of received packets by scann DD bit
> sequentially, and stops when meeting the first packet with DD bit unset.
> 
> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>

    net/ixgbe: fix received packets number for ARM NEON

    Fixes: b20971b6cca0 ("net/ixgbe: implement vector driver for ARM")
    Cc: stable@dpdk.org

Applied to dpdk-next-net/master, thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-02-09 12:43 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-19  6:09 [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Jianbo Liu
2016-12-19  6:09 ` [dpdk-dev] [PATCH 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
2016-12-21 10:08   ` Jerin Jacob
2016-12-21 11:03     ` Bruce Richardson
2016-12-22  1:18       ` Jianbo Liu
2016-12-22  1:05     ` Jianbo Liu
2017-02-01 16:19 ` [dpdk-dev] [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ananyev, Konstantin
2017-02-03  6:22   ` Jianbo Liu
2017-02-03 11:38     ` Ananyev, Konstantin
2017-02-04  3:37       ` Jianbo Liu
2017-02-04  9:37 ` [dpdk-dev] [PATCH v2 " Jianbo Liu
2017-02-04  9:37   ` [dpdk-dev] [PATCH v2 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
2017-02-04 13:26   ` [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ananyev, Konstantin
2017-02-08 18:02     ` Ferruh Yigit
2017-02-08 18:53       ` Ananyev, Konstantin
2017-02-08 19:53         ` Ananyev, Konstantin
2017-02-09  3:49           ` Jianbo Liu
2017-02-04 16:37 ` [dpdk-dev] [PATCH v3 " Jianbo Liu
2017-02-04 16:37   ` [dpdk-dev] [PATCH v3 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
2017-02-04 16:39   ` [dpdk-dev] [PATCH v3 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Jianbo Liu
2017-02-09  4:05 ` [dpdk-dev] [PATCH v4 " Jianbo Liu
2017-02-09  4:05   ` [dpdk-dev] [PATCH v4 2/2] net/ixgbe: calculate correct number of received packets for ARM NEON-version vPMD Jianbo Liu
2017-02-09 12:43     ` Ferruh Yigit
2017-02-09 12:39   ` [dpdk-dev] [PATCH v4 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function Ferruh Yigit
2017-02-09 12:42     ` Ferruh Yigit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).