DPDK patches and discussions
* [dpdk-dev] [PATCH] Add missing prefetches to i40e bulk rx path
@ 2016-07-14 17:27 Vladyslav Buslov
  2016-07-14 17:27 ` [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx Vladyslav Buslov
  0 siblings, 1 reply; 12+ messages in thread
From: Vladyslav Buslov @ 2016-07-14 17:27 UTC (permalink / raw)
  To: helin.zhang, jingjing.wu; +Cc: dev

Hello,

Recently I tried to use the bulk rx function to reduce the CPU usage of rte_eth_rx_burst.
However, application performance with i40e_recv_pkts_bulk_alloc was significantly worse than with i40e_recv_pkts (3M fewer PPS, 0.5 IPC on the receiving core).

A quick investigation revealed two problems:
 - The first payload cacheline is prefetched in i40e_recv_pkts but not in i40e_recv_pkts_bulk_alloc.
 - Only the first cacheline of the next mbuf is prefetched during mbuf init in i40e_rx_alloc_bufs. This causes a cache miss when the 'next' field, which lives in mbuf cacheline1, is set to NULL.

Fixing these two small issues significantly reduced the CPU time spent in rte_eth_rx_burst and improved PPS compared to both the original i40e_recv_pkts_bulk_alloc and i40e_recv_pkts.
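
For reference, a minimal standalone check of that layout point (a hypothetical
snippet, not part of the patch; it assumes the file is built with the usual
DPDK include flags):

    /* Print where the mbuf 'next' field lives. If it falls in the second
     * cache line, prefetching only cacheline0 leaves the next = NULL store
     * in i40e_rx_alloc_bufs to take a cache miss. */
    #include <stdio.h>
    #include <stddef.h>
    #include <rte_memory.h>
    #include <rte_mbuf.h>

    int main(void)
    {
            printf("offsetof(struct rte_mbuf, next) = %zu (cache line %zu)\n",
                   offsetof(struct rte_mbuf, next),
                   offsetof(struct rte_mbuf, next) / RTE_CACHE_LINE_SIZE);
            return 0;
    }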

Regards,

Vladyslav Buslov (1):
  net/i40e: add additional prefetch instructions for bulk rx

 drivers/net/i40e/i40e_rxtx.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

-- 
2.8.3


* [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
  2016-07-14 17:27 [dpdk-dev] [PATCH] Add missing prefetches to i40e bulk rx path Vladyslav Buslov
@ 2016-07-14 17:27 ` Vladyslav Buslov
  2016-09-14 13:24   ` Ferruh Yigit
  0 siblings, 1 reply; 12+ messages in thread
From: Vladyslav Buslov @ 2016-07-14 17:27 UTC (permalink / raw)
  To: helin.zhang, jingjing.wu; +Cc: dev

Added prefetch of the first packet payload cacheline in i40e_rx_scan_hw_ring.
Added prefetch of the second mbuf cacheline in i40e_rx_alloc_bufs.

Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
---
 drivers/net/i40e/i40e_rxtx.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index d3cfb98..e493fb4 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 		/* Translate descriptor info to mbuf parameters */
 		for (j = 0; j < nb_dd; j++) {
 			mb = rxep[j].mbuf;
+			rte_prefetch0(RTE_PTR_ADD(mb->buf_addr, RTE_PKTMBUF_HEADROOM));
 			qword1 = rte_le_to_cpu_64(\
 				rxdp[j].wb.qword1.status_error_len);
 			pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
@@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
 
 	rxdp = &rxq->rx_ring[alloc_idx];
 	for (i = 0; i < rxq->rx_free_thresh; i++) {
-		if (likely(i < (rxq->rx_free_thresh - 1)))
+		if (likely(i < (rxq->rx_free_thresh - 1))) {
 			/* Prefetch next mbuf */
-			rte_prefetch0(rxep[i + 1].mbuf);
+			rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
+			rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
+		}
 
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);
-- 
2.8.3


* Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
  2016-07-14 17:27 ` [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx Vladyslav Buslov
@ 2016-09-14 13:24   ` Ferruh Yigit
  2016-10-10 13:25     ` Wu, Jingjing
  0 siblings, 1 reply; 12+ messages in thread
From: Ferruh Yigit @ 2016-09-14 13:24 UTC (permalink / raw)
  To: Vladyslav Buslov, Zhang, Helin, Wu, Jingjing; +Cc: dev

On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> Added prefetch of first packet payload cacheline in i40e_rx_scan_hw_ring
> Added prefetch of second mbuf cacheline in i40e_rx_alloc_bufs
> 
> Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
> ---
>  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> index d3cfb98..e493fb4 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
>                 /* Translate descriptor info to mbuf parameters */
>                 for (j = 0; j < nb_dd; j++) {
>                         mb = rxep[j].mbuf;
> +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr, RTE_PKTMBUF_HEADROOM));
>                         qword1 = rte_le_to_cpu_64(\
>                                 rxdp[j].wb.qword1.status_error_len);
>                         pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> 
>         rxdp = &rxq->rx_ring[alloc_idx];
>         for (i = 0; i < rxq->rx_free_thresh; i++) {
> -               if (likely(i < (rxq->rx_free_thresh - 1)))
> +               if (likely(i < (rxq->rx_free_thresh - 1))) {
>                         /* Prefetch next mbuf */
> -                       rte_prefetch0(rxep[i + 1].mbuf);
> +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
> +               }
> 
>                 mb = rxep[i].mbuf;
>                 rte_mbuf_refcnt_set(mb, 1);
> --
> 2.8.3
> 

i40e maintainers, can you please review the patch?

Thanks,
ferruh


* Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
  2016-09-14 13:24   ` Ferruh Yigit
@ 2016-10-10 13:25     ` Wu, Jingjing
  2016-10-10 17:05       ` Vladyslav Buslov
  0 siblings, 1 reply; 12+ messages in thread
From: Wu, Jingjing @ 2016-10-10 13:25 UTC (permalink / raw)
  To: Yigit, Ferruh, Vladyslav Buslov, Zhang, Helin; +Cc: dev



> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Wednesday, September 14, 2016 9:25 PM
> To: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>; Zhang, Helin
> <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
> 
> On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > Added prefetch of first packet payload cacheline in i40e_rx_scan_hw_ring
> > Added prefetch of second mbuf cacheline in i40e_rx_alloc_bufs
> >
> > Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
> > ---
> >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > index d3cfb98..e493fb4 100644
> > --- a/drivers/net/i40e/i40e_rxtx.c
> > +++ b/drivers/net/i40e/i40e_rxtx.c
> > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> >                 /* Translate descriptor info to mbuf parameters */
> >                 for (j = 0; j < nb_dd; j++) {
> >                         mb = rxep[j].mbuf;
> > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> RTE_PKTMBUF_HEADROOM));

Why prefetch here? I think that if the application needs to deal with the packet data, it is more suitable to put the prefetch in the application.
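
For illustration, a hypothetical application-side rx loop (not existing code;
port_id/queue_id and the processing step are placeholders) that prefetches the
next packet's data itself instead of relying on the PMD:

    #include <rte_ethdev.h>
    #include <rte_ether.h>
    #include <rte_mbuf.h>
    #include <rte_prefetch.h>

    static uint16_t app_rx_burst(uint8_t port_id, uint16_t queue_id)
    {
            struct rte_mbuf *pkts[32];
            uint16_t i, nb_rx;

            nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, 32);
            for (i = 0; i < nb_rx; i++) {
                    /* Prefetch the next packet's data while this one is handled. */
                    if (i + 1 < nb_rx)
                            rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 1], void *));
                    /* First touch of packet data (the Ethernet header). */
                    struct ether_hdr *eth =
                            rte_pktmbuf_mtod(pkts[i], struct ether_hdr *);
                    (void)eth; /* real application logic would go here */
                    rte_pktmbuf_free(pkts[i]);
            }
            return nb_rx;
    }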

> >                         qword1 = rte_le_to_cpu_64(\
> >                                 rxdp[j].wb.qword1.status_error_len);
> >                         pkt_len = ((qword1 &
> I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> >
> >         rxdp = &rxq->rx_ring[alloc_idx];
> >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> >                         /* Prefetch next mbuf */
> > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
> > +               }
I agree with this change. However, when I tested it with testpmd in iofwd mode, I observed no performance increase, only a minor decrease.
Can you share with us when it benefits performance in your scenario?


Thanks
Jingjing


* Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
  2016-10-10 13:25     ` Wu, Jingjing
@ 2016-10-10 17:05       ` Vladyslav Buslov
  2016-10-11  8:51         ` Ananyev, Konstantin
  0 siblings, 1 reply; 12+ messages in thread
From: Vladyslav Buslov @ 2016-10-10 17:05 UTC (permalink / raw)
  To: Wu, Jingjing, Yigit, Ferruh, Zhang, Helin; +Cc: dev

> -----Original Message-----
> From: Wu, Jingjing [mailto:jingjing.wu@intel.com]
> Sent: Monday, October 10, 2016 4:26 PM
> To: Yigit, Ferruh; Vladyslav Buslov; Zhang, Helin
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> instructions for bulk rx
> 
> 
> 
> > -----Original Message-----
> > From: Yigit, Ferruh
> > Sent: Wednesday, September 14, 2016 9:25 PM
> > To: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>; Zhang, Helin
> > <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > instructions for bulk rx
> >
> > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > Added prefetch of first packet payload cacheline in
> > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > i40e_rx_alloc_bufs
> > >
> > > Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
> > > ---
> > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue
> *rxq)
> > >                 /* Translate descriptor info to mbuf parameters */
> > >                 for (j = 0; j < nb_dd; j++) {
> > >                         mb = rxep[j].mbuf;
> > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > RTE_PKTMBUF_HEADROOM));
> 
> Why did prefetch here? I think if application need to deal with packet, it is
> more suitable to put it in application.
> 
> > >                         qword1 = rte_le_to_cpu_64(\
> > >                                 rxdp[j].wb.qword1.status_error_len);
> > >                         pkt_len = ((qword1 &
> > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> *rxq)
> > >
> > >         rxdp = &rxq->rx_ring[alloc_idx];
> > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > >                         /* Prefetch next mbuf */
> > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
> > > +               }
> Agree with this change. And when I test it by testpmd with iofwd, no
> performance increase is observed but minor decrease.
> Can you share will us when it will benefit the performance in your scenario ?
> 
> 
> Thanks
> Jingjing

Hello Jingjing,

Thanks for code review.

My use case: we have a simple distributor thread that receives packets from a port and distributes them among worker threads according to a VLAN and MAC address hash.

While working on performance optimization we determined that most of this thread's CPU usage is inside DPDK.
As an optimization we decided to switch to the rx burst alloc function; however, that caused additional performance degradation compared to scatter rx mode.
In the profiler the two major culprits were:
  1. Access to the packet's Ethernet header in application code (cache miss).
  2. Setting the mbuf 'next' field to NULL in the DPDK i40e_rx_alloc_bufs code (this field is in the second mbuf cache line, which was not prefetched).
After applying my fixes, performance improved compared to scatter rx mode.

I assumed that the prefetch of the first cache line of packet data belongs in DPDK because it is already done in scatter rx mode (in i40e_recv_scattered_pkts).
It can be moved to the application side, but IMO it is better to be consistent across all rx modes.

Regards,
Vladyslav


* Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
  2016-10-10 17:05       ` Vladyslav Buslov
@ 2016-10-11  8:51         ` Ananyev, Konstantin
  2016-10-11  9:24           ` Vladyslav Buslov
  0 siblings, 1 reply; 12+ messages in thread
From: Ananyev, Konstantin @ 2016-10-11  8:51 UTC (permalink / raw)
  To: Vladyslav Buslov, Wu, Jingjing, Yigit, Ferruh, Zhang, Helin; +Cc: dev

Hi Vladyslav,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladyslav Buslov
> Sent: Monday, October 10, 2016 6:06 PM
> To: Wu, Jingjing <jingjing.wu@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
> 
> > -----Original Message-----
> > From: Wu, Jingjing [mailto:jingjing.wu@intel.com]
> > Sent: Monday, October 10, 2016 4:26 PM
> > To: Yigit, Ferruh; Vladyslav Buslov; Zhang, Helin
> > Cc: dev@dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > instructions for bulk rx
> >
> >
> >
> > > -----Original Message-----
> > > From: Yigit, Ferruh
> > > Sent: Wednesday, September 14, 2016 9:25 PM
> > > To: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>; Zhang, Helin
> > > <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > > instructions for bulk rx
> > >
> > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > Added prefetch of first packet payload cacheline in
> > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > i40e_rx_alloc_bufs
> > > >
> > > > Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
> > > > ---
> > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue
> > *rxq)
> > > >                 /* Translate descriptor info to mbuf parameters */
> > > >                 for (j = 0; j < nb_dd; j++) {
> > > >                         mb = rxep[j].mbuf;
> > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > RTE_PKTMBUF_HEADROOM));
> >
> > Why did prefetch here? I think if application need to deal with packet, it is
> > more suitable to put it in application.
> >
> > > >                         qword1 = rte_le_to_cpu_64(\
> > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > >                         pkt_len = ((qword1 &
> > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > *rxq)
> > > >
> > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > >                         /* Prefetch next mbuf */
> > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);

I think there are rte_mbuf_prefetch_part1()/rte_mbuf_prefetch_part2() defined in rte_mbuf.h,
specifically for that case.
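
For example, an untested sketch of how the hunk above could be reworked with
those helpers:

    -                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
    -                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
    +                       rte_mbuf_prefetch_part1(rxep[i + 1].mbuf);
    +                       rte_mbuf_prefetch_part2(rxep[i + 1].mbuf);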

> > > > +               }
> > Agree with this change. And when I test it by testpmd with iofwd, no
> > performance increase is observed but minor decrease.
> > Can you share will us when it will benefit the performance in your scenario ?
> >
> >
> > Thanks
> > Jingjing
> 
> Hello Jingjing,
> 
> Thanks for code review.
> 
> My use case: We have simple distributor thread that receives packets from port and distributes them among worker threads according to
> VLAN and MAC address hash.
> 
> While working on performance optimization we determined that most of CPU usage of this thread is in DPDK.
> As and optimization we decided to switch to rx burst alloc function, however that caused additional performance degradation compared to
> scatter rx mode.
> In profiler two major culprits were:
>   1. Access to packet data Eth header in application code. (cache miss)
>   2. Setting next packet descriptor field to NULL in DPDK i40e_rx_alloc_bufs code. (this field is in second descriptor cache line that was not
> prefetched)

I wonder what would happen if we removed the prefetches here entirely?
Would it make things better or worse (and by how much)?

> After applying my fixes performance improved compared to scatter rx mode.
> 
> I assumed that prefetch of first cache line of packet data belongs to DPDK because it is done in scatter rx mode. (in
> i40e_recv_scattered_pkts)
> It can be moved to application side but IMO it is better to be consistent across all rx modes.

I would agree with Jingjing here; the PMD should probably avoid prefetching the packet's data.
Konstantin


* Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
  2016-10-11  8:51         ` Ananyev, Konstantin
@ 2016-10-11  9:24           ` Vladyslav Buslov
  2016-10-12  0:04             ` Ananyev, Konstantin
  0 siblings, 1 reply; 12+ messages in thread
From: Vladyslav Buslov @ 2016-10-11  9:24 UTC (permalink / raw)
  To: Ananyev, Konstantin, Wu, Jingjing, Yigit, Ferruh, Zhang, Helin; +Cc: dev

Hi Konstantin,

> -----Original Message-----
> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> Sent: Tuesday, October 11, 2016 11:51 AM
> To: Vladyslav Buslov; Wu, Jingjing; Yigit, Ferruh; Zhang, Helin
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> instructions for bulk rx
> 
> Hi Vladislav,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladyslav Buslov
> > Sent: Monday, October 10, 2016 6:06 PM
> > To: Wu, Jingjing <jingjing.wu@intel.com>; Yigit, Ferruh
> > <ferruh.yigit@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > instructions for bulk rx
> >
> > > -----Original Message-----
> > > From: Wu, Jingjing [mailto:jingjing.wu@intel.com]
> > > Sent: Monday, October 10, 2016 4:26 PM
> > > To: Yigit, Ferruh; Vladyslav Buslov; Zhang, Helin
> > > Cc: dev@dpdk.org
> > > Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > > instructions for bulk rx
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Yigit, Ferruh
> > > > Sent: Wednesday, September 14, 2016 9:25 PM
> > > > To: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>; Zhang,
> > > > Helin <helin.zhang@intel.com>; Wu, Jingjing
> > > > <jingjing.wu@intel.com>
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > > > instructions for bulk rx
> > > >
> > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > Added prefetch of first packet payload cacheline in
> > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > i40e_rx_alloc_bufs
> > > > >
> > > > > Signed-off-by: Vladyslav Buslov
> > > > > <vladyslav.buslov@harmonicinc.com>
> > > > > ---
> > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct
> i40e_rx_queue
> > > *rxq)
> > > > >                 /* Translate descriptor info to mbuf parameters */
> > > > >                 for (j = 0; j < nb_dd; j++) {
> > > > >                         mb = rxep[j].mbuf;
> > > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > RTE_PKTMBUF_HEADROOM));
> > >
> > > Why did prefetch here? I think if application need to deal with
> > > packet, it is more suitable to put it in application.
> > >
> > > > >                         qword1 = rte_le_to_cpu_64(\
> > > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > > >                         pkt_len = ((qword1 &
> > > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > > *rxq)
> > > > >
> > > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > >                         /* Prefetch next mbuf */
> > > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > +                       rte_prefetch0(&rxep[i +
> > > > > + 1].mbuf->cacheline1);
> 
> I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> specially for that case.

Thanks for pointing that out.
I'll submit a new patch if you decide to move forward with this change.

> 
> > > > > +               }
> > > Agree with this change. And when I test it by testpmd with iofwd, no
> > > performance increase is observed but minor decrease.
> > > Can you share will us when it will benefit the performance in your
> scenario ?
> > >
> > >
> > > Thanks
> > > Jingjing
> >
> > Hello Jingjing,
> >
> > Thanks for code review.
> >
> > My use case: We have simple distributor thread that receives packets
> > from port and distributes them among worker threads according to VLAN
> and MAC address hash.
> >
> > While working on performance optimization we determined that most of
> CPU usage of this thread is in DPDK.
> > As and optimization we decided to switch to rx burst alloc function,
> > however that caused additional performance degradation compared to
> scatter rx mode.
> > In profiler two major culprits were:
> >   1. Access to packet data Eth header in application code. (cache miss)
> >   2. Setting next packet descriptor field to NULL in DPDK
> > i40e_rx_alloc_bufs code. (this field is in second descriptor cache
> > line that was not
> > prefetched)
> 
> I wonder what will happen if we'll remove any prefetches here?
> Would it make things better or worse (and by how much)?

In our case it causes a few percent PPS degradation on the next = NULL assignment, but it seems that Jingjing's test doesn't confirm it.

> 
> > After applying my fixes performance improved compared to scatter rx
> mode.
> >
> > I assumed that prefetch of first cache line of packet data belongs to
> > DPDK because it is done in scatter rx mode. (in
> > i40e_recv_scattered_pkts)
> > It can be moved to application side but IMO it is better to be consistent
> across all rx modes.
> 
> I would agree with Jingjing here, probably PMD should avoid to prefetch
> packet's data.

Actually I can see some valid use cases where it is beneficial to have this prefetch in the driver.
In our software distributor case it is trivial to just prefetch the next packet on each iteration because packets are processed one by one.
However, when we move this functionality to hardware by means of RSS/virtual functions/Flow Director (our long-term goal), worker threads will receive packets directly from the NIC rx queues.
The first operation of a worker thread is a bulk lookup in a hash table by destination MAC. This causes a cache miss on accessing each Ethernet header and can't be easily mitigated in application code.
I assume this is a ubiquitous use case for DPDK.
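
For illustration, a hypothetical worker step of that shape (placeholder names;
mac_table is assumed to be an rte_hash created elsewhere with 6-byte MAC keys,
and dispatching on the lookup result is omitted):

    #include <rte_ether.h>
    #include <rte_hash.h>
    #include <rte_mbuf.h>

    static void classify_burst(struct rte_hash *mac_table,
                               struct rte_mbuf **pkts, uint16_t nb_rx,
                               int32_t *pos)
    {
            const void *keys[64];
            uint16_t i;

            for (i = 0; i < nb_rx && i < 64; i++) {
                    struct ether_hdr *eth =
                            rte_pktmbuf_mtod(pkts[i], struct ether_hdr *);
                    /* First read of packet data; cache miss if not prefetched. */
                    keys[i] = &eth->d_addr;
            }
            rte_hash_lookup_bulk(mac_table, keys, i, pos);
            /* pos[j] then selects how pkts[j] is handled */
    }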

Regards,
Vladyslav


* Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
  2016-10-11  9:24           ` Vladyslav Buslov
@ 2016-10-12  0:04             ` Ananyev, Konstantin
  2016-10-13 10:18               ` Bruce Richardson
  0 siblings, 1 reply; 12+ messages in thread
From: Ananyev, Konstantin @ 2016-10-12  0:04 UTC (permalink / raw)
  To: Vladyslav Buslov, Wu, Jingjing, Yigit, Ferruh, Zhang, Helin; +Cc: dev

Hi Vladyslav,

> > > > >
> > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > Added prefetch of first packet payload cacheline in
> > > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > > i40e_rx_alloc_bufs
> > > > > >
> > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > <vladyslav.buslov@harmonicinc.com>
> > > > > > ---
> > > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct
> > i40e_rx_queue
> > > > *rxq)
> > > > > >                 /* Translate descriptor info to mbuf parameters */
> > > > > >                 for (j = 0; j < nb_dd; j++) {
> > > > > >                         mb = rxep[j].mbuf;
> > > > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > > RTE_PKTMBUF_HEADROOM));
> > > >
> > > > Why did prefetch here? I think if application need to deal with
> > > > packet, it is more suitable to put it in application.
> > > >
> > > > > >                         qword1 = rte_le_to_cpu_64(\
> > > > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > > > >                         pkt_len = ((qword1 &
> > > > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > > > *rxq)
> > > > > >
> > > > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > >                         /* Prefetch next mbuf */
> > > > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > > +                       rte_prefetch0(&rxep[i +
> > > > > > + 1].mbuf->cacheline1);
> >
> > I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> > specially for that case.
> 
> Thanks for pointing that out.
> I'll submit new patch if you decide to move forward with this development.
> 
> >
> > > > > > +               }
> > > > Agree with this change. And when I test it by testpmd with iofwd, no
> > > > performance increase is observed but minor decrease.
> > > > Can you share will us when it will benefit the performance in your
> > scenario ?
> > > >
> > > >
> > > > Thanks
> > > > Jingjing
> > >
> > > Hello Jingjing,
> > >
> > > Thanks for code review.
> > >
> > > My use case: We have simple distributor thread that receives packets
> > > from port and distributes them among worker threads according to VLAN
> > and MAC address hash.
> > >
> > > While working on performance optimization we determined that most of
> > CPU usage of this thread is in DPDK.
> > > As and optimization we decided to switch to rx burst alloc function,
> > > however that caused additional performance degradation compared to
> > scatter rx mode.
> > > In profiler two major culprits were:
> > >   1. Access to packet data Eth header in application code. (cache miss)
> > >   2. Setting next packet descriptor field to NULL in DPDK
> > > i40e_rx_alloc_bufs code. (this field is in second descriptor cache
> > > line that was not
> > > prefetched)
> >
> > I wonder what will happen if we'll remove any prefetches here?
> > Would it make things better or worse (and by how much)?
> 
> In our case it causes few per cent PPS degradation on next=NULL assignment but it seems that JingJing's test doesn't confirm it.
> 
> >
> > > After applying my fixes performance improved compared to scatter rx
> > mode.
> > >
> > > I assumed that prefetch of first cache line of packet data belongs to
> > > DPDK because it is done in scatter rx mode. (in
> > > i40e_recv_scattered_pkts)
> > > It can be moved to application side but IMO it is better to be consistent
> > across all rx modes.
> >
> > I would agree with Jingjing here, probably PMD should avoid to prefetch
> > packet's data.
> 
> Actually I can see some valid use cases where it is beneficial to have this prefetch in driver.
> In our sw distributor case it is trivial to just prefetch next packet on each iteration because packets are processed one by one.
> However when we move this functionality to hw by means of RSS/vfunction/FlowDirector(our long term goal) worker threads will receive
> packets directly from rx queues of NIC.
> First operation of worker thread is to perform bulk lookup in hash table by destination MAC. This will cause cache miss on accessing each
> eth header and can't be easily mitigated in application code.
> I assume it is ubiquitous use case for DPDK.

Yes, it is quite a common use case.
Though in many cases it is possible to reorder user code to hide (or minimize) that data-access latency.
On the other hand, there are scenarios where this prefetch is excessive and can cause some drop in performance.
Again, as far as I know, none of the PMDs for Intel devices prefetch packet data in simple (single-segment) RX mode.
Another thing that some people may then argue: why is only one cache line prefetched,
when some use cases might need to look at the 2nd one as well.
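
For illustration, a sketch of that last point (m is assumed to be a
struct rte_mbuf * in some application receive path):

    /* Prefetch the first two cache lines of packet data for use cases
     * that parse beyond the first 64 bytes. */
    char *p = rte_pktmbuf_mtod(m, char *);
    rte_prefetch0(p);                        /* 1st cache line of packet data */
    rte_prefetch0(p + RTE_CACHE_LINE_SIZE);  /* 2nd cache line */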

Konstantin

> 
> Regards,
> Vladyslav


* Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
  2016-10-12  0:04             ` Ananyev, Konstantin
@ 2016-10-13 10:18               ` Bruce Richardson
  2016-10-13 10:30                 ` Ananyev, Konstantin
  0 siblings, 1 reply; 12+ messages in thread
From: Bruce Richardson @ 2016-10-13 10:18 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Vladyslav Buslov, Wu, Jingjing, Yigit, Ferruh, Zhang, Helin, dev

On Wed, Oct 12, 2016 at 12:04:39AM +0000, Ananyev, Konstantin wrote:
> Hi Vladislav,
> 
> > > > > >
> > > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > > Added prefetch of first packet payload cacheline in
> > > > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > > > i40e_rx_alloc_bufs
> > > > > > >
> > > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > > <vladyslav.buslov@harmonicinc.com>
> > > > > > > ---
> > > > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct
> > > i40e_rx_queue
> > > > > *rxq)
> > > > > > >                 /* Translate descriptor info to mbuf parameters */
> > > > > > >                 for (j = 0; j < nb_dd; j++) {
> > > > > > >                         mb = rxep[j].mbuf;
> > > > > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > > > RTE_PKTMBUF_HEADROOM));
> > > > >
> > > > > Why did prefetch here? I think if application need to deal with
> > > > > packet, it is more suitable to put it in application.
> > > > >
> > > > > > >                         qword1 = rte_le_to_cpu_64(\
> > > > > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > > > > >                         pkt_len = ((qword1 &
> > > > > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > > > > *rxq)
> > > > > > >
> > > > > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > > > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > > >                         /* Prefetch next mbuf */
> > > > > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > > > +                       rte_prefetch0(&rxep[i +
> > > > > > > + 1].mbuf->cacheline1);
> > >
> > > I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> > > specially for that case.
> > 
> > Thanks for pointing that out.
> > I'll submit new patch if you decide to move forward with this development.
> > 
> > >
> > > > > > > +               }
> > > > > Agree with this change. And when I test it by testpmd with iofwd, no
> > > > > performance increase is observed but minor decrease.
> > > > > Can you share will us when it will benefit the performance in your
> > > scenario ?
> > > > >
> > > > >
> > > > > Thanks
> > > > > Jingjing
> > > >
> > > > Hello Jingjing,
> > > >
> > > > Thanks for code review.
> > > >
> > > > My use case: We have simple distributor thread that receives packets
> > > > from port and distributes them among worker threads according to VLAN
> > > and MAC address hash.
> > > >
> > > > While working on performance optimization we determined that most of
> > > CPU usage of this thread is in DPDK.
> > > > As and optimization we decided to switch to rx burst alloc function,
> > > > however that caused additional performance degradation compared to
> > > scatter rx mode.
> > > > In profiler two major culprits were:
> > > >   1. Access to packet data Eth header in application code. (cache miss)
> > > >   2. Setting next packet descriptor field to NULL in DPDK
> > > > i40e_rx_alloc_bufs code. (this field is in second descriptor cache
> > > > line that was not
> > > > prefetched)
> > >
> > > I wonder what will happen if we'll remove any prefetches here?
> > > Would it make things better or worse (and by how much)?
> > 
> > In our case it causes few per cent PPS degradation on next=NULL assignment but it seems that JingJing's test doesn't confirm it.
> > 
> > >
> > > > After applying my fixes performance improved compared to scatter rx
> > > mode.
> > > >
> > > > I assumed that prefetch of first cache line of packet data belongs to
> > > > DPDK because it is done in scatter rx mode. (in
> > > > i40e_recv_scattered_pkts)
> > > > It can be moved to application side but IMO it is better to be consistent
> > > across all rx modes.
> > >
> > > I would agree with Jingjing here, probably PMD should avoid to prefetch
> > > packet's data.
> > 
> > Actually I can see some valid use cases where it is beneficial to have this prefetch in driver.
> > In our sw distributor case it is trivial to just prefetch next packet on each iteration because packets are processed one by one.
> > However when we move this functionality to hw by means of RSS/vfunction/FlowDirector(our long term goal) worker threads will receive
> > packets directly from rx queues of NIC.
> > First operation of worker thread is to perform bulk lookup in hash table by destination MAC. This will cause cache miss on accessing each
> > eth header and can't be easily mitigated in application code.
> > I assume it is ubiquitous use case for DPDK.
> 
> Yes it is a quite common use-case.
> Though I many cases it is possible to reorder user code to hide (or minimize) that data-access latency.
> From other side there are scenarios where this prefetch is excessive and can cause some drop in performance.
> Again, as I know, none of PMDs for Intel devices prefetches packet's data in  simple (single segment) RX mode.
> Another thing that some people may argue then - why only one cache line is prefetched,
> in some use-cases might need to look at 2-nd one.  
> 
There is a build-time config setting for this behaviour for exactly the reasons
called out here - in some apps you get a benefit, in others you see a perf
hit. The default is "on", which makes sense for most cases, I think.
From common_base:

CONFIG_RTE_PMD_PACKET_PREFETCH=y

/Bruce


* Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
  2016-10-13 10:18               ` Bruce Richardson
@ 2016-10-13 10:30                 ` Ananyev, Konstantin
  2016-11-15 12:19                   ` Ferruh Yigit
  0 siblings, 1 reply; 12+ messages in thread
From: Ananyev, Konstantin @ 2016-10-13 10:30 UTC (permalink / raw)
  To: Richardson, Bruce
  Cc: Vladyslav Buslov, Wu, Jingjing, Yigit, Ferruh, Zhang, Helin, dev



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Thursday, October 13, 2016 11:19 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Vladyslav Buslov <Vladyslav.Buslov@harmonicinc.com>; Wu, Jingjing <jingjing.wu@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>;
> Zhang, Helin <helin.zhang@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
> 
> On Wed, Oct 12, 2016 at 12:04:39AM +0000, Ananyev, Konstantin wrote:
> > Hi Vladislav,
> >
> > > > > > >
> > > > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > > > Added prefetch of first packet payload cacheline in
> > > > > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > > > > i40e_rx_alloc_bufs
> > > > > > > >
> > > > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > > > <vladyslav.buslov@harmonicinc.com>
> > > > > > > > ---
> > > > > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct
> > > > i40e_rx_queue
> > > > > > *rxq)
> > > > > > > >                 /* Translate descriptor info to mbuf parameters */
> > > > > > > >                 for (j = 0; j < nb_dd; j++) {
> > > > > > > >                         mb = rxep[j].mbuf;
> > > > > > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > > > > RTE_PKTMBUF_HEADROOM));
> > > > > >
> > > > > > Why did prefetch here? I think if application need to deal with
> > > > > > packet, it is more suitable to put it in application.
> > > > > >
> > > > > > > >                         qword1 = rte_le_to_cpu_64(\
> > > > > > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > > > > > >                         pkt_len = ((qword1 &
> > > > > > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > > > > > *rxq)
> > > > > > > >
> > > > > > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > > > > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > > > >                         /* Prefetch next mbuf */
> > > > > > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > > > > +                       rte_prefetch0(&rxep[i +
> > > > > > > > + 1].mbuf->cacheline1);
> > > >
> > > > I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> > > > specially for that case.
> > >
> > > Thanks for pointing that out.
> > > I'll submit new patch if you decide to move forward with this development.
> > >
> > > >
> > > > > > > > +               }
> > > > > > Agree with this change. And when I test it by testpmd with iofwd, no
> > > > > > performance increase is observed but minor decrease.
> > > > > > Can you share will us when it will benefit the performance in your
> > > > scenario ?
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > > Jingjing
> > > > >
> > > > > Hello Jingjing,
> > > > >
> > > > > Thanks for code review.
> > > > >
> > > > > My use case: We have simple distributor thread that receives packets
> > > > > from port and distributes them among worker threads according to VLAN
> > > > and MAC address hash.
> > > > >
> > > > > While working on performance optimization we determined that most of
> > > > CPU usage of this thread is in DPDK.
> > > > > As and optimization we decided to switch to rx burst alloc function,
> > > > > however that caused additional performance degradation compared to
> > > > scatter rx mode.
> > > > > In profiler two major culprits were:
> > > > >   1. Access to packet data Eth header in application code. (cache miss)
> > > > >   2. Setting next packet descriptor field to NULL in DPDK
> > > > > i40e_rx_alloc_bufs code. (this field is in second descriptor cache
> > > > > line that was not
> > > > > prefetched)
> > > >
> > > > I wonder what will happen if we'll remove any prefetches here?
> > > > Would it make things better or worse (and by how much)?
> > >
> > > In our case it causes few per cent PPS degradation on next=NULL assignment but it seems that JingJing's test doesn't confirm it.
> > >
> > > >
> > > > > After applying my fixes performance improved compared to scatter rx
> > > > mode.
> > > > >
> > > > > I assumed that prefetch of first cache line of packet data belongs to
> > > > > DPDK because it is done in scatter rx mode. (in
> > > > > i40e_recv_scattered_pkts)
> > > > > It can be moved to application side but IMO it is better to be consistent
> > > > across all rx modes.
> > > >
> > > > I would agree with Jingjing here, probably PMD should avoid to prefetch
> > > > packet's data.
> > >
> > > Actually I can see some valid use cases where it is beneficial to have this prefetch in driver.
> > > In our sw distributor case it is trivial to just prefetch next packet on each iteration because packets are processed one by one.
> > > However when we move this functionality to hw by means of RSS/vfunction/FlowDirector(our long term goal) worker threads will
> receive
> > > packets directly from rx queues of NIC.
> > > First operation of worker thread is to perform bulk lookup in hash table by destination MAC. This will cause cache miss on accessing
> each
> > > eth header and can't be easily mitigated in application code.
> > > I assume it is ubiquitous use case for DPDK.
> >
> > Yes it is a quite common use-case.
> > Though I many cases it is possible to reorder user code to hide (or minimize) that data-access latency.
> > From other side there are scenarios where this prefetch is excessive and can cause some drop in performance.
> > Again, as I know, none of PMDs for Intel devices prefetches packet's data in  simple (single segment) RX mode.
> > Another thing that some people may argue then - why only one cache line is prefetched,
> > in some use-cases might need to look at 2-nd one.
> >
> There is a build-time config setting for this behaviour for exactly the reasons
> called out here - in some apps you get a benefit, in others you see a perf
> hit. The default is "on", which makes sense for most cases, I think.
> From common_base:
> 
> CONFIG_RTE_PMD_PACKET_PREFETCH=y$

Yes, but right now the i40e and ixgbe non-scattered RX paths (both vector and scalar) just ignore that flag.
Though yes, it might be a good thing to make them obey that flag properly.
Konstantin


* Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
  2016-10-13 10:30                 ` Ananyev, Konstantin
@ 2016-11-15 12:19                   ` Ferruh Yigit
  2016-11-15 13:27                     ` Vladyslav Buslov
  0 siblings, 1 reply; 12+ messages in thread
From: Ferruh Yigit @ 2016-11-15 12:19 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce
  Cc: Vladyslav Buslov, Wu, Jingjing, Zhang, Helin, dev

On 10/13/2016 11:30 AM, Ananyev, Konstantin wrote:

<...>

>>>>
>>>> Actually I can see some valid use cases where it is beneficial to have this prefetch in driver.
>>>> In our sw distributor case it is trivial to just prefetch next packet on each iteration because packets are processed one by one.
>>>> However when we move this functionality to hw by means of RSS/vfunction/FlowDirector(our long term goal) worker threads will
>> receive
>>>> packets directly from rx queues of NIC.
>>>> First operation of worker thread is to perform bulk lookup in hash table by destination MAC. This will cause cache miss on accessing
>> each
>>>> eth header and can't be easily mitigated in application code.
>>>> I assume it is ubiquitous use case for DPDK.
>>>
>>> Yes it is a quite common use-case.
>>> Though I many cases it is possible to reorder user code to hide (or minimize) that data-access latency.
>>> From other side there are scenarios where this prefetch is excessive and can cause some drop in performance.
>>> Again, as I know, none of PMDs for Intel devices prefetches packet's data in  simple (single segment) RX mode.
>>> Another thing that some people may argue then - why only one cache line is prefetched,
>>> in some use-cases might need to look at 2-nd one.
>>>
>> There is a build-time config setting for this behaviour for exactly the reasons
>> called out here - in some apps you get a benefit, in others you see a perf
>> hit. The default is "on", which makes sense for most cases, I think.
>> From common_base:
>>
>> CONFIG_RTE_PMD_PACKET_PREFETCH=y$
> 
> Yes, but right now i40e and ixgbe non-scattered RX (both vector and scalar) just ignore that flag.
> Though yes, might be a good thing to make them to obey that flag properly.

Hi Vladyslav,

According to Konstantin's comment, what do you think about updating the patch
to do the prefetch within a CONFIG_RTE_PMD_PACKET_PREFETCH ifdef?

But since the config option is enabled by default, the performance concern is
still valid and needs to be investigated.
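
For illustration, an untested sketch of such a guard around the payload
prefetch from the patch (assuming the option is visible to the code as
RTE_PMD_PACKET_PREFETCH, i.e. with the CONFIG_ prefix stripped as for other
build options):

    #ifdef RTE_PMD_PACKET_PREFETCH
                            rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
                                                      RTE_PKTMBUF_HEADROOM));
    #endif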

Thanks,
ferruh


* Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
  2016-11-15 12:19                   ` Ferruh Yigit
@ 2016-11-15 13:27                     ` Vladyslav Buslov
  0 siblings, 0 replies; 12+ messages in thread
From: Vladyslav Buslov @ 2016-11-15 13:27 UTC (permalink / raw)
  To: Ferruh Yigit, Ananyev, Konstantin, Richardson, Bruce
  Cc: Wu, Jingjing, Zhang, Helin, dev

> -----Original Message-----
> From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
> Sent: Tuesday, November 15, 2016 2:19 PM
> To: Ananyev, Konstantin; Richardson, Bruce
> Cc: Vladyslav Buslov; Wu, Jingjing; Zhang, Helin; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> instructions for bulk rx
> 
> On 10/13/2016 11:30 AM, Ananyev, Konstantin wrote:
> 
> <...>
> 
> >>>>
> >>>> Actually I can see some valid use cases where it is beneficial to have this
> prefetch in driver.
> >>>> In our sw distributor case it is trivial to just prefetch next packet on
> each iteration because packets are processed one by one.
> >>>> However when we move this functionality to hw by means of
> >>>> RSS/vfunction/FlowDirector(our long term goal) worker threads will
> >> receive
> >>>> packets directly from rx queues of NIC.
> >>>> First operation of worker thread is to perform bulk lookup in hash
> >>>> table by destination MAC. This will cause cache miss on accessing
> >> each
> >>>> eth header and can't be easily mitigated in application code.
> >>>> I assume it is ubiquitous use case for DPDK.
> >>>
> >>> Yes it is a quite common use-case.
> >>> Though I many cases it is possible to reorder user code to hide (or
> minimize) that data-access latency.
> >>> From other side there are scenarios where this prefetch is excessive and
> can cause some drop in performance.
> >>> Again, as I know, none of PMDs for Intel devices prefetches packet's
> data in  simple (single segment) RX mode.
> >>> Another thing that some people may argue then - why only one cache
> >>> line is prefetched, in some use-cases might need to look at 2-nd one.
> >>>
> >> There is a build-time config setting for this behaviour for exactly
> >> the reasons called out here - in some apps you get a benefit, in
> >> others you see a perf hit. The default is "on", which makes sense for most
> cases, I think.
> >> From common_base:
> >>
> >> CONFIG_RTE_PMD_PACKET_PREFETCH=y$
> >
> > Yes, but right now i40e and ixgbe non-scattered RX (both vector and scalar)
> just ignore that flag.
> > Though yes, might be a good thing to make them to obey that flag
> properly.
> 
> Hi Vladyslav,
> 
> According Konstantin's comment, what do you think updating patch to do
> prefetch within CONFIG_RTE_PMD_PACKET_PREFETCH ifdef?
> 
> But since config option is enabled by default, performance concern is still
> valid and needs to be investigated.
> 
> Thanks,
> ferruh

Hi Ferruh,

I'll update my patch according to the code review suggestions.

Regards,
Vlad
