[PATCH] examples/l3fwd: optimize packet prefetch

DPDK patches and discussions
 help / color / mirror / Atom feed

* [PATCH] examples/l3fwd: optimize packet prefetch
@ 2024-12-25  7:53 Dengdui Huang
  2024-12-25 21:21 ` Stephen Hemminger
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Dengdui Huang @ 2024-12-25  7:53 UTC (permalink / raw)
  To: dev
  Cc: wathsala.vithanage, stephen, liuyonglong, fengchengwen, haijie1,
	lihuisong

The prefetch window depending on the hardware platform. The current prefetch
policy may not be applicable to all platforms. In most cases, the number of
packets received by Rx burst is small (64 is used in most performance reports).
In L3fwd, the maximum value cannot exceed 512. Therefore, prefetching all
packets before processing can achieve better performance.

Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
---
 examples/l3fwd/l3fwd_lpm_neon.h | 42 ++++-----------------------------
 1 file changed, 5 insertions(+), 37 deletions(-)

diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
index 3c1f827424..0b51782b8c 100644
--- a/examples/l3fwd/l3fwd_lpm_neon.h
+++ b/examples/l3fwd/l3fwd_lpm_neon.h
@@ -91,53 +91,21 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
 	const int32_t m = nb_rx % FWDSTEP;
 
-	if (k) {
-		for (i = 0; i < FWDSTEP; i++) {
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
-							void *));
-		}
-		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
-			for (i = 0; i < FWDSTEP; i++) {
-				rte_prefetch0(rte_pktmbuf_mtod(
-						pkts_burst[j + i + FWDSTEP],
-						void *));
-			}
+	/* The number of packets is small. Prefetch all packets. */
+	for (i = 0; i < nb_rx; i++)
+		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
 
+	if (k) {
+		for (j = 0; j != k; j += FWDSTEP) {
 			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
 			processx4_step2(qconf, dip, ipv4_flag, portid,
 					&pkts_burst[j], &dst_port[j]);
 			if (do_step3)
 				processx4_step3(&pkts_burst[j], &dst_port[j]);
 		}
-
-		processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
-		processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
-				&dst_port[j]);
-		if (do_step3)
-			processx4_step3(&pkts_burst[j], &dst_port[j]);
-
-		j += FWDSTEP;
 	}
 
 	if (m) {
-		/* Prefetch last up to 3 packets one by one */
-		switch (m) {
-		case 3:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-			/* fallthrough */
-		case 2:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-			/* fallthrough */
-		case 1:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-		}
-		j -= m;
 		/* Classify last up to 3 packets one by one */
 		switch (m) {
 		case 3:
-- 
2.33.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] examples/l3fwd: optimize packet prefetch
  2024-12-25  7:53 [PATCH] examples/l3fwd: optimize packet prefetch Dengdui Huang
@ 2024-12-25 21:21 ` Stephen Hemminger
  2025-01-08 13:42 ` Konstantin Ananyev
  2025-01-10  9:37 ` [PATCH] examples/l3fwd: add option to set refetch offset Dengdui Huang
  2 siblings, 0 replies; 12+ messages in thread
From: Stephen Hemminger @ 2024-12-25 21:21 UTC (permalink / raw)
  To: Dengdui Huang
  Cc: dev, wathsala.vithanage, liuyonglong, fengchengwen, haijie1, lihuisong

On Wed, 25 Dec 2024 15:53:02 +0800
Dengdui Huang <huangdengdui@huawei.com> wrote:

> From: Dengdui Huang <huangdengdui@huawei.com>
> To: <dev@dpdk.org>
> CC: <wathsala.vithanage@arm.com>, <stephen@networkplumber.org>,  <liuyonglong@huawei.com>, <fengchengwen@huawei.com>, <haijie1@huawei.com>,  <lihuisong@huawei.com>
> Subject: [PATCH] examples/l3fwd: optimize packet prefetch
> Date: Wed, 25 Dec 2024 15:53:02 +0800
> X-Mailer: git-send-email 2.33.0
> 
> The prefetch window depending on the hardware platform. The current prefetch
> policy may not be applicable to all platforms. In most cases, the number of
> packets received by Rx burst is small (64 is used in most performance reports).
> In L3fwd, the maximum value cannot exceed 512. Therefore, prefetching all
> packets before processing can achieve better performance.
> 
> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
> ---

I think Vpp had a good description of how to unroll and deal with prefetch.

With larger burst sizes you don't want to prefetch the whole burst.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] examples/l3fwd: optimize packet prefetch
  2024-12-25  7:53 [PATCH] examples/l3fwd: optimize packet prefetch Dengdui Huang
  2024-12-25 21:21 ` Stephen Hemminger
@ 2025-01-08 13:42 ` Konstantin Ananyev
  2025-01-09 11:31   ` huangdengdui
  2025-01-10  9:37 ` [PATCH] examples/l3fwd: add option to set refetch offset Dengdui Huang
  2 siblings, 1 reply; 12+ messages in thread
From: Konstantin Ananyev @ 2025-01-08 13:42 UTC (permalink / raw)
  To: huangdengdui, dev
  Cc: wathsala.vithanage, stephen, liuyonglong, Fengchengwen, haijie,
	lihuisong (C)



> 
> The prefetch window depending on the hardware platform. The current prefetch
> policy may not be applicable to all platforms. In most cases, the number of
> packets received by Rx burst is small (64 is used in most performance reports).
> In L3fwd, the maximum value cannot exceed 512. Therefore, prefetching all
> packets before processing can achieve better performance.

As you mentioned 'prefetch' behavior differs a lot from one HW platform to another.
So it could easily be that changes you suggesting will cause performance
boost on one platform and degradation on another.
In fact, right now l3fwd 'prefetch' usage is a bit of mess:
- l3fwd_lpm_neon.h uses  FWDSTEP as a prefetch window.
- l3fwd_fib.c uses FIB_PREFETCH_OFFSET for that purpose
- rest of the code uses either PREFETCH_OFFSET or doesn't use 'prefetch' at all
 
Probably what we need here is some unified approach:
configurable at run-time prefetch_window_size that all code-paths will obey. 

> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
> ---
>  examples/l3fwd/l3fwd_lpm_neon.h | 42 ++++-----------------------------
>  1 file changed, 5 insertions(+), 37 deletions(-)
> 
> diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
> index 3c1f827424..0b51782b8c 100644
> --- a/examples/l3fwd/l3fwd_lpm_neon.h
> +++ b/examples/l3fwd/l3fwd_lpm_neon.h
> @@ -91,53 +91,21 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>  	const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
>  	const int32_t m = nb_rx % FWDSTEP;
> 
> -	if (k) {
> -		for (i = 0; i < FWDSTEP; i++) {
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
> -							void *));
> -		}
> -		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
> -			for (i = 0; i < FWDSTEP; i++) {
> -				rte_prefetch0(rte_pktmbuf_mtod(
> -						pkts_burst[j + i + FWDSTEP],
> -						void *));
> -			}
> +	/* The number of packets is small. Prefetch all packets. */
> +	for (i = 0; i < nb_rx; i++)
> +		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
> 
> +	if (k) {
> +		for (j = 0; j != k; j += FWDSTEP) {
>  			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
>  			processx4_step2(qconf, dip, ipv4_flag, portid,
>  					&pkts_burst[j], &dst_port[j]);
>  			if (do_step3)
>  				processx4_step3(&pkts_burst[j], &dst_port[j]);
>  		}
> -
> -		processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
> -		processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
> -				&dst_port[j]);
> -		if (do_step3)
> -			processx4_step3(&pkts_burst[j], &dst_port[j]);
> -
> -		j += FWDSTEP;
>  	}
> 
>  	if (m) {
> -		/* Prefetch last up to 3 packets one by one */
> -		switch (m) {
> -		case 3:
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> -							void *));
> -			j++;
> -			/* fallthrough */
> -		case 2:
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> -							void *));
> -			j++;
> -			/* fallthrough */
> -		case 1:
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> -							void *));
> -			j++;
> -		}
> -		j -= m;
>  		/* Classify last up to 3 packets one by one */
>  		switch (m) {
>  		case 3:
> --
> 2.33.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] examples/l3fwd: optimize packet prefetch
  2025-01-08 13:42 ` Konstantin Ananyev
@ 2025-01-09 11:31   ` huangdengdui
  0 siblings, 0 replies; 12+ messages in thread
From: huangdengdui @ 2025-01-09 11:31 UTC (permalink / raw)
  To: Konstantin Ananyev, dev
  Cc: wathsala.vithanage, stephen, liuyonglong, Fengchengwen, haijie,
	lihuisong (C)


On 2025/1/8 21:42, Konstantin Ananyev wrote:
> 
> 
>>
>> The prefetch window depending on the hardware platform. The current prefetch
>> policy may not be applicable to all platforms. In most cases, the number of
>> packets received by Rx burst is small (64 is used in most performance reports).
>> In L3fwd, the maximum value cannot exceed 512. Therefore, prefetching all
>> packets before processing can achieve better performance.
> 
> As you mentioned 'prefetch' behavior differs a lot from one HW platform to another.
> So it could easily be that changes you suggesting will cause performance
> boost on one platform and degradation on another.
> In fact, right now l3fwd 'prefetch' usage is a bit of mess:
> - l3fwd_lpm_neon.h uses  FWDSTEP as a prefetch window.
> - l3fwd_fib.c uses FIB_PREFETCH_OFFSET for that purpose
> - rest of the code uses either PREFETCH_OFFSET or doesn't use 'prefetch' at all
>  
> Probably what we need here is some unified approach:
> configurable at run-time prefetch_window_size that all code-paths will obey. 

Agreed, I'll add a parameter to configure the prefetch window.

> 
>> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
>> ---
>>  examples/l3fwd/l3fwd_lpm_neon.h | 42 ++++-----------------------------
>>  1 file changed, 5 insertions(+), 37 deletions(-)
>>
>> diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
>> index 3c1f827424..0b51782b8c 100644
>> --- a/examples/l3fwd/l3fwd_lpm_neon.h
>> +++ b/examples/l3fwd/l3fwd_lpm_neon.h
>> @@ -91,53 +91,21 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>>  	const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
>>  	const int32_t m = nb_rx % FWDSTEP;
>>
>> -	if (k) {
>> -		for (i = 0; i < FWDSTEP; i++) {
>> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
>> -							void *));
>> -		}
>> -		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
>> -			for (i = 0; i < FWDSTEP; i++) {
>> -				rte_prefetch0(rte_pktmbuf_mtod(
>> -						pkts_burst[j + i + FWDSTEP],
>> -						void *));
>> -			}
>> +	/* The number of packets is small. Prefetch all packets. */
>> +	for (i = 0; i < nb_rx; i++)
>> +		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
>>
>> +	if (k) {
>> +		for (j = 0; j != k; j += FWDSTEP) {
>>  			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
>>  			processx4_step2(qconf, dip, ipv4_flag, portid,
>>  					&pkts_burst[j], &dst_port[j]);
>>  			if (do_step3)
>>  				processx4_step3(&pkts_burst[j], &dst_port[j]);
>>  		}
>> -
>> -		processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
>> -		processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
>> -				&dst_port[j]);
>> -		if (do_step3)
>> -			processx4_step3(&pkts_burst[j], &dst_port[j]);
>> -
>> -		j += FWDSTEP;
>>  	}
>>
>>  	if (m) {
>> -		/* Prefetch last up to 3 packets one by one */
>> -		switch (m) {
>> -		case 3:
>> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
>> -							void *));
>> -			j++;
>> -			/* fallthrough */
>> -		case 2:
>> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
>> -							void *));
>> -			j++;
>> -			/* fallthrough */
>> -		case 1:
>> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
>> -							void *));
>> -			j++;
>> -		}
>> -		j -= m;
>>  		/* Classify last up to 3 packets one by one */
>>  		switch (m) {
>>  		case 3:
>> --
>> 2.33.0
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] examples/l3fwd: add option to set refetch offset
  2024-12-25  7:53 [PATCH] examples/l3fwd: optimize packet prefetch Dengdui Huang
  2024-12-25 21:21 ` Stephen Hemminger
  2025-01-08 13:42 ` Konstantin Ananyev
@ 2025-01-10  9:37 ` Dengdui Huang
  2025-01-10 17:19   ` Stephen Hemminger
                     ` (2 more replies)
  2 siblings, 3 replies; 12+ messages in thread
From: Dengdui Huang @ 2025-01-10  9:37 UTC (permalink / raw)
  To: dev
  Cc: konstantin.ananyev, wathsala.vithanage, stephen, lihuisong,
	fengchengwen, haijie1, liuyonglong

The prefetch window depending on the HW platform. It is difficult to
measure the prefetch window of a HW platform. Therefore, the prefetch
offset option is added to change the prefetch window. User can adjust
the refetch offset to achieve the best prefetch effect.

In addition, this option is used only in the main loop.

Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
---
 examples/l3fwd/l3fwd.h               |  6 ++-
 examples/l3fwd/l3fwd_acl_scalar.h    |  6 +--
 examples/l3fwd/l3fwd_em.h            | 18 ++++-----
 examples/l3fwd/l3fwd_em_hlm.h        |  9 +++--
 examples/l3fwd/l3fwd_em_sequential.h | 60 ++++++++++++++++------------
 examples/l3fwd/l3fwd_fib.c           | 21 +++++-----
 examples/l3fwd/l3fwd_lpm.h           |  6 +--
 examples/l3fwd/l3fwd_lpm_neon.h      | 45 ++++-----------------
 examples/l3fwd/main.c                | 14 +++++++
 9 files changed, 91 insertions(+), 94 deletions(-)

diff --git a/examples/l3fwd/l3fwd.h b/examples/l3fwd/l3fwd.h
index 0cce3406ee..2272fb2870 100644
--- a/examples/l3fwd/l3fwd.h
+++ b/examples/l3fwd/l3fwd.h
@@ -39,8 +39,7 @@
 
 #define NB_SOCKETS        8
 
-/* Configure how many packets ahead to prefetch, when reading packets */
-#define PREFETCH_OFFSET	  3
+#define DEFAULT_PREFECH_OFFSET 4
 
 /* Used to mark destination port as 'invalid'. */
 #define	BAD_PORT ((uint16_t)-1)
@@ -119,6 +118,9 @@ extern uint32_t max_pkt_len;
 extern uint32_t nb_pkt_per_burst;
 extern uint32_t mb_mempool_cache_size;
 
+/* Prefetch offset of packets processed by the main loop. */
+extern uint16_t prefetch_offset;
+
 /* Send burst of packets on an output interface */
 static inline int
 send_burst(struct lcore_conf *qconf, uint16_t n, uint16_t port)
diff --git a/examples/l3fwd/l3fwd_acl_scalar.h b/examples/l3fwd/l3fwd_acl_scalar.h
index cb22bb49aa..d00730ff25 100644
--- a/examples/l3fwd/l3fwd_acl_scalar.h
+++ b/examples/l3fwd/l3fwd_acl_scalar.h
@@ -72,14 +72,14 @@ l3fwd_acl_prepare_acl_parameter(struct rte_mbuf **pkts_in, struct acl_search_t *
 	acl->num_ipv6 = 0;
 
 	/* Prefetch first packets */
-	for (i = 0; i < PREFETCH_OFFSET && i < nb_rx; i++) {
+	for (i = 0; i < prefetch_offset && i < nb_rx; i++) {
 		rte_prefetch0(rte_pktmbuf_mtod(
 				pkts_in[i], void *));
 	}
 
-	for (i = 0; i < (nb_rx - PREFETCH_OFFSET); i++) {
+	for (i = 0; i < (nb_rx - prefetch_offset); i++) {
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_in[
-				i + PREFETCH_OFFSET], void *));
+				i + prefetch_offset], void *));
 		l3fwd_acl_prepare_one_packet(pkts_in, acl, i);
 	}
 
diff --git a/examples/l3fwd/l3fwd_em.h b/examples/l3fwd/l3fwd_em.h
index 1fee2e2e6c..3ef32c9053 100644
--- a/examples/l3fwd/l3fwd_em.h
+++ b/examples/l3fwd/l3fwd_em.h
@@ -132,16 +132,16 @@ l3fwd_em_no_opt_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	int32_t j;
 
 	/* Prefetch first packets */
-	for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
+	for (j = 0; j < prefetch_offset && j < nb_rx; j++)
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
 
 	/*
 	 * Prefetch and forward already prefetched
 	 * packets.
 	 */
-	for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+	for (j = 0; j < (nb_rx - prefetch_offset); j++) {
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
-				j + PREFETCH_OFFSET], void *));
+				j + prefetch_offset], void *));
 		l3fwd_em_simple_forward(pkts_burst[j], portid, qconf);
 	}
 
@@ -161,16 +161,16 @@ l3fwd_em_no_opt_process_events(int nb_rx, struct rte_event **events,
 	int32_t j;
 
 	/* Prefetch first packets */
-	for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
+	for (j = 0; j < prefetch_offset && j < nb_rx; j++)
 		rte_prefetch0(rte_pktmbuf_mtod(events[j]->mbuf, void *));
 
 	/*
 	 * Prefetch and forward already prefetched
 	 * packets.
 	 */
-	for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+	for (j = 0; j < (nb_rx - prefetch_offset); j++) {
 		rte_prefetch0(rte_pktmbuf_mtod(events[
-				j + PREFETCH_OFFSET]->mbuf, void *));
+				j + prefetch_offset]->mbuf, void *));
 		l3fwd_em_simple_process(events[j]->mbuf, qconf);
 	}
 
@@ -188,15 +188,15 @@ l3fwd_em_no_opt_process_event_vector(struct rte_event_vector *vec,
 	int32_t i;
 
 	/* Prefetch first packets */
-	for (i = 0; i < PREFETCH_OFFSET && i < vec->nb_elem; i++)
+	for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
 		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], void *));
 
 	/*
 	 * Prefetch and forward already prefetched packets.
 	 */
-	for (i = 0; i < (vec->nb_elem - PREFETCH_OFFSET); i++) {
+	for (i = 0; i < (vec->nb_elem - prefetch_offset); i++) {
 		rte_prefetch0(
-			rte_pktmbuf_mtod(mbufs[i + PREFETCH_OFFSET], void *));
+			rte_pktmbuf_mtod(mbufs[i + prefetch_offset], void *));
 		dst_ports[i] = l3fwd_em_simple_process(mbufs[i], qconf);
 	}
 
diff --git a/examples/l3fwd/l3fwd_em_hlm.h b/examples/l3fwd/l3fwd_em_hlm.h
index c1d819997a..764527962b 100644
--- a/examples/l3fwd/l3fwd_em_hlm.h
+++ b/examples/l3fwd/l3fwd_em_hlm.h
@@ -190,7 +190,7 @@ l3fwd_em_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	 */
 	int32_t n = RTE_ALIGN_FLOOR(nb_rx, EM_HASH_LOOKUP_COUNT);
 
-	for (j = 0; j < EM_HASH_LOOKUP_COUNT && j < nb_rx; j++) {
+	for (j = 0; j < prefetch_offset && j < nb_rx; j++) {
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
 					       struct rte_ether_hdr *) + 1);
 	}
@@ -207,7 +207,7 @@ l3fwd_em_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 		l3_type = pkt_type & RTE_PTYPE_L3_MASK;
 		tcp_or_udp = pkt_type & (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
 
-		for (i = 0, pos = j + EM_HASH_LOOKUP_COUNT;
+		for (i = 0, pos = j + prefetch_offset;
 		     i < EM_HASH_LOOKUP_COUNT && pos < nb_rx; i++, pos++) {
 			rte_prefetch0(rte_pktmbuf_mtod(
 					pkts_burst[pos],
@@ -277,6 +277,9 @@ l3fwd_em_process_events(int nb_rx, struct rte_event **ev,
 	for (j = 0; j < nb_rx; j++)
 		pkts_burst[j] = ev[j]->mbuf;
 
+	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
+		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], struct rte_ether_hdr *) + 1);
+
 	for (j = 0; j < n; j += EM_HASH_LOOKUP_COUNT) {
 
 		uint32_t pkt_type = RTE_PTYPE_L3_MASK |
@@ -289,7 +292,7 @@ l3fwd_em_process_events(int nb_rx, struct rte_event **ev,
 		l3_type = pkt_type & RTE_PTYPE_L3_MASK;
 		tcp_or_udp = pkt_type & (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
 
-		for (i = 0, pos = j + EM_HASH_LOOKUP_COUNT;
+		for (i = 0, pos = j + prefetch_offset;
 		     i < EM_HASH_LOOKUP_COUNT && pos < nb_rx; i++, pos++) {
 			rte_prefetch0(rte_pktmbuf_mtod(
 					pkts_burst[pos],
diff --git a/examples/l3fwd/l3fwd_em_sequential.h b/examples/l3fwd/l3fwd_em_sequential.h
index 3a40b2e434..f2c6ceb7c0 100644
--- a/examples/l3fwd/l3fwd_em_sequential.h
+++ b/examples/l3fwd/l3fwd_em_sequential.h
@@ -81,20 +81,19 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	int32_t i, j;
 	uint16_t dst_port[SENDM_PORT_OVERHEAD(MAX_PKT_BURST)];
 
-	if (nb_rx > 0) {
-		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[0],
+	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
+		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
 					       struct rte_ether_hdr *) + 1);
-	}
 
-	for (i = 1, j = 0; j < nb_rx; i++, j++) {
-		if (i < nb_rx) {
-			rte_prefetch0(rte_pktmbuf_mtod(
-					pkts_burst[i],
-					struct rte_ether_hdr *) + 1);
-		}
+	for (j = 0; j < nb_rx - prefetch_offset; j++) {
+		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j + prefetch_offset],
+					       struct rte_ether_hdr *) + 1);
 		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
 	}
 
+	for (; j < nb_rx; j++)
+		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
+
 	send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
 }
 
@@ -106,20 +105,26 @@ static inline void
 l3fwd_em_process_events(int nb_rx, struct rte_event **events,
 		     struct lcore_conf *qconf)
 {
+	struct rte_mbuf *mbuf;
+	uint16_t port;
 	int32_t i, j;
 
-	rte_prefetch0(rte_pktmbuf_mtod(events[0]->mbuf,
-		      struct rte_ether_hdr *) + 1);
+	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
+		rte_prefetch0(rte_pktmbuf_mtod(events[i]->mbuf, struct rte_ether_hdr *) + 1);
 
-	for (i = 1, j = 0; j < nb_rx; i++, j++) {
-		struct rte_mbuf *mbuf = events[j]->mbuf;
-		uint16_t port;
+	for (j = 0; j < nb_rx - prefetch_offset; j++) {
+		rte_prefetch0(rte_pktmbuf_mtod(events[j + prefetch_offset]->mbuf,
+					       struct rte_ether_hdr *) + 1);
+		mbuf = events[j]->mbuf;
+		port = mbuf->port;
+		mbuf->port = em_get_dst_port(qconf, mbuf, mbuf->port);
+		process_packet(mbuf, &mbuf->port);
+		if (mbuf->port == BAD_PORT)
+			mbuf->port = port;
+	}
 
-		if (i < nb_rx) {
-			rte_prefetch0(rte_pktmbuf_mtod(
-					events[i]->mbuf,
-					struct rte_ether_hdr *) + 1);
-		}
+	for (; j < nb_rx; j++) {
+		mbuf = events[j]->mbuf;
 		port = mbuf->port;
 		mbuf->port = em_get_dst_port(qconf, mbuf, mbuf->port);
 		process_packet(mbuf, &mbuf->port);
@@ -136,17 +141,22 @@ l3fwd_em_process_event_vector(struct rte_event_vector *vec,
 	struct rte_mbuf **mbufs = vec->mbufs;
 	int32_t i, j;
 
-	rte_prefetch0(rte_pktmbuf_mtod(mbufs[0], struct rte_ether_hdr *) + 1);
+	for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
+		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], struct rte_ether_hdr *) + 1);
 
-	for (i = 0, j = 1; i < vec->nb_elem; i++, j++) {
-		if (j < vec->nb_elem)
-			rte_prefetch0(rte_pktmbuf_mtod(mbufs[j],
-						       struct rte_ether_hdr *) +
-				      1);
+	for (i = 0; i < vec->nb_elem - prefetch_offset; i++) {
+		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + prefetch_offset],
+					       struct rte_ether_hdr *) + 1);
 		dst_ports[i] = em_get_dst_port(qconf, mbufs[i],
 					       attr_valid ? vec->port :
 							    mbufs[i]->port);
 	}
+
+	for (; i < vec->nb_elem; i++)
+		dst_ports[i] = em_get_dst_port(qconf, mbufs[i],
+					       attr_valid ? vec->port :
+							    mbufs[i]->port);
+
 	j = RTE_ALIGN_FLOOR(vec->nb_elem, FWDSTEP);
 
 	for (i = 0; i != j; i += FWDSTEP)
diff --git a/examples/l3fwd/l3fwd_fib.c b/examples/l3fwd/l3fwd_fib.c
index 82f1739df7..25192611c5 100644
--- a/examples/l3fwd/l3fwd_fib.c
+++ b/examples/l3fwd/l3fwd_fib.c
@@ -24,9 +24,6 @@
 #include "l3fwd_event.h"
 #include "l3fwd_route.h"
 
-/* Configure how many packets ahead to prefetch for fib. */
-#define FIB_PREFETCH_OFFSET 4
-
 /* A non-existent portid is needed to denote a default hop for fib. */
 #define FIB_DEFAULT_HOP 999
 
@@ -130,14 +127,14 @@ fib_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	int32_t i;
 
 	/* Prefetch first packets. */
-	for (i = 0; i < FIB_PREFETCH_OFFSET && i < nb_rx; i++)
+	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
 
 	/* Parse packet info and prefetch. */
-	for (i = 0; i < (nb_rx - FIB_PREFETCH_OFFSET); i++) {
+	for (i = 0; i < (nb_rx - prefetch_offset); i++) {
 		/* Prefetch packet. */
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
-				i + FIB_PREFETCH_OFFSET], void *));
+				i + prefetch_offset], void *));
 		fib_parse_packet(pkts_burst[i],
 				&ipv4_arr[ipv4_cnt], &ipv4_cnt,
 				&ipv6_arr[ipv6_cnt], &ipv6_cnt,
@@ -302,11 +299,11 @@ fib_event_loop(struct l3fwd_event_resources *evt_rsrc,
 		ipv6_arr_assem = 0;
 
 		/* Prefetch first packets. */
-		for (i = 0; i < FIB_PREFETCH_OFFSET && i < nb_deq; i++)
+		for (i = 0; i < prefetch_offset && i < nb_deq; i++)
 			rte_prefetch0(rte_pktmbuf_mtod(events[i].mbuf, void *));
 
 		/* Parse packet info and prefetch. */
-		for (i = 0; i < (nb_deq - FIB_PREFETCH_OFFSET); i++) {
+		for (i = 0; i < (nb_deq - prefetch_offset); i++) {
 			if (flags & L3FWD_EVENT_TX_ENQ) {
 				events[i].queue_id = tx_q_id;
 				events[i].op = RTE_EVENT_OP_FORWARD;
@@ -318,7 +315,7 @@ fib_event_loop(struct l3fwd_event_resources *evt_rsrc,
 
 			/* Prefetch packet. */
 			rte_prefetch0(rte_pktmbuf_mtod(events[
-					i + FIB_PREFETCH_OFFSET].mbuf,
+					i + prefetch_offset].mbuf,
 					void *));
 
 			fib_parse_packet(events[i].mbuf,
@@ -455,12 +452,12 @@ fib_process_event_vector(struct rte_event_vector *vec, uint8_t *type_arr,
 	ipv6_arr_assem = 0;
 
 	/* Prefetch first packets. */
-	for (i = 0; i < FIB_PREFETCH_OFFSET && i < vec->nb_elem; i++)
+	for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
 		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], void *));
 
 	/* Parse packet info and prefetch. */
-	for (i = 0; i < (vec->nb_elem - FIB_PREFETCH_OFFSET); i++) {
-		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + FIB_PREFETCH_OFFSET],
+	for (i = 0; i < (vec->nb_elem - prefetch_offset); i++) {
+		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + prefetch_offset],
 					       void *));
 		fib_parse_packet(mbufs[i], &ipv4_arr[ipv4_cnt], &ipv4_cnt,
 				 &ipv6_arr[ipv6_cnt], &ipv6_cnt, &type_arr[i]);
diff --git a/examples/l3fwd/l3fwd_lpm.h b/examples/l3fwd/l3fwd_lpm.h
index 4ee61e8d88..d81aa2efaf 100644
--- a/examples/l3fwd/l3fwd_lpm.h
+++ b/examples/l3fwd/l3fwd_lpm.h
@@ -82,13 +82,13 @@ l3fwd_lpm_no_opt_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	int32_t j;
 
 	/* Prefetch first packets */
-	for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
+	for (j = 0; j < prefetch_offset && j < nb_rx; j++)
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
 
 	/* Prefetch and forward already prefetched packets. */
-	for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+	for (j = 0; j < (nb_rx - prefetch_offset); j++) {
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
-				j + PREFETCH_OFFSET], void *));
+				j + prefetch_offset], void *));
 		l3fwd_lpm_simple_forward(pkts_burst[j], portid, qconf);
 	}
 
diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
index 3c1f827424..5570a11687 100644
--- a/examples/l3fwd/l3fwd_lpm_neon.h
+++ b/examples/l3fwd/l3fwd_lpm_neon.h
@@ -85,23 +85,20 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 			  uint16_t portid, uint16_t *dst_port,
 			  struct lcore_conf *qconf, const uint8_t do_step3)
 {
-	int32_t i = 0, j = 0;
+	int32_t i = 0, j = 0, pos = 0;
 	int32x4_t dip;
 	uint32_t ipv4_flag;
 	const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
 	const int32_t m = nb_rx % FWDSTEP;
 
 	if (k) {
-		for (i = 0; i < FWDSTEP; i++) {
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
-							void *));
-		}
-		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
-			for (i = 0; i < FWDSTEP; i++) {
-				rte_prefetch0(rte_pktmbuf_mtod(
-						pkts_burst[j + i + FWDSTEP],
-						void *));
-			}
+		for (i = 0; i < prefetch_offset && i < k; i++)
+			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
+
+		for (j = 0; j != k; j += FWDSTEP) {
+			for (i = 0, pos = j + prefetch_offset;
+			     i < FWDSTEP && pos < k; i++, pos++)
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos], void *));
 
 			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
 			processx4_step2(qconf, dip, ipv4_flag, portid,
@@ -109,35 +106,9 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 			if (do_step3)
 				processx4_step3(&pkts_burst[j], &dst_port[j]);
 		}
-
-		processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
-		processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
-				&dst_port[j]);
-		if (do_step3)
-			processx4_step3(&pkts_burst[j], &dst_port[j]);
-
-		j += FWDSTEP;
 	}
 
 	if (m) {
-		/* Prefetch last up to 3 packets one by one */
-		switch (m) {
-		case 3:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-			/* fallthrough */
-		case 2:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-			/* fallthrough */
-		case 1:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-		}
-		j -= m;
 		/* Classify last up to 3 packets one by one */
 		switch (m) {
 		case 3:
diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
index 994b7dd8e5..0920c0b2f6 100644
--- a/examples/l3fwd/main.c
+++ b/examples/l3fwd/main.c
@@ -59,6 +59,7 @@ uint16_t nb_rxd = RX_DESC_DEFAULT;
 uint16_t nb_txd = TX_DESC_DEFAULT;
 uint32_t nb_pkt_per_burst = DEFAULT_PKT_BURST;
 uint32_t mb_mempool_cache_size = MEMPOOL_CACHE_SIZE;
+uint16_t prefetch_offset = DEFAULT_PREFECH_OFFSET;
 
 /**< Ports set in promiscuous mode off by default. */
 static int promiscuous_on;
@@ -769,6 +770,7 @@ static const char short_options[] =
 #define CMD_LINE_OPT_ALG "alg"
 #define CMD_LINE_OPT_PKT_BURST "burst"
 #define CMD_LINE_OPT_MB_CACHE_SIZE "mbcache"
+#define CMD_PREFETCH_OFFSET "prefetch-offset"
 
 enum {
 	/* long options mapped to a short option */
@@ -800,6 +802,7 @@ enum {
 	CMD_LINE_OPT_VECTOR_TMO_NS_NUM,
 	CMD_LINE_OPT_PKT_BURST_NUM,
 	CMD_LINE_OPT_MB_CACHE_SIZE_NUM,
+	CMD_PREFETCH_OFFSET_NUM,
 };
 
 static const struct option lgopts[] = {
@@ -828,6 +831,7 @@ static const struct option lgopts[] = {
 	{CMD_LINE_OPT_ALG,   1, 0, CMD_LINE_OPT_ALG_NUM},
 	{CMD_LINE_OPT_PKT_BURST,   1, 0, CMD_LINE_OPT_PKT_BURST_NUM},
 	{CMD_LINE_OPT_MB_CACHE_SIZE,   1, 0, CMD_LINE_OPT_MB_CACHE_SIZE_NUM},
+	{CMD_PREFETCH_OFFSET,   1, 0, CMD_PREFETCH_OFFSET_NUM},
 	{NULL, 0, 0, 0}
 };
 
@@ -1017,6 +1021,9 @@ parse_args(int argc, char **argv)
 		case CMD_LINE_OPT_ALG_NUM:
 			l3fwd_set_alg(optarg);
 			break;
+		case CMD_PREFETCH_OFFSET_NUM:
+			prefetch_offset = strtol(optarg, NULL, 10);
+			break;
 		default:
 			print_usage(prgname);
 			return -1;
@@ -1054,6 +1061,13 @@ parse_args(int argc, char **argv)
 	}
 #endif
 
+	if (prefetch_offset > nb_pkt_per_burst) {
+		fprintf(stderr, "Prefetch offset (%u) cannot be greater than burst size (%u). "
+			"Using burst size %u.\n",
+			prefetch_offset, nb_pkt_per_burst, nb_pkt_per_burst);
+		prefetch_offset = nb_pkt_per_burst;
+	}
+
 	/*
 	 * Nothing is selected, pick longest-prefix match
 	 * as default match.
-- 
2.33.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] examples/l3fwd: add option to set refetch offset
  2025-01-10  9:37 ` [PATCH] examples/l3fwd: add option to set refetch offset Dengdui Huang
@ 2025-01-10 17:19   ` Stephen Hemminger
  2025-01-14  9:23     ` huangdengdui
  2025-01-10 17:20   ` Stephen Hemminger
  2025-02-19 16:44   ` Konstantin Ananyev
  2 siblings, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2025-01-10 17:19 UTC (permalink / raw)
  To: Dengdui Huang
  Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong,
	fengchengwen, haijie1, liuyonglong

On Fri, 10 Jan 2025 17:37:15 +0800
Dengdui Huang <huangdengdui@huawei.com> wrote:

> +#define DEFAULT_PREFECH_OFFSET 4

Spelling

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] examples/l3fwd: add option to set refetch offset
  2025-01-10  9:37 ` [PATCH] examples/l3fwd: add option to set refetch offset Dengdui Huang
  2025-01-10 17:19   ` Stephen Hemminger
@ 2025-01-10 17:20   ` Stephen Hemminger
  2025-01-14  9:22     ` huangdengdui
  2025-02-19 16:44   ` Konstantin Ananyev
  2 siblings, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2025-01-10 17:20 UTC (permalink / raw)
  To: Dengdui Huang
  Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong,
	fengchengwen, haijie1, liuyonglong

On Fri, 10 Jan 2025 17:37:15 +0800
Dengdui Huang <huangdengdui@huawei.com> wrote:

> The prefetch window depending on the HW platform. It is difficult to
> measure the prefetch window of a HW platform. Therefore, the prefetch
> offset option is added to change the prefetch window. User can adjust
> the refetch offset to achieve the best prefetch effect.
> 
> In addition, this option is used only in the main loop.
> 
> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
> ---

This will make it slower for many platforms.
GCC will unroll a loop of fixed small size, which is what we want.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] examples/l3fwd: add option to set refetch offset
  2025-01-10 17:20   ` Stephen Hemminger
@ 2025-01-14  9:22     ` huangdengdui
  2025-01-14 16:07       ` Stephen Hemminger
  0 siblings, 1 reply; 12+ messages in thread
From: huangdengdui @ 2025-01-14  9:22 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong,
	fengchengwen, haijie1, liuyonglong


On 2025/1/11 1:20, Stephen Hemminger wrote:
> This will make it slower for many platforms.
> GCC will unroll a loop of fixed small size, which is what we want.

Do you mean to replace option with a macro?
But most of prefetch_offset are used with the nb_rx, So using macros is the same as using options.

const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
for (j = 0; j != k; j += FWDSTEP) {
	for (i = 0, pos = j + prefetch_offset;
	     i < FWDSTEP && pos < k; i++, pos++)
		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos], void *));
	processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
	processx4_step2(qconf, dip, ipv4_flag, portid,
			&pkts_burst[j], &dst_port[j]);
	if (do_step3)
		processx4_step3(&pkts_burst[j], &dst_port[j]);
}

The option can dynamically adjust the prefetch window, which makes it easier to find the prefetch window for a HW platform.
So I think it's better to use option.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] examples/l3fwd: add option to set refetch offset
  2025-01-10 17:19   ` Stephen Hemminger
@ 2025-01-14  9:23     ` huangdengdui
  0 siblings, 0 replies; 12+ messages in thread
From: huangdengdui @ 2025-01-14  9:23 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong,
	fengchengwen, haijie1, liuyonglong



On 2025/1/11 1:19, Stephen Hemminger wrote:
> On Fri, 10 Jan 2025 17:37:15 +0800
> Dengdui Huang <huangdengdui@huawei.com> wrote:
> 
>> +#define DEFAULT_PREFECH_OFFSET 4
> 
> Spelling

I made a mistake. I'll fix it for the next version.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] examples/l3fwd: add option to set refetch offset
  2025-01-14  9:22     ` huangdengdui
@ 2025-01-14 16:07       ` Stephen Hemminger
  2025-01-14 16:11         ` Stephen Hemminger
  0 siblings, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2025-01-14 16:07 UTC (permalink / raw)
  To: huangdengdui
  Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong,
	fengchengwen, haijie1, liuyonglong

On Tue, 14 Jan 2025 17:22:08 +0800
huangdengdui <huangdengdui@huawei.com> wrote:

> On 2025/1/11 1:20, Stephen Hemminger wrote:
> > This will make it slower for many platforms.
> > GCC will unroll a loop of fixed small size, which is what we want.  
> 
> Do you mean to replace option with a macro?
> But most of prefetch_offset are used with the nb_rx, So using macros is the same as using options.
> 
> const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
> for (j = 0; j != k; j += FWDSTEP) {
> 	for (i = 0, pos = j + prefetch_offset;
> 	     i < FWDSTEP && pos < k; i++, pos++)
> 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos], void *));
> 	processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
> 	processx4_step2(qconf, dip, ipv4_flag, portid,
> 			&pkts_burst[j], &dst_port[j]);
> 	if (do_step3)
> 		processx4_step3(&pkts_burst[j], &dst_port[j]);
> }
> 
> The option can dynamically adjust the prefetch window, which makes it easier to find the prefetch window for a HW platform.
> So I think it's better to use option.

The tradeoff is that loop unrolling most often is only done on small fix sized loops.
And the cost of a loop with variable small values (branch prediction) is high enough that it 
could make things slower.

Prefetching is a balancing act, and more is not better especially on real workloads.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] examples/l3fwd: add option to set refetch offset
  2025-01-14 16:07       ` Stephen Hemminger
@ 2025-01-14 16:11         ` Stephen Hemminger
  0 siblings, 0 replies; 12+ messages in thread
From: Stephen Hemminger @ 2025-01-14 16:11 UTC (permalink / raw)
  To: huangdengdui
  Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong,
	fengchengwen, haijie1, liuyonglong

On Tue, 14 Jan 2025 08:07:52 -0800
Stephen Hemminger <stephen@networkplumber.org> wrote:

> On Tue, 14 Jan 2025 17:22:08 +0800
> huangdengdui <huangdengdui@huawei.com> wrote:
> 
> > On 2025/1/11 1:20, Stephen Hemminger wrote:  
> > > This will make it slower for many platforms.
> > > GCC will unroll a loop of fixed small size, which is what we want.    
> > 
> > Do you mean to replace option with a macro?
> > But most of prefetch_offset are used with the nb_rx, So using macros is the same as using options.
> > 
> > const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
> > for (j = 0; j != k; j += FWDSTEP) {
> > 	for (i = 0, pos = j + prefetch_offset;
> > 	     i < FWDSTEP && pos < k; i++, pos++)
> > 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos], void *));
> > 	processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
> > 	processx4_step2(qconf, dip, ipv4_flag, portid,
> > 			&pkts_burst[j], &dst_port[j]);
> > 	if (do_step3)
> > 		processx4_step3(&pkts_burst[j], &dst_port[j]);
> > }
> > 
> > The option can dynamically adjust the prefetch window, which makes it easier to find the prefetch window for a HW platform.
> > So I think it's better to use option.  
> 
> The tradeoff is that loop unrolling most often is only done on small fix sized loops.
> And the cost of a loop with variable small values (branch prediction) is high enough that it 
> could make things slower.
> 
> Prefetching is a balancing act, and more is not better especially on real workloads.

You might also want to look at the quad loop model used in VPP for prefetching.

https://my-vpp-docs.readthedocs.io/en/latest/gettingstarted/developers/vnet.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] examples/l3fwd: add option to set refetch offset
  2025-01-10  9:37 ` [PATCH] examples/l3fwd: add option to set refetch offset Dengdui Huang
  2025-01-10 17:19   ` Stephen Hemminger
  2025-01-10 17:20   ` Stephen Hemminger
@ 2025-02-19 16:44   ` Konstantin Ananyev
  2 siblings, 0 replies; 12+ messages in thread
From: Konstantin Ananyev @ 2025-02-19 16:44 UTC (permalink / raw)
  To: huangdengdui, dev
  Cc: wathsala.vithanage, stephen, lihuisong (C),
	Fengchengwen, haijie, liuyonglong


> Subject: [PATCH] examples/l3fwd: add option to set refetch offset

I suppose it should be 'prefetch'.
 
> The prefetch window depending on the HW platform. It is difficult to
> measure the prefetch window of a HW platform. Therefore, the prefetch
> offset option is added to change the prefetch window. User can adjust
> the refetch offset to achieve the best prefetch effect.
> 
> In addition, this option is used only in the main loop.

I run few tests for fib,lpm,acl modes on my Intel ICX box.
Didn't notice any performance drop with that patch.

> 
> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
> ---
>  examples/l3fwd/l3fwd.h               |  6 ++-
>  examples/l3fwd/l3fwd_acl_scalar.h    |  6 +--
>  examples/l3fwd/l3fwd_em.h            | 18 ++++-----
>  examples/l3fwd/l3fwd_em_hlm.h        |  9 +++--
>  examples/l3fwd/l3fwd_em_sequential.h | 60 ++++++++++++++++------------
>  examples/l3fwd/l3fwd_fib.c           | 21 +++++-----
>  examples/l3fwd/l3fwd_lpm.h           |  6 +--
>  examples/l3fwd/l3fwd_lpm_neon.h      | 45 ++++-----------------
>  examples/l3fwd/main.c                | 14 +++++++
>  9 files changed, 91 insertions(+), 94 deletions(-)
> 
> diff --git a/examples/l3fwd/l3fwd.h b/examples/l3fwd/l3fwd.h
> index 0cce3406ee..2272fb2870 100644
> --- a/examples/l3fwd/l3fwd.h
> +++ b/examples/l3fwd/l3fwd.h
> @@ -39,8 +39,7 @@
> 
>  #define NB_SOCKETS        8
> 
> -/* Configure how many packets ahead to prefetch, when reading packets */
> -#define PREFETCH_OFFSET	  3
> +#define DEFAULT_PREFECH_OFFSET 4
> 
>  /* Used to mark destination port as 'invalid'. */
>  #define	BAD_PORT ((uint16_t)-1)
> @@ -119,6 +118,9 @@ extern uint32_t max_pkt_len;
>  extern uint32_t nb_pkt_per_burst;
>  extern uint32_t mb_mempool_cache_size;
> 
> +/* Prefetch offset of packets processed by the main loop. */
> +extern uint16_t prefetch_offset;
> +
>  /* Send burst of packets on an output interface */
>  static inline int
>  send_burst(struct lcore_conf *qconf, uint16_t n, uint16_t port)
> diff --git a/examples/l3fwd/l3fwd_acl_scalar.h b/examples/l3fwd/l3fwd_acl_scalar.h
> index cb22bb49aa..d00730ff25 100644
> --- a/examples/l3fwd/l3fwd_acl_scalar.h
> +++ b/examples/l3fwd/l3fwd_acl_scalar.h
> @@ -72,14 +72,14 @@ l3fwd_acl_prepare_acl_parameter(struct rte_mbuf **pkts_in, struct acl_search_t *
>  	acl->num_ipv6 = 0;
> 
>  	/* Prefetch first packets */
> -	for (i = 0; i < PREFETCH_OFFSET && i < nb_rx; i++) {
> +	for (i = 0; i < prefetch_offset && i < nb_rx; i++) {
>  		rte_prefetch0(rte_pktmbuf_mtod(
>  				pkts_in[i], void *));
>  	}
> 
> -	for (i = 0; i < (nb_rx - PREFETCH_OFFSET); i++) {
> +	for (i = 0; i < (nb_rx - prefetch_offset); i++) {
>  		rte_prefetch0(rte_pktmbuf_mtod(pkts_in[
> -				i + PREFETCH_OFFSET], void *));
> +				i + prefetch_offset], void *));
>  		l3fwd_acl_prepare_one_packet(pkts_in, acl, i);
>  	}
> 
> diff --git a/examples/l3fwd/l3fwd_em.h b/examples/l3fwd/l3fwd_em.h
> index 1fee2e2e6c..3ef32c9053 100644
> --- a/examples/l3fwd/l3fwd_em.h
> +++ b/examples/l3fwd/l3fwd_em.h
> @@ -132,16 +132,16 @@ l3fwd_em_no_opt_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>  	int32_t j;
> 
>  	/* Prefetch first packets */
> -	for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
> +	for (j = 0; j < prefetch_offset && j < nb_rx; j++)
>  		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
> 
>  	/*
>  	 * Prefetch and forward already prefetched
>  	 * packets.
>  	 */
> -	for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
> +	for (j = 0; j < (nb_rx - prefetch_offset); j++) {
>  		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
> -				j + PREFETCH_OFFSET], void *));
> +				j + prefetch_offset], void *));
>  		l3fwd_em_simple_forward(pkts_burst[j], portid, qconf);
>  	}
> 
> @@ -161,16 +161,16 @@ l3fwd_em_no_opt_process_events(int nb_rx, struct rte_event **events,
>  	int32_t j;
> 
>  	/* Prefetch first packets */
> -	for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
> +	for (j = 0; j < prefetch_offset && j < nb_rx; j++)
>  		rte_prefetch0(rte_pktmbuf_mtod(events[j]->mbuf, void *));
> 
>  	/*
>  	 * Prefetch and forward already prefetched
>  	 * packets.
>  	 */
> -	for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
> +	for (j = 0; j < (nb_rx - prefetch_offset); j++) {
>  		rte_prefetch0(rte_pktmbuf_mtod(events[
> -				j + PREFETCH_OFFSET]->mbuf, void *));
> +				j + prefetch_offset]->mbuf, void *));
>  		l3fwd_em_simple_process(events[j]->mbuf, qconf);
>  	}
> 
> @@ -188,15 +188,15 @@ l3fwd_em_no_opt_process_event_vector(struct rte_event_vector *vec,
>  	int32_t i;
> 
>  	/* Prefetch first packets */
> -	for (i = 0; i < PREFETCH_OFFSET && i < vec->nb_elem; i++)
> +	for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
>  		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], void *));
> 
>  	/*
>  	 * Prefetch and forward already prefetched packets.
>  	 */
> -	for (i = 0; i < (vec->nb_elem - PREFETCH_OFFSET); i++) {
> +	for (i = 0; i < (vec->nb_elem - prefetch_offset); i++) {
>  		rte_prefetch0(
> -			rte_pktmbuf_mtod(mbufs[i + PREFETCH_OFFSET], void *));
> +			rte_pktmbuf_mtod(mbufs[i + prefetch_offset], void *));
>  		dst_ports[i] = l3fwd_em_simple_process(mbufs[i], qconf);
>  	}
> 
> diff --git a/examples/l3fwd/l3fwd_em_hlm.h b/examples/l3fwd/l3fwd_em_hlm.h
> index c1d819997a..764527962b 100644
> --- a/examples/l3fwd/l3fwd_em_hlm.h
> +++ b/examples/l3fwd/l3fwd_em_hlm.h
> @@ -190,7 +190,7 @@ l3fwd_em_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>  	 */
>  	int32_t n = RTE_ALIGN_FLOOR(nb_rx, EM_HASH_LOOKUP_COUNT);
> 
> -	for (j = 0; j < EM_HASH_LOOKUP_COUNT && j < nb_rx; j++) {
> +	for (j = 0; j < prefetch_offset && j < nb_rx; j++) {
>  		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
>  					       struct rte_ether_hdr *) + 1);
>  	}
> @@ -207,7 +207,7 @@ l3fwd_em_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>  		l3_type = pkt_type & RTE_PTYPE_L3_MASK;
>  		tcp_or_udp = pkt_type & (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
> 
> -		for (i = 0, pos = j + EM_HASH_LOOKUP_COUNT;
> +		for (i = 0, pos = j + prefetch_offset;
>  		     i < EM_HASH_LOOKUP_COUNT && pos < nb_rx; i++, pos++) {
>  			rte_prefetch0(rte_pktmbuf_mtod(
>  					pkts_burst[pos],
> @@ -277,6 +277,9 @@ l3fwd_em_process_events(int nb_rx, struct rte_event **ev,
>  	for (j = 0; j < nb_rx; j++)
>  		pkts_burst[j] = ev[j]->mbuf;
> 
> +	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
> +		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], struct rte_ether_hdr *) + 1);
> +

If there are no prefetches right now, probably no need to add new.
Or if you feel strongly about it - make it as a new patch.

>  	for (j = 0; j < n; j += EM_HASH_LOOKUP_COUNT) {
> 
>  		uint32_t pkt_type = RTE_PTYPE_L3_MASK |
> @@ -289,7 +292,7 @@ l3fwd_em_process_events(int nb_rx, struct rte_event **ev,
>  		l3_type = pkt_type & RTE_PTYPE_L3_MASK;
>  		tcp_or_udp = pkt_type & (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
> 
> -		for (i = 0, pos = j + EM_HASH_LOOKUP_COUNT;
> +		for (i = 0, pos = j + prefetch_offset;
>  		     i < EM_HASH_LOOKUP_COUNT && pos < nb_rx; i++, pos++) {
>  			rte_prefetch0(rte_pktmbuf_mtod(
>  					pkts_burst[pos],
> diff --git a/examples/l3fwd/l3fwd_em_sequential.h b/examples/l3fwd/l3fwd_em_sequential.h
> index 3a40b2e434..f2c6ceb7c0 100644
> --- a/examples/l3fwd/l3fwd_em_sequential.h
> +++ b/examples/l3fwd/l3fwd_em_sequential.h
> @@ -81,20 +81,19 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>  	int32_t i, j;
>  	uint16_t dst_port[SENDM_PORT_OVERHEAD(MAX_PKT_BURST)];
> 
> -	if (nb_rx > 0) {
> -		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[0],
> +	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
> +		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
>  					       struct rte_ether_hdr *) + 1);
> -	}
> 
> -	for (i = 1, j = 0; j < nb_rx; i++, j++) {
> -		if (i < nb_rx) {
> -			rte_prefetch0(rte_pktmbuf_mtod(
> -					pkts_burst[i],
> -					struct rte_ether_hdr *) + 1);
> -		}
> +	for (j = 0; j < nb_rx - prefetch_offset; j++) {
> +		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j + prefetch_offset],
> +					       struct rte_ether_hdr *) + 1);
>  		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
>  	}
> 
> +	for (; j < nb_rx; j++)
> +		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
> +
>  	send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
>  }
> 
> @@ -106,20 +105,26 @@ static inline void
>  l3fwd_em_process_events(int nb_rx, struct rte_event **events,
>  		     struct lcore_conf *qconf)
>  {
> +	struct rte_mbuf *mbuf;
> +	uint16_t port;
>  	int32_t i, j;
> 
> -	rte_prefetch0(rte_pktmbuf_mtod(events[0]->mbuf,
> -		      struct rte_ether_hdr *) + 1);
> +	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
> +		rte_prefetch0(rte_pktmbuf_mtod(events[i]->mbuf, struct rte_ether_hdr *) + 1);
> 
> -	for (i = 1, j = 0; j < nb_rx; i++, j++) {
> -		struct rte_mbuf *mbuf = events[j]->mbuf;
> -		uint16_t port;
> +	for (j = 0; j < nb_rx - prefetch_offset; j++) {
> +		rte_prefetch0(rte_pktmbuf_mtod(events[j + prefetch_offset]->mbuf,
> +					       struct rte_ether_hdr *) + 1);
> +		mbuf = events[j]->mbuf;
> +		port = mbuf->port;
> +		mbuf->port = em_get_dst_port(qconf, mbuf, mbuf->port);
> +		process_packet(mbuf, &mbuf->port);
> +		if (mbuf->port == BAD_PORT)
> +			mbuf->port = port;
> +	}
> 
> -		if (i < nb_rx) {
> -			rte_prefetch0(rte_pktmbuf_mtod(
> -					events[i]->mbuf,
> -					struct rte_ether_hdr *) + 1);
> -		}
> +	for (; j < nb_rx; j++) {
> +		mbuf = events[j]->mbuf;
>  		port = mbuf->port;
>  		mbuf->port = em_get_dst_port(qconf, mbuf, mbuf->port);
>  		process_packet(mbuf, &mbuf->port);
> @@ -136,17 +141,22 @@ l3fwd_em_process_event_vector(struct rte_event_vector *vec,
>  	struct rte_mbuf **mbufs = vec->mbufs;
>  	int32_t i, j;
> 
> -	rte_prefetch0(rte_pktmbuf_mtod(mbufs[0], struct rte_ether_hdr *) + 1);
> +	for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
> +		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], struct rte_ether_hdr *) + 1);
> 
> -	for (i = 0, j = 1; i < vec->nb_elem; i++, j++) {
> -		if (j < vec->nb_elem)
> -			rte_prefetch0(rte_pktmbuf_mtod(mbufs[j],
> -						       struct rte_ether_hdr *) +
> -				      1);
> +	for (i = 0; i < vec->nb_elem - prefetch_offset; i++) {
> +		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + prefetch_offset],
> +					       struct rte_ether_hdr *) + 1);
>  		dst_ports[i] = em_get_dst_port(qconf, mbufs[i],
>  					       attr_valid ? vec->port :
>  							    mbufs[i]->port);
>  	}
> +
> +	for (; i < vec->nb_elem; i++)
> +		dst_ports[i] = em_get_dst_port(qconf, mbufs[i],
> +					       attr_valid ? vec->port :
> +							    mbufs[i]->port);
> +
>  	j = RTE_ALIGN_FLOOR(vec->nb_elem, FWDSTEP);
> 
>  	for (i = 0; i != j; i += FWDSTEP)
> diff --git a/examples/l3fwd/l3fwd_fib.c b/examples/l3fwd/l3fwd_fib.c
> index 82f1739df7..25192611c5 100644
> --- a/examples/l3fwd/l3fwd_fib.c
> +++ b/examples/l3fwd/l3fwd_fib.c
> @@ -24,9 +24,6 @@
>  #include "l3fwd_event.h"
>  #include "l3fwd_route.h"
> 
> -/* Configure how many packets ahead to prefetch for fib. */
> -#define FIB_PREFETCH_OFFSET 4
> -
>  /* A non-existent portid is needed to denote a default hop for fib. */
>  #define FIB_DEFAULT_HOP 999
> 
> @@ -130,14 +127,14 @@ fib_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>  	int32_t i;
> 
>  	/* Prefetch first packets. */
> -	for (i = 0; i < FIB_PREFETCH_OFFSET && i < nb_rx; i++)
> +	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
>  		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
> 
>  	/* Parse packet info and prefetch. */
> -	for (i = 0; i < (nb_rx - FIB_PREFETCH_OFFSET); i++) {
> +	for (i = 0; i < (nb_rx - prefetch_offset); i++) {
>  		/* Prefetch packet. */
>  		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
> -				i + FIB_PREFETCH_OFFSET], void *));
> +				i + prefetch_offset], void *));
>  		fib_parse_packet(pkts_burst[i],
>  				&ipv4_arr[ipv4_cnt], &ipv4_cnt,
>  				&ipv6_arr[ipv6_cnt], &ipv6_cnt,
> @@ -302,11 +299,11 @@ fib_event_loop(struct l3fwd_event_resources *evt_rsrc,
>  		ipv6_arr_assem = 0;
> 
>  		/* Prefetch first packets. */
> -		for (i = 0; i < FIB_PREFETCH_OFFSET && i < nb_deq; i++)
> +		for (i = 0; i < prefetch_offset && i < nb_deq; i++)
>  			rte_prefetch0(rte_pktmbuf_mtod(events[i].mbuf, void *));
> 
>  		/* Parse packet info and prefetch. */
> -		for (i = 0; i < (nb_deq - FIB_PREFETCH_OFFSET); i++) {
> +		for (i = 0; i < (nb_deq - prefetch_offset); i++) {
>  			if (flags & L3FWD_EVENT_TX_ENQ) {
>  				events[i].queue_id = tx_q_id;
>  				events[i].op = RTE_EVENT_OP_FORWARD;
> @@ -318,7 +315,7 @@ fib_event_loop(struct l3fwd_event_resources *evt_rsrc,
> 
>  			/* Prefetch packet. */
>  			rte_prefetch0(rte_pktmbuf_mtod(events[
> -					i + FIB_PREFETCH_OFFSET].mbuf,
> +					i + prefetch_offset].mbuf,
>  					void *));
> 
>  			fib_parse_packet(events[i].mbuf,
> @@ -455,12 +452,12 @@ fib_process_event_vector(struct rte_event_vector *vec, uint8_t *type_arr,
>  	ipv6_arr_assem = 0;
> 
>  	/* Prefetch first packets. */
> -	for (i = 0; i < FIB_PREFETCH_OFFSET && i < vec->nb_elem; i++)
> +	for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
>  		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], void *));
> 
>  	/* Parse packet info and prefetch. */
> -	for (i = 0; i < (vec->nb_elem - FIB_PREFETCH_OFFSET); i++) {
> -		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + FIB_PREFETCH_OFFSET],
> +	for (i = 0; i < (vec->nb_elem - prefetch_offset); i++) {
> +		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + prefetch_offset],
>  					       void *));
>  		fib_parse_packet(mbufs[i], &ipv4_arr[ipv4_cnt], &ipv4_cnt,
>  				 &ipv6_arr[ipv6_cnt], &ipv6_cnt, &type_arr[i]);
> diff --git a/examples/l3fwd/l3fwd_lpm.h b/examples/l3fwd/l3fwd_lpm.h
> index 4ee61e8d88..d81aa2efaf 100644
> --- a/examples/l3fwd/l3fwd_lpm.h
> +++ b/examples/l3fwd/l3fwd_lpm.h
> @@ -82,13 +82,13 @@ l3fwd_lpm_no_opt_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>  	int32_t j;
> 
>  	/* Prefetch first packets */
> -	for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
> +	for (j = 0; j < prefetch_offset && j < nb_rx; j++)
>  		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
> 
>  	/* Prefetch and forward already prefetched packets. */
> -	for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
> +	for (j = 0; j < (nb_rx - prefetch_offset); j++) {
>  		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
> -				j + PREFETCH_OFFSET], void *));
> +				j + prefetch_offset], void *));
>  		l3fwd_lpm_simple_forward(pkts_burst[j], portid, qconf);
>  	}
> 
> diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
> index 3c1f827424..5570a11687 100644
> --- a/examples/l3fwd/l3fwd_lpm_neon.h
> +++ b/examples/l3fwd/l3fwd_lpm_neon.h
> @@ -85,23 +85,20 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>  			  uint16_t portid, uint16_t *dst_port,
>  			  struct lcore_conf *qconf, const uint8_t do_step3)
>  {
> -	int32_t i = 0, j = 0;
> +	int32_t i = 0, j = 0, pos = 0;
>  	int32x4_t dip;
>  	uint32_t ipv4_flag;
>  	const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
>  	const int32_t m = nb_rx % FWDSTEP;
> 
>  	if (k) {
> -		for (i = 0; i < FWDSTEP; i++) {
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
> -							void *));
> -		}
> -		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
> -			for (i = 0; i < FWDSTEP; i++) {
> -				rte_prefetch0(rte_pktmbuf_mtod(
> -						pkts_burst[j + i + FWDSTEP],
> -						void *));
> -			}
> +		for (i = 0; i < prefetch_offset && i < k; i++)
> +			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
> +
> +		for (j = 0; j != k; j += FWDSTEP) {
> +			for (i = 0, pos = j + prefetch_offset;
> +			     i < FWDSTEP && pos < k; i++, pos++)
> +				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos], void *));
> 
>  			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
>  			processx4_step2(qconf, dip, ipv4_flag, portid,
> @@ -109,35 +106,9 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>  			if (do_step3)
>  				processx4_step3(&pkts_burst[j], &dst_port[j]);
>  		}
> -
> -		processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
> -		processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
> -				&dst_port[j]);
> -		if (do_step3)
> -			processx4_step3(&pkts_burst[j], &dst_port[j]);
> -
> -		j += FWDSTEP;
>  	}
> 
>  	if (m) {
> -		/* Prefetch last up to 3 packets one by one */
> -		switch (m) {
> -		case 3:
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> -							void *));
> -			j++;
> -			/* fallthrough */
> -		case 2:
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> -							void *));
> -			j++;
> -			/* fallthrough */
> -		case 1:
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> -							void *));
> -			j++;
> -		}
> -		j -= m;
>  		/* Classify last up to 3 packets one by one */
>  		switch (m) {
>  		case 3:
> diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
> index 994b7dd8e5..0920c0b2f6 100644
> --- a/examples/l3fwd/main.c
> +++ b/examples/l3fwd/main.c
> @@ -59,6 +59,7 @@ uint16_t nb_rxd = RX_DESC_DEFAULT;
>  uint16_t nb_txd = TX_DESC_DEFAULT;
>  uint32_t nb_pkt_per_burst = DEFAULT_PKT_BURST;
>  uint32_t mb_mempool_cache_size = MEMPOOL_CACHE_SIZE;
> +uint16_t prefetch_offset = DEFAULT_PREFECH_OFFSET;
> 
>  /**< Ports set in promiscuous mode off by default. */
>  static int promiscuous_on;
> @@ -769,6 +770,7 @@ static const char short_options[] =
>  #define CMD_LINE_OPT_ALG "alg"
>  #define CMD_LINE_OPT_PKT_BURST "burst"
>  #define CMD_LINE_OPT_MB_CACHE_SIZE "mbcache"
> +#define CMD_PREFETCH_OFFSET "prefetch-offset"
> 
>  enum {
>  	/* long options mapped to a short option */
> @@ -800,6 +802,7 @@ enum {
>  	CMD_LINE_OPT_VECTOR_TMO_NS_NUM,
>  	CMD_LINE_OPT_PKT_BURST_NUM,
>  	CMD_LINE_OPT_MB_CACHE_SIZE_NUM,
> +	CMD_PREFETCH_OFFSET_NUM,
>  };
> 
>  static const struct option lgopts[] = {
> @@ -828,6 +831,7 @@ static const struct option lgopts[] = {
>  	{CMD_LINE_OPT_ALG,   1, 0, CMD_LINE_OPT_ALG_NUM},
>  	{CMD_LINE_OPT_PKT_BURST,   1, 0, CMD_LINE_OPT_PKT_BURST_NUM},
>  	{CMD_LINE_OPT_MB_CACHE_SIZE,   1, 0, CMD_LINE_OPT_MB_CACHE_SIZE_NUM},
> +	{CMD_PREFETCH_OFFSET,   1, 0, CMD_PREFETCH_OFFSET_NUM},
>  	{NULL, 0, 0, 0}
>  };
> 
> @@ -1017,6 +1021,9 @@ parse_args(int argc, char **argv)
>  		case CMD_LINE_OPT_ALG_NUM:
>  			l3fwd_set_alg(optarg);
>  			break;
> +		case CMD_PREFETCH_OFFSET_NUM:
> +			prefetch_offset = strtol(optarg, NULL, 10);

Hmm... might be something like parse_max_pkt_len() is doing, to be more robust?
In fact, probably can re-use the same function, might be just name it in a more generic way.
 

> +			break;
>  		default:
>  			print_usage(prgname);
>  			return -1;
> @@ -1054,6 +1061,13 @@ parse_args(int argc, char **argv)
>  	}
>  #endif
> 
> +	if (prefetch_offset > nb_pkt_per_burst) {
> +		fprintf(stderr, "Prefetch offset (%u) cannot be greater than burst size (%u). "
> +			"Using burst size %u.\n",
> +			prefetch_offset, nb_pkt_per_burst, nb_pkt_per_burst);
> +		prefetch_offset = nb_pkt_per_burst;

Might be just print err message and terminate gracefully? 

> +	}
> +
>  	/*
>  	 * Nothing is selected, pick longest-prefix match
>  	 * as default match.
> --
> 2.33.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-02-19 16:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-25  7:53 [PATCH] examples/l3fwd: optimize packet prefetch Dengdui Huang
2024-12-25 21:21 ` Stephen Hemminger
2025-01-08 13:42 ` Konstantin Ananyev
2025-01-09 11:31   ` huangdengdui
2025-01-10  9:37 ` [PATCH] examples/l3fwd: add option to set refetch offset Dengdui Huang
2025-01-10 17:19   ` Stephen Hemminger
2025-01-14  9:23     ` huangdengdui
2025-01-10 17:20   ` Stephen Hemminger
2025-01-14  9:22     ` huangdengdui
2025-01-14 16:07       ` Stephen Hemminger
2025-01-14 16:11         ` Stephen Hemminger
2025-02-19 16:44   ` Konstantin Ananyev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).