* [PATCH] examples/l3fwd: optimize packet prefetch
@ 2024-12-25 7:53 Dengdui Huang
2024-12-25 21:21 ` Stephen Hemminger
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Dengdui Huang @ 2024-12-25 7:53 UTC (permalink / raw)
To: dev
Cc: wathsala.vithanage, stephen, liuyonglong, fengchengwen, haijie1,
lihuisong
The prefetch window depends on the hardware platform, so the current
prefetch policy may not suit every platform. In most cases the number of
packets returned by an Rx burst is small (64 is used in most performance
reports), and in l3fwd it can never exceed 512. Therefore, prefetching
all packets before processing them can achieve better performance.
Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
---
examples/l3fwd/l3fwd_lpm_neon.h | 42 ++++-----------------------------
1 file changed, 5 insertions(+), 37 deletions(-)
diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
index 3c1f827424..0b51782b8c 100644
--- a/examples/l3fwd/l3fwd_lpm_neon.h
+++ b/examples/l3fwd/l3fwd_lpm_neon.h
@@ -91,53 +91,21 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
const int32_t m = nb_rx % FWDSTEP;
- if (k) {
- for (i = 0; i < FWDSTEP; i++) {
- rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
- void *));
- }
- for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
- for (i = 0; i < FWDSTEP; i++) {
- rte_prefetch0(rte_pktmbuf_mtod(
- pkts_burst[j + i + FWDSTEP],
- void *));
- }
+ /* The number of packets is small. Prefetch all packets. */
+ for (i = 0; i < nb_rx; i++)
+ rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
+ if (k) {
+ for (j = 0; j != k; j += FWDSTEP) {
processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
processx4_step2(qconf, dip, ipv4_flag, portid,
&pkts_burst[j], &dst_port[j]);
if (do_step3)
processx4_step3(&pkts_burst[j], &dst_port[j]);
}
-
- processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
- processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
- &dst_port[j]);
- if (do_step3)
- processx4_step3(&pkts_burst[j], &dst_port[j]);
-
- j += FWDSTEP;
}
if (m) {
- /* Prefetch last up to 3 packets one by one */
- switch (m) {
- case 3:
- rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
- void *));
- j++;
- /* fallthrough */
- case 2:
- rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
- void *));
- j++;
- /* fallthrough */
- case 1:
- rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
- void *));
- j++;
- }
- j -= m;
/* Classify last up to 3 packets one by one */
switch (m) {
case 3:
--
2.33.0
* Re: [PATCH] examples/l3fwd: optimize packet prefetch
2024-12-25 7:53 [PATCH] examples/l3fwd: optimize packet prefetch Dengdui Huang
@ 2024-12-25 21:21 ` Stephen Hemminger
2025-01-08 13:42 ` Konstantin Ananyev
2025-01-10 9:37 ` [PATCH] examples/l3fwd: add option to set prefetch offset Dengdui Huang
2 siblings, 0 replies; 5+ messages in thread
From: Stephen Hemminger @ 2024-12-25 21:21 UTC (permalink / raw)
To: Dengdui Huang
Cc: dev, wathsala.vithanage, liuyonglong, fengchengwen, haijie1, lihuisong
On Wed, 25 Dec 2024 15:53:02 +0800
Dengdui Huang <huangdengdui@huawei.com> wrote:
> From: Dengdui Huang <huangdengdui@huawei.com>
> To: <dev@dpdk.org>
> CC: <wathsala.vithanage@arm.com>, <stephen@networkplumber.org>, <liuyonglong@huawei.com>, <fengchengwen@huawei.com>, <haijie1@huawei.com>, <lihuisong@huawei.com>
> Subject: [PATCH] examples/l3fwd: optimize packet prefetch
> Date: Wed, 25 Dec 2024 15:53:02 +0800
> X-Mailer: git-send-email 2.33.0
>
> The prefetch window depends on the hardware platform, so the current
> prefetch policy may not suit every platform. In most cases the number of
> packets returned by an Rx burst is small (64 is used in most performance
> reports), and in l3fwd it can never exceed 512. Therefore, prefetching
> all packets before processing them can achieve better performance.
>
> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
> ---
I think VPP had a good description of how to unroll and deal with prefetch.
With larger burst sizes you don't want to prefetch the whole burst.
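For reference, the general shape of that software-pipelined approach (a
rough sketch only, not taken from VPP or from this patch; process_one()
is a placeholder for the real per-packet work and PREFETCH_WINDOW is an
illustrative value):

#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define PREFETCH_WINDOW 8	/* illustrative value only */

static void process_one(struct rte_mbuf *m);	/* placeholder */

static inline void
pipelined_burst(struct rte_mbuf **pkts, int nb_rx)
{
	int i;

	/* Prime a fixed-size window of prefetches. */
	for (i = 0; i < PREFETCH_WINDOW && i < nb_rx; i++)
		rte_prefetch0(rte_pktmbuf_mtod(pkts[i], void *));

	/* Issue one new prefetch per packet processed, so at most
	 * PREFETCH_WINDOW prefetches are in flight regardless of
	 * the burst size.
	 */
	for (i = 0; i < nb_rx - PREFETCH_WINDOW; i++) {
		rte_prefetch0(rte_pktmbuf_mtod(pkts[i + PREFETCH_WINDOW], void *));
		process_one(pkts[i]);
	}

	/* Tail packets have already been prefetched. */
	for (; i < nb_rx; i++)
		process_one(pkts[i]);
}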
* RE: [PATCH] examples/l3fwd: optimize packet prefetch
2024-12-25 7:53 [PATCH] examples/l3fwd: optimize packet prefetch Dengdui Huang
2024-12-25 21:21 ` Stephen Hemminger
@ 2025-01-08 13:42 ` Konstantin Ananyev
2025-01-09 11:31 ` huangdengdui
2025-01-10 9:37 ` [PATCH] examples/l3fwd: add option to set prefetch offset Dengdui Huang
2 siblings, 1 reply; 5+ messages in thread
From: Konstantin Ananyev @ 2025-01-08 13:42 UTC (permalink / raw)
To: huangdengdui, dev
Cc: wathsala.vithanage, stephen, liuyonglong, Fengchengwen, haijie,
lihuisong (C)
>
> The prefetch window depends on the hardware platform, so the current
> prefetch policy may not suit every platform. In most cases the number of
> packets returned by an Rx burst is small (64 is used in most performance
> reports), and in l3fwd it can never exceed 512. Therefore, prefetching
> all packets before processing them can achieve better performance.
As you mentioned, 'prefetch' behavior differs a lot from one HW platform
to another, so the changes you are suggesting could easily bring a
performance boost on one platform and a degradation on another.
In fact, right now l3fwd 'prefetch' usage is a bit of a mess:
- l3fwd_lpm_neon.h uses FWDSTEP as a prefetch window.
- l3fwd_fib.c uses FIB_PREFETCH_OFFSET for that purpose.
- The rest of the code uses either PREFETCH_OFFSET or no 'prefetch' at all.
Probably what we need here is a unified approach: a prefetch_window_size,
configurable at run time, that all code paths obey.
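Roughly along these lines, perhaps (just a sketch; prefetch_window_size
and l3fwd_prefetch_ahead() are invented names for illustration, not
existing l3fwd code):

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>

uint16_t prefetch_window_size = 4;	/* set from the command line */

static inline void
l3fwd_prefetch_ahead(struct rte_mbuf **pkts, int idx, int nb_rx)
{
	/* Prefetch the packet prefetch_window_size slots ahead of idx,
	 * if there is one; every code path would call this instead of
	 * hard-coding FWDSTEP, FIB_PREFETCH_OFFSET or PREFETCH_OFFSET.
	 */
	if (idx + prefetch_window_size < nb_rx)
		rte_prefetch0(rte_pktmbuf_mtod(pkts[idx + prefetch_window_size],
			void *));
}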
> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
> ---
> examples/l3fwd/l3fwd_lpm_neon.h | 42 ++++-----------------------------
> 1 file changed, 5 insertions(+), 37 deletions(-)
>
> diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
> index 3c1f827424..0b51782b8c 100644
> --- a/examples/l3fwd/l3fwd_lpm_neon.h
> +++ b/examples/l3fwd/l3fwd_lpm_neon.h
> @@ -91,53 +91,21 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
> const int32_t m = nb_rx % FWDSTEP;
>
> - if (k) {
> - for (i = 0; i < FWDSTEP; i++) {
> - rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
> - void *));
> - }
> - for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
> - for (i = 0; i < FWDSTEP; i++) {
> - rte_prefetch0(rte_pktmbuf_mtod(
> - pkts_burst[j + i + FWDSTEP],
> - void *));
> - }
> + /* The number of packets is small. Prefetch all packets. */
> + for (i = 0; i < nb_rx; i++)
> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
>
> + if (k) {
> + for (j = 0; j != k; j += FWDSTEP) {
> processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
> processx4_step2(qconf, dip, ipv4_flag, portid,
> &pkts_burst[j], &dst_port[j]);
> if (do_step3)
> processx4_step3(&pkts_burst[j], &dst_port[j]);
> }
> -
> - processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
> - processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
> - &dst_port[j]);
> - if (do_step3)
> - processx4_step3(&pkts_burst[j], &dst_port[j]);
> -
> - j += FWDSTEP;
> }
>
> if (m) {
> - /* Prefetch last up to 3 packets one by one */
> - switch (m) {
> - case 3:
> - rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> - void *));
> - j++;
> - /* fallthrough */
> - case 2:
> - rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> - void *));
> - j++;
> - /* fallthrough */
> - case 1:
> - rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> - void *));
> - j++;
> - }
> - j -= m;
> /* Classify last up to 3 packets one by one */
> switch (m) {
> case 3:
> --
> 2.33.0
* Re: [PATCH] examples/l3fwd: optimize packet prefetch
2025-01-08 13:42 ` Konstantin Ananyev
@ 2025-01-09 11:31 ` huangdengdui
0 siblings, 0 replies; 5+ messages in thread
From: huangdengdui @ 2025-01-09 11:31 UTC (permalink / raw)
To: Konstantin Ananyev, dev
Cc: wathsala.vithanage, stephen, liuyonglong, Fengchengwen, haijie,
lihuisong (C)
On 2025/1/8 21:42, Konstantin Ananyev wrote:
>
>
>>
>> The prefetch window depends on the hardware platform, so the current
>> prefetch policy may not suit every platform. In most cases the number of
>> packets returned by an Rx burst is small (64 is used in most performance
>> reports), and in l3fwd it can never exceed 512. Therefore, prefetching
>> all packets before processing them can achieve better performance.
>
> As you mentioned, 'prefetch' behavior differs a lot from one HW platform
> to another, so the changes you are suggesting could easily bring a
> performance boost on one platform and a degradation on another.
> In fact, right now l3fwd 'prefetch' usage is a bit of a mess:
> - l3fwd_lpm_neon.h uses FWDSTEP as a prefetch window.
> - l3fwd_fib.c uses FIB_PREFETCH_OFFSET for that purpose.
> - The rest of the code uses either PREFETCH_OFFSET or no 'prefetch' at all.
>
> Probably what we need here is a unified approach: a prefetch_window_size,
> configurable at run time, that all code paths obey.
Agreed, I'll add a parameter to configure the prefetch window.
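Roughly what I have in mind for the parsing side (a sketch only; the
final option name and bounds may differ):

#include <stdint.h>
#include <stdlib.h>

uint16_t prefetch_offset = 4;	/* default prefetch window */

static void
parse_prefetch_offset(const char *arg, uint32_t nb_pkt_per_burst)
{
	unsigned long v = strtoul(arg, NULL, 10);

	/* No point in prefetching past the end of the burst. */
	if (v > nb_pkt_per_burst)
		v = nb_pkt_per_burst;
	prefetch_offset = (uint16_t)v;
}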
>
>> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
>> ---
>> examples/l3fwd/l3fwd_lpm_neon.h | 42 ++++-----------------------------
>> 1 file changed, 5 insertions(+), 37 deletions(-)
>>
>> diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
>> index 3c1f827424..0b51782b8c 100644
>> --- a/examples/l3fwd/l3fwd_lpm_neon.h
>> +++ b/examples/l3fwd/l3fwd_lpm_neon.h
>> @@ -91,53 +91,21 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>> const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
>> const int32_t m = nb_rx % FWDSTEP;
>>
>> - if (k) {
>> - for (i = 0; i < FWDSTEP; i++) {
>> - rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
>> - void *));
>> - }
>> - for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
>> - for (i = 0; i < FWDSTEP; i++) {
>> - rte_prefetch0(rte_pktmbuf_mtod(
>> - pkts_burst[j + i + FWDSTEP],
>> - void *));
>> - }
>> + /* The number of packets is small. Prefetch all packets. */
>> + for (i = 0; i < nb_rx; i++)
>> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
>>
>> + if (k) {
>> + for (j = 0; j != k; j += FWDSTEP) {
>> processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
>> processx4_step2(qconf, dip, ipv4_flag, portid,
>> &pkts_burst[j], &dst_port[j]);
>> if (do_step3)
>> processx4_step3(&pkts_burst[j], &dst_port[j]);
>> }
>> -
>> - processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
>> - processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
>> - &dst_port[j]);
>> - if (do_step3)
>> - processx4_step3(&pkts_burst[j], &dst_port[j]);
>> -
>> - j += FWDSTEP;
>> }
>>
>> if (m) {
>> - /* Prefetch last up to 3 packets one by one */
>> - switch (m) {
>> - case 3:
>> - rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
>> - void *));
>> - j++;
>> - /* fallthrough */
>> - case 2:
>> - rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
>> - void *));
>> - j++;
>> - /* fallthrough */
>> - case 1:
>> - rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
>> - void *));
>> - j++;
>> - }
>> - j -= m;
>> /* Classify last up to 3 packets one by one */
>> switch (m) {
>> case 3:
>> --
>> 2.33.0
>
* [PATCH] examples/l3fwd: add option to set prefetch offset
2024-12-25 7:53 [PATCH] examples/l3fwd: optimize packet prefetch Dengdui Huang
2024-12-25 21:21 ` Stephen Hemminger
2025-01-08 13:42 ` Konstantin Ananyev
@ 2025-01-10 9:37 ` Dengdui Huang
2 siblings, 0 replies; 5+ messages in thread
From: Dengdui Huang @ 2025-01-10 9:37 UTC (permalink / raw)
To: dev
Cc: konstantin.ananyev, wathsala.vithanage, stephen, lihuisong,
fengchengwen, haijie1, liuyonglong
The prefetch window depends on the HW platform, and it is difficult to
measure it for a given platform. Therefore, a prefetch offset option is
added so that the prefetch window can be changed. Users can adjust the
prefetch offset to achieve the best prefetch effect.
Note that this option is only used in the main loop.
Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
---
examples/l3fwd/l3fwd.h | 6 ++-
examples/l3fwd/l3fwd_acl_scalar.h | 6 +--
examples/l3fwd/l3fwd_em.h | 18 ++++-----
examples/l3fwd/l3fwd_em_hlm.h | 9 +++--
examples/l3fwd/l3fwd_em_sequential.h | 60 ++++++++++++++++------------
examples/l3fwd/l3fwd_fib.c | 21 +++++-----
examples/l3fwd/l3fwd_lpm.h | 6 +--
examples/l3fwd/l3fwd_lpm_neon.h | 45 ++++-----------------
examples/l3fwd/main.c | 14 +++++++
9 files changed, 91 insertions(+), 94 deletions(-)
diff --git a/examples/l3fwd/l3fwd.h b/examples/l3fwd/l3fwd.h
index 0cce3406ee..2272fb2870 100644
--- a/examples/l3fwd/l3fwd.h
+++ b/examples/l3fwd/l3fwd.h
@@ -39,8 +39,7 @@
#define NB_SOCKETS 8
-/* Configure how many packets ahead to prefetch, when reading packets */
-#define PREFETCH_OFFSET 3
+#define DEFAULT_PREFECH_OFFSET 4
/* Used to mark destination port as 'invalid'. */
#define BAD_PORT ((uint16_t)-1)
@@ -119,6 +118,9 @@ extern uint32_t max_pkt_len;
extern uint32_t nb_pkt_per_burst;
extern uint32_t mb_mempool_cache_size;
+/* Prefetch offset of packets processed by the main loop. */
+extern uint16_t prefetch_offset;
+
/* Send burst of packets on an output interface */
static inline int
send_burst(struct lcore_conf *qconf, uint16_t n, uint16_t port)
diff --git a/examples/l3fwd/l3fwd_acl_scalar.h b/examples/l3fwd/l3fwd_acl_scalar.h
index cb22bb49aa..d00730ff25 100644
--- a/examples/l3fwd/l3fwd_acl_scalar.h
+++ b/examples/l3fwd/l3fwd_acl_scalar.h
@@ -72,14 +72,14 @@ l3fwd_acl_prepare_acl_parameter(struct rte_mbuf **pkts_in, struct acl_search_t *
acl->num_ipv6 = 0;
/* Prefetch first packets */
- for (i = 0; i < PREFETCH_OFFSET && i < nb_rx; i++) {
+ for (i = 0; i < prefetch_offset && i < nb_rx; i++) {
rte_prefetch0(rte_pktmbuf_mtod(
pkts_in[i], void *));
}
- for (i = 0; i < (nb_rx - PREFETCH_OFFSET); i++) {
+ for (i = 0; i < (nb_rx - prefetch_offset); i++) {
rte_prefetch0(rte_pktmbuf_mtod(pkts_in[
- i + PREFETCH_OFFSET], void *));
+ i + prefetch_offset], void *));
l3fwd_acl_prepare_one_packet(pkts_in, acl, i);
}
diff --git a/examples/l3fwd/l3fwd_em.h b/examples/l3fwd/l3fwd_em.h
index 1fee2e2e6c..3ef32c9053 100644
--- a/examples/l3fwd/l3fwd_em.h
+++ b/examples/l3fwd/l3fwd_em.h
@@ -132,16 +132,16 @@ l3fwd_em_no_opt_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
int32_t j;
/* Prefetch first packets */
- for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
+ for (j = 0; j < prefetch_offset && j < nb_rx; j++)
rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
/*
* Prefetch and forward already prefetched
* packets.
*/
- for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+ for (j = 0; j < (nb_rx - prefetch_offset); j++) {
rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
- j + PREFETCH_OFFSET], void *));
+ j + prefetch_offset], void *));
l3fwd_em_simple_forward(pkts_burst[j], portid, qconf);
}
@@ -161,16 +161,16 @@ l3fwd_em_no_opt_process_events(int nb_rx, struct rte_event **events,
int32_t j;
/* Prefetch first packets */
- for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
+ for (j = 0; j < prefetch_offset && j < nb_rx; j++)
rte_prefetch0(rte_pktmbuf_mtod(events[j]->mbuf, void *));
/*
* Prefetch and forward already prefetched
* packets.
*/
- for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+ for (j = 0; j < (nb_rx - prefetch_offset); j++) {
rte_prefetch0(rte_pktmbuf_mtod(events[
- j + PREFETCH_OFFSET]->mbuf, void *));
+ j + prefetch_offset]->mbuf, void *));
l3fwd_em_simple_process(events[j]->mbuf, qconf);
}
@@ -188,15 +188,15 @@ l3fwd_em_no_opt_process_event_vector(struct rte_event_vector *vec,
int32_t i;
/* Prefetch first packets */
- for (i = 0; i < PREFETCH_OFFSET && i < vec->nb_elem; i++)
+ for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], void *));
/*
* Prefetch and forward already prefetched packets.
*/
- for (i = 0; i < (vec->nb_elem - PREFETCH_OFFSET); i++) {
+ for (i = 0; i < (vec->nb_elem - prefetch_offset); i++) {
rte_prefetch0(
- rte_pktmbuf_mtod(mbufs[i + PREFETCH_OFFSET], void *));
+ rte_pktmbuf_mtod(mbufs[i + prefetch_offset], void *));
dst_ports[i] = l3fwd_em_simple_process(mbufs[i], qconf);
}
diff --git a/examples/l3fwd/l3fwd_em_hlm.h b/examples/l3fwd/l3fwd_em_hlm.h
index c1d819997a..764527962b 100644
--- a/examples/l3fwd/l3fwd_em_hlm.h
+++ b/examples/l3fwd/l3fwd_em_hlm.h
@@ -190,7 +190,7 @@ l3fwd_em_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
*/
int32_t n = RTE_ALIGN_FLOOR(nb_rx, EM_HASH_LOOKUP_COUNT);
- for (j = 0; j < EM_HASH_LOOKUP_COUNT && j < nb_rx; j++) {
+ for (j = 0; j < prefetch_offset && j < nb_rx; j++) {
rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
struct rte_ether_hdr *) + 1);
}
@@ -207,7 +207,7 @@ l3fwd_em_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
l3_type = pkt_type & RTE_PTYPE_L3_MASK;
tcp_or_udp = pkt_type & (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
- for (i = 0, pos = j + EM_HASH_LOOKUP_COUNT;
+ for (i = 0, pos = j + prefetch_offset;
i < EM_HASH_LOOKUP_COUNT && pos < nb_rx; i++, pos++) {
rte_prefetch0(rte_pktmbuf_mtod(
pkts_burst[pos],
@@ -277,6 +277,9 @@ l3fwd_em_process_events(int nb_rx, struct rte_event **ev,
for (j = 0; j < nb_rx; j++)
pkts_burst[j] = ev[j]->mbuf;
+ for (i = 0; i < prefetch_offset && i < nb_rx; i++)
+ rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], struct rte_ether_hdr *) + 1);
+
for (j = 0; j < n; j += EM_HASH_LOOKUP_COUNT) {
uint32_t pkt_type = RTE_PTYPE_L3_MASK |
@@ -289,7 +292,7 @@ l3fwd_em_process_events(int nb_rx, struct rte_event **ev,
l3_type = pkt_type & RTE_PTYPE_L3_MASK;
tcp_or_udp = pkt_type & (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
- for (i = 0, pos = j + EM_HASH_LOOKUP_COUNT;
+ for (i = 0, pos = j + prefetch_offset;
i < EM_HASH_LOOKUP_COUNT && pos < nb_rx; i++, pos++) {
rte_prefetch0(rte_pktmbuf_mtod(
pkts_burst[pos],
diff --git a/examples/l3fwd/l3fwd_em_sequential.h b/examples/l3fwd/l3fwd_em_sequential.h
index 3a40b2e434..f2c6ceb7c0 100644
--- a/examples/l3fwd/l3fwd_em_sequential.h
+++ b/examples/l3fwd/l3fwd_em_sequential.h
@@ -81,20 +81,19 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
int32_t i, j;
uint16_t dst_port[SENDM_PORT_OVERHEAD(MAX_PKT_BURST)];
- if (nb_rx > 0) {
- rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[0],
+ for (i = 0; i < prefetch_offset && i < nb_rx; i++)
+ rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
struct rte_ether_hdr *) + 1);
- }
- for (i = 1, j = 0; j < nb_rx; i++, j++) {
- if (i < nb_rx) {
- rte_prefetch0(rte_pktmbuf_mtod(
- pkts_burst[i],
- struct rte_ether_hdr *) + 1);
- }
+ for (j = 0; j < nb_rx - prefetch_offset; j++) {
+ rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j + prefetch_offset],
+ struct rte_ether_hdr *) + 1);
dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
}
+ for (; j < nb_rx; j++)
+ dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
+
send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
}
@@ -106,20 +105,26 @@ static inline void
l3fwd_em_process_events(int nb_rx, struct rte_event **events,
struct lcore_conf *qconf)
{
+ struct rte_mbuf *mbuf;
+ uint16_t port;
int32_t i, j;
- rte_prefetch0(rte_pktmbuf_mtod(events[0]->mbuf,
- struct rte_ether_hdr *) + 1);
+ for (i = 0; i < prefetch_offset && i < nb_rx; i++)
+ rte_prefetch0(rte_pktmbuf_mtod(events[i]->mbuf, struct rte_ether_hdr *) + 1);
- for (i = 1, j = 0; j < nb_rx; i++, j++) {
- struct rte_mbuf *mbuf = events[j]->mbuf;
- uint16_t port;
+ for (j = 0; j < nb_rx - prefetch_offset; j++) {
+ rte_prefetch0(rte_pktmbuf_mtod(events[j + prefetch_offset]->mbuf,
+ struct rte_ether_hdr *) + 1);
+ mbuf = events[j]->mbuf;
+ port = mbuf->port;
+ mbuf->port = em_get_dst_port(qconf, mbuf, mbuf->port);
+ process_packet(mbuf, &mbuf->port);
+ if (mbuf->port == BAD_PORT)
+ mbuf->port = port;
+ }
- if (i < nb_rx) {
- rte_prefetch0(rte_pktmbuf_mtod(
- events[i]->mbuf,
- struct rte_ether_hdr *) + 1);
- }
+ for (; j < nb_rx; j++) {
+ mbuf = events[j]->mbuf;
port = mbuf->port;
mbuf->port = em_get_dst_port(qconf, mbuf, mbuf->port);
process_packet(mbuf, &mbuf->port);
@@ -136,17 +141,22 @@ l3fwd_em_process_event_vector(struct rte_event_vector *vec,
struct rte_mbuf **mbufs = vec->mbufs;
int32_t i, j;
- rte_prefetch0(rte_pktmbuf_mtod(mbufs[0], struct rte_ether_hdr *) + 1);
+ for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
+ rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], struct rte_ether_hdr *) + 1);
- for (i = 0, j = 1; i < vec->nb_elem; i++, j++) {
- if (j < vec->nb_elem)
- rte_prefetch0(rte_pktmbuf_mtod(mbufs[j],
- struct rte_ether_hdr *) +
- 1);
+ for (i = 0; i < vec->nb_elem - prefetch_offset; i++) {
+ rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + prefetch_offset],
+ struct rte_ether_hdr *) + 1);
dst_ports[i] = em_get_dst_port(qconf, mbufs[i],
attr_valid ? vec->port :
mbufs[i]->port);
}
+
+ for (; i < vec->nb_elem; i++)
+ dst_ports[i] = em_get_dst_port(qconf, mbufs[i],
+ attr_valid ? vec->port :
+ mbufs[i]->port);
+
j = RTE_ALIGN_FLOOR(vec->nb_elem, FWDSTEP);
for (i = 0; i != j; i += FWDSTEP)
diff --git a/examples/l3fwd/l3fwd_fib.c b/examples/l3fwd/l3fwd_fib.c
index 82f1739df7..25192611c5 100644
--- a/examples/l3fwd/l3fwd_fib.c
+++ b/examples/l3fwd/l3fwd_fib.c
@@ -24,9 +24,6 @@
#include "l3fwd_event.h"
#include "l3fwd_route.h"
-/* Configure how many packets ahead to prefetch for fib. */
-#define FIB_PREFETCH_OFFSET 4
-
/* A non-existent portid is needed to denote a default hop for fib. */
#define FIB_DEFAULT_HOP 999
@@ -130,14 +127,14 @@ fib_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
int32_t i;
/* Prefetch first packets. */
- for (i = 0; i < FIB_PREFETCH_OFFSET && i < nb_rx; i++)
+ for (i = 0; i < prefetch_offset && i < nb_rx; i++)
rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
/* Parse packet info and prefetch. */
- for (i = 0; i < (nb_rx - FIB_PREFETCH_OFFSET); i++) {
+ for (i = 0; i < (nb_rx - prefetch_offset); i++) {
/* Prefetch packet. */
rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
- i + FIB_PREFETCH_OFFSET], void *));
+ i + prefetch_offset], void *));
fib_parse_packet(pkts_burst[i],
&ipv4_arr[ipv4_cnt], &ipv4_cnt,
&ipv6_arr[ipv6_cnt], &ipv6_cnt,
@@ -302,11 +299,11 @@ fib_event_loop(struct l3fwd_event_resources *evt_rsrc,
ipv6_arr_assem = 0;
/* Prefetch first packets. */
- for (i = 0; i < FIB_PREFETCH_OFFSET && i < nb_deq; i++)
+ for (i = 0; i < prefetch_offset && i < nb_deq; i++)
rte_prefetch0(rte_pktmbuf_mtod(events[i].mbuf, void *));
/* Parse packet info and prefetch. */
- for (i = 0; i < (nb_deq - FIB_PREFETCH_OFFSET); i++) {
+ for (i = 0; i < (nb_deq - prefetch_offset); i++) {
if (flags & L3FWD_EVENT_TX_ENQ) {
events[i].queue_id = tx_q_id;
events[i].op = RTE_EVENT_OP_FORWARD;
@@ -318,7 +315,7 @@ fib_event_loop(struct l3fwd_event_resources *evt_rsrc,
/* Prefetch packet. */
rte_prefetch0(rte_pktmbuf_mtod(events[
- i + FIB_PREFETCH_OFFSET].mbuf,
+ i + prefetch_offset].mbuf,
void *));
fib_parse_packet(events[i].mbuf,
@@ -455,12 +452,12 @@ fib_process_event_vector(struct rte_event_vector *vec, uint8_t *type_arr,
ipv6_arr_assem = 0;
/* Prefetch first packets. */
- for (i = 0; i < FIB_PREFETCH_OFFSET && i < vec->nb_elem; i++)
+ for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], void *));
/* Parse packet info and prefetch. */
- for (i = 0; i < (vec->nb_elem - FIB_PREFETCH_OFFSET); i++) {
- rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + FIB_PREFETCH_OFFSET],
+ for (i = 0; i < (vec->nb_elem - prefetch_offset); i++) {
+ rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + prefetch_offset],
void *));
fib_parse_packet(mbufs[i], &ipv4_arr[ipv4_cnt], &ipv4_cnt,
&ipv6_arr[ipv6_cnt], &ipv6_cnt, &type_arr[i]);
diff --git a/examples/l3fwd/l3fwd_lpm.h b/examples/l3fwd/l3fwd_lpm.h
index 4ee61e8d88..d81aa2efaf 100644
--- a/examples/l3fwd/l3fwd_lpm.h
+++ b/examples/l3fwd/l3fwd_lpm.h
@@ -82,13 +82,13 @@ l3fwd_lpm_no_opt_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
int32_t j;
/* Prefetch first packets */
- for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
+ for (j = 0; j < prefetch_offset && j < nb_rx; j++)
rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
/* Prefetch and forward already prefetched packets. */
- for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+ for (j = 0; j < (nb_rx - prefetch_offset); j++) {
rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
- j + PREFETCH_OFFSET], void *));
+ j + prefetch_offset], void *));
l3fwd_lpm_simple_forward(pkts_burst[j], portid, qconf);
}
diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
index 3c1f827424..5570a11687 100644
--- a/examples/l3fwd/l3fwd_lpm_neon.h
+++ b/examples/l3fwd/l3fwd_lpm_neon.h
@@ -85,23 +85,20 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
uint16_t portid, uint16_t *dst_port,
struct lcore_conf *qconf, const uint8_t do_step3)
{
- int32_t i = 0, j = 0;
+ int32_t i = 0, j = 0, pos = 0;
int32x4_t dip;
uint32_t ipv4_flag;
const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
const int32_t m = nb_rx % FWDSTEP;
if (k) {
- for (i = 0; i < FWDSTEP; i++) {
- rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
- void *));
- }
- for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
- for (i = 0; i < FWDSTEP; i++) {
- rte_prefetch0(rte_pktmbuf_mtod(
- pkts_burst[j + i + FWDSTEP],
- void *));
- }
+ for (i = 0; i < prefetch_offset && i < k; i++)
+ rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
+
+ for (j = 0; j != k; j += FWDSTEP) {
+ for (i = 0, pos = j + prefetch_offset;
+ i < FWDSTEP && pos < k; i++, pos++)
+ rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos], void *));
processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
processx4_step2(qconf, dip, ipv4_flag, portid,
@@ -109,35 +106,9 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
if (do_step3)
processx4_step3(&pkts_burst[j], &dst_port[j]);
}
-
- processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
- processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
- &dst_port[j]);
- if (do_step3)
- processx4_step3(&pkts_burst[j], &dst_port[j]);
-
- j += FWDSTEP;
}
if (m) {
- /* Prefetch last up to 3 packets one by one */
- switch (m) {
- case 3:
- rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
- void *));
- j++;
- /* fallthrough */
- case 2:
- rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
- void *));
- j++;
- /* fallthrough */
- case 1:
- rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
- void *));
- j++;
- }
- j -= m;
/* Classify last up to 3 packets one by one */
switch (m) {
case 3:
diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
index 994b7dd8e5..0920c0b2f6 100644
--- a/examples/l3fwd/main.c
+++ b/examples/l3fwd/main.c
@@ -59,6 +59,7 @@ uint16_t nb_rxd = RX_DESC_DEFAULT;
uint16_t nb_txd = TX_DESC_DEFAULT;
uint32_t nb_pkt_per_burst = DEFAULT_PKT_BURST;
uint32_t mb_mempool_cache_size = MEMPOOL_CACHE_SIZE;
+uint16_t prefetch_offset = DEFAULT_PREFECH_OFFSET;
/**< Ports set in promiscuous mode off by default. */
static int promiscuous_on;
@@ -769,6 +770,7 @@ static const char short_options[] =
#define CMD_LINE_OPT_ALG "alg"
#define CMD_LINE_OPT_PKT_BURST "burst"
#define CMD_LINE_OPT_MB_CACHE_SIZE "mbcache"
+#define CMD_PREFETCH_OFFSET "prefetch-offset"
enum {
/* long options mapped to a short option */
@@ -800,6 +802,7 @@ enum {
CMD_LINE_OPT_VECTOR_TMO_NS_NUM,
CMD_LINE_OPT_PKT_BURST_NUM,
CMD_LINE_OPT_MB_CACHE_SIZE_NUM,
+ CMD_PREFETCH_OFFSET_NUM,
};
static const struct option lgopts[] = {
@@ -828,6 +831,7 @@ static const struct option lgopts[] = {
{CMD_LINE_OPT_ALG, 1, 0, CMD_LINE_OPT_ALG_NUM},
{CMD_LINE_OPT_PKT_BURST, 1, 0, CMD_LINE_OPT_PKT_BURST_NUM},
{CMD_LINE_OPT_MB_CACHE_SIZE, 1, 0, CMD_LINE_OPT_MB_CACHE_SIZE_NUM},
+ {CMD_PREFETCH_OFFSET, 1, 0, CMD_PREFETCH_OFFSET_NUM},
{NULL, 0, 0, 0}
};
@@ -1017,6 +1021,9 @@ parse_args(int argc, char **argv)
case CMD_LINE_OPT_ALG_NUM:
l3fwd_set_alg(optarg);
break;
+ case CMD_PREFETCH_OFFSET_NUM:
+ prefetch_offset = strtol(optarg, NULL, 10);
+ break;
default:
print_usage(prgname);
return -1;
@@ -1054,6 +1061,13 @@ parse_args(int argc, char **argv)
}
#endif
+ if (prefetch_offset > nb_pkt_per_burst) {
+ fprintf(stderr, "Prefetch offset (%u) cannot be greater than burst size (%u). "
+ "Using burst size %u.\n",
+ prefetch_offset, nb_pkt_per_burst, nb_pkt_per_burst);
+ prefetch_offset = nb_pkt_per_burst;
+ }
+
/*
* Nothing is selected, pick longest-prefix match
* as default match.
--
2.33.0