* [PATCH] examples/l3fwd: optimize packet prefetch
From: Dengdui Huang @ 2024-12-25 7:53 UTC
To: dev
Cc: wathsala.vithanage, stephen, liuyonglong, fengchengwen, haijie1, lihuisong

The prefetch window depends on the hardware platform. The current prefetch
policy may not be applicable to all platforms. In most cases, the number of
packets received by Rx burst is small (64 is used in most performance reports).
In L3fwd, the maximum value cannot exceed 512. Therefore, prefetching all
packets before processing can achieve better performance.

Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
---
 examples/l3fwd/l3fwd_lpm_neon.h | 42 ++++-----------------------------
 1 file changed, 5 insertions(+), 37 deletions(-)

diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
index 3c1f827424..0b51782b8c 100644
--- a/examples/l3fwd/l3fwd_lpm_neon.h
+++ b/examples/l3fwd/l3fwd_lpm_neon.h
@@ -91,53 +91,21 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
 	const int32_t m = nb_rx % FWDSTEP;
 
-	if (k) {
-		for (i = 0; i < FWDSTEP; i++) {
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
-							void *));
-		}
-		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
-			for (i = 0; i < FWDSTEP; i++) {
-				rte_prefetch0(rte_pktmbuf_mtod(
-						pkts_burst[j + i + FWDSTEP],
-						void *));
-			}
+	/* The number of packets is small. Prefetch all packets. */
+	for (i = 0; i < nb_rx; i++)
+		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
 
+	if (k) {
+		for (j = 0; j != k; j += FWDSTEP) {
 			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
 			processx4_step2(qconf, dip, ipv4_flag, portid,
 					&pkts_burst[j], &dst_port[j]);
 			if (do_step3)
 				processx4_step3(&pkts_burst[j], &dst_port[j]);
 		}
-
-		processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
-		processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
-				&dst_port[j]);
-		if (do_step3)
-			processx4_step3(&pkts_burst[j], &dst_port[j]);
-
-		j += FWDSTEP;
 	}
 
 	if (m) {
-		/* Prefetch last up to 3 packets one by one */
-		switch (m) {
-		case 3:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-			/* fallthrough */
-		case 2:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-			/* fallthrough */
-		case 1:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-		}
-		j -= m;
 		/* Classify last up to 3 packets one by one */
 		switch (m) {
 		case 3:
-- 
2.33.0

* Re: [PATCH] examples/l3fwd: optimize packet prefetch
From: Stephen Hemminger @ 2024-12-25 21:21 UTC
To: Dengdui Huang
Cc: dev, wathsala.vithanage, liuyonglong, fengchengwen, haijie1, lihuisong

On Wed, 25 Dec 2024 15:53:02 +0800
Dengdui Huang <huangdengdui@huawei.com> wrote:

> From: Dengdui Huang <huangdengdui@huawei.com>
> To: <dev@dpdk.org>
> CC: <wathsala.vithanage@arm.com>, <stephen@networkplumber.org>, <liuyonglong@huawei.com>, <fengchengwen@huawei.com>, <haijie1@huawei.com>, <lihuisong@huawei.com>
> Subject: [PATCH] examples/l3fwd: optimize packet prefetch
> Date: Wed, 25 Dec 2024 15:53:02 +0800
> X-Mailer: git-send-email 2.33.0
>
> The prefetch window depends on the hardware platform. The current prefetch
> policy may not be applicable to all platforms. In most cases, the number of
> packets received by Rx burst is small (64 is used in most performance reports).
> In L3fwd, the maximum value cannot exceed 512. Therefore, prefetching all
> packets before processing can achieve better performance.
>
> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
> ---

I think VPP had a good description of how to unroll and deal with prefetch.
With larger burst sizes you don't want to prefetch the whole burst.

* RE: [PATCH] examples/l3fwd: optimize packet prefetch
From: Konstantin Ananyev @ 2025-01-08 13:42 UTC
To: huangdengdui, dev
Cc: wathsala.vithanage, stephen, liuyonglong, Fengchengwen, haijie, lihuisong (C)

> The prefetch window depends on the hardware platform. The current prefetch
> policy may not be applicable to all platforms. In most cases, the number of
> packets received by Rx burst is small (64 is used in most performance reports).
> In L3fwd, the maximum value cannot exceed 512. Therefore, prefetching all
> packets before processing can achieve better performance.

As you mentioned, 'prefetch' behavior differs a lot from one HW platform to another.
So it could easily be that the changes you are suggesting will cause a performance
boost on one platform and a degradation on another.
In fact, right now l3fwd 'prefetch' usage is a bit of a mess:
- l3fwd_lpm_neon.h uses FWDSTEP as a prefetch window.
- l3fwd_fib.c uses FIB_PREFETCH_OFFSET for that purpose
- rest of the code uses either PREFETCH_OFFSET or doesn't use 'prefetch' at all

Probably what we need here is some unified approach:
a prefetch_window_size, configurable at run-time, that all code-paths will obey.

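As a rough illustration of the unified approach suggested above, a single run-time window can drive both the warm-up prefetch and the steady-state prefetch. The sketch below is not l3fwd code: the function name `sum_first_bytes`, its parameters, and the use of the GCC/Clang builtin `__builtin_prefetch` in place of `rte_prefetch0` are assumptions made to keep the example self-contained, and reading the first byte of each packet merely stands in for the real per-packet work.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: walk nb packets while keeping `window` packets
 * prefetched ahead of the one being processed. A window of 0 disables
 * prefetching; a window >= nb degenerates to "prefetch everything first".
 * The "work" here is reading the first byte of each packet.
 */
static uint64_t
sum_first_bytes(const uint8_t *const *pkts, size_t nb, size_t window)
{
	uint64_t acc = 0;
	size_t i;

	/* Warm-up: prefetch the first `window` packets (or fewer). */
	for (i = 0; i < window && i < nb; i++)
		__builtin_prefetch(pkts[i], 0, 3);

	for (i = 0; i < nb; i++) {
		/* Steady state: stay `window` ahead, never past the end. */
		if (i + window < nb)
			__builtin_prefetch(pkts[i + window], 0, 3);
		acc += pkts[i][0];
	}
	return acc;
}
```

Folding the bounds check into a single loop avoids the separate tail loop that the offset-based code paths need, at the cost of one extra compare per packet; whether that trade is worthwhile would have to be measured per platform.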
* Re: [PATCH] examples/l3fwd: optimize packet prefetch
From: huangdengdui @ 2025-01-09 11:31 UTC
To: Konstantin Ananyev, dev
Cc: wathsala.vithanage, stephen, liuyonglong, Fengchengwen, haijie, lihuisong (C)

On 2025/1/8 21:42, Konstantin Ananyev wrote:
>> The prefetch window depends on the hardware platform. The current prefetch
>> policy may not be applicable to all platforms. In most cases, the number of
>> packets received by Rx burst is small (64 is used in most performance reports).
>> In L3fwd, the maximum value cannot exceed 512. Therefore, prefetching all
>> packets before processing can achieve better performance.
>
> As you mentioned, 'prefetch' behavior differs a lot from one HW platform to another.
> So it could easily be that the changes you are suggesting will cause a performance
> boost on one platform and a degradation on another.
> In fact, right now l3fwd 'prefetch' usage is a bit of a mess:
> - l3fwd_lpm_neon.h uses FWDSTEP as a prefetch window.
> - l3fwd_fib.c uses FIB_PREFETCH_OFFSET for that purpose
> - rest of the code uses either PREFETCH_OFFSET or doesn't use 'prefetch' at all
>
> Probably what we need here is some unified approach:
> a prefetch_window_size, configurable at run-time, that all code-paths will obey.

Agreed, I'll add a parameter to configure the prefetch window.

* [PATCH] examples/l3fwd: add option to set refetch offset
From: Dengdui Huang @ 2025-01-10 9:37 UTC
To: dev
Cc: konstantin.ananyev, wathsala.vithanage, stephen, lihuisong, fengchengwen, haijie1, liuyonglong

The prefetch window depends on the HW platform. It is difficult to
measure the prefetch window of a HW platform. Therefore, a prefetch
offset option is added to change the prefetch window. Users can adjust
the prefetch offset to achieve the best prefetch effect.

In addition, this option is used only in the main loop.

Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
---
 examples/l3fwd/l3fwd.h               |  6 ++-
 examples/l3fwd/l3fwd_acl_scalar.h    |  6 +--
 examples/l3fwd/l3fwd_em.h            | 18 ++++-----
 examples/l3fwd/l3fwd_em_hlm.h        |  9 +++--
 examples/l3fwd/l3fwd_em_sequential.h | 60 ++++++++++++++++------------
 examples/l3fwd/l3fwd_fib.c           | 21 +++++-----
 examples/l3fwd/l3fwd_lpm.h           |  6 +--
 examples/l3fwd/l3fwd_lpm_neon.h      | 45 ++++-----------------
 examples/l3fwd/main.c                | 14 +++++++
 9 files changed, 91 insertions(+), 94 deletions(-)

diff --git a/examples/l3fwd/l3fwd.h b/examples/l3fwd/l3fwd.h
index 0cce3406ee..2272fb2870 100644
--- a/examples/l3fwd/l3fwd.h
+++ b/examples/l3fwd/l3fwd.h
@@ -39,8 +39,7 @@
 
 #define NB_SOCKETS 8
 
-/* Configure how many packets ahead to prefetch, when reading packets */
-#define PREFETCH_OFFSET 3
+#define DEFAULT_PREFECH_OFFSET 4
 
 /* Used to mark destination port as 'invalid'. */
 #define BAD_PORT ((uint16_t)-1)
@@ -119,6 +118,9 @@ extern uint32_t max_pkt_len;
 extern uint32_t nb_pkt_per_burst;
 extern uint32_t mb_mempool_cache_size;
 
+/* Prefetch offset of packets processed by the main loop. */
+extern uint16_t prefetch_offset;
+
 /* Send burst of packets on an output interface */
 static inline int
 send_burst(struct lcore_conf *qconf, uint16_t n, uint16_t port)
diff --git a/examples/l3fwd/l3fwd_acl_scalar.h b/examples/l3fwd/l3fwd_acl_scalar.h
index cb22bb49aa..d00730ff25 100644
--- a/examples/l3fwd/l3fwd_acl_scalar.h
+++ b/examples/l3fwd/l3fwd_acl_scalar.h
@@ -72,14 +72,14 @@ l3fwd_acl_prepare_acl_parameter(struct rte_mbuf **pkts_in, struct acl_search_t *
 	acl->num_ipv6 = 0;
 
 	/* Prefetch first packets */
-	for (i = 0; i < PREFETCH_OFFSET && i < nb_rx; i++) {
+	for (i = 0; i < prefetch_offset && i < nb_rx; i++) {
 		rte_prefetch0(rte_pktmbuf_mtod(
 				pkts_in[i], void *));
 	}
 
-	for (i = 0; i < (nb_rx - PREFETCH_OFFSET); i++) {
+	for (i = 0; i < (nb_rx - prefetch_offset); i++) {
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_in[
-				i + PREFETCH_OFFSET], void *));
+				i + prefetch_offset], void *));
 		l3fwd_acl_prepare_one_packet(pkts_in, acl, i);
 	}
 
diff --git a/examples/l3fwd/l3fwd_em.h b/examples/l3fwd/l3fwd_em.h
index 1fee2e2e6c..3ef32c9053 100644
--- a/examples/l3fwd/l3fwd_em.h
+++ b/examples/l3fwd/l3fwd_em.h
@@ -132,16 +132,16 @@ l3fwd_em_no_opt_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	int32_t j;
 
 	/* Prefetch first packets */
-	for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
+	for (j = 0; j < prefetch_offset && j < nb_rx; j++)
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
 
 	/*
 	 * Prefetch and forward already prefetched
 	 * packets.
 	 */
-	for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+	for (j = 0; j < (nb_rx - prefetch_offset); j++) {
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
-				j + PREFETCH_OFFSET], void *));
+				j + prefetch_offset], void *));
 		l3fwd_em_simple_forward(pkts_burst[j], portid, qconf);
 	}
 
@@ -161,16 +161,16 @@ l3fwd_em_no_opt_process_events(int nb_rx, struct rte_event **events,
 	int32_t j;
 
 	/* Prefetch first packets */
-	for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
+	for (j = 0; j < prefetch_offset && j < nb_rx; j++)
 		rte_prefetch0(rte_pktmbuf_mtod(events[j]->mbuf, void *));
 
 	/*
 	 * Prefetch and forward already prefetched
 	 * packets.
 	 */
-	for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+	for (j = 0; j < (nb_rx - prefetch_offset); j++) {
 		rte_prefetch0(rte_pktmbuf_mtod(events[
-				j + PREFETCH_OFFSET]->mbuf, void *));
+				j + prefetch_offset]->mbuf, void *));
 		l3fwd_em_simple_process(events[j]->mbuf, qconf);
 	}
 
@@ -188,15 +188,15 @@ l3fwd_em_no_opt_process_event_vector(struct rte_event_vector *vec,
 	int32_t i;
 
 	/* Prefetch first packets */
-	for (i = 0; i < PREFETCH_OFFSET && i < vec->nb_elem; i++)
+	for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
 		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], void *));
 
 	/*
 	 * Prefetch and forward already prefetched packets.
 	 */
-	for (i = 0; i < (vec->nb_elem - PREFETCH_OFFSET); i++) {
+	for (i = 0; i < (vec->nb_elem - prefetch_offset); i++) {
 		rte_prefetch0(
-			rte_pktmbuf_mtod(mbufs[i + PREFETCH_OFFSET], void *));
+			rte_pktmbuf_mtod(mbufs[i + prefetch_offset], void *));
 		dst_ports[i] = l3fwd_em_simple_process(mbufs[i], qconf);
 	}
 
diff --git a/examples/l3fwd/l3fwd_em_hlm.h b/examples/l3fwd/l3fwd_em_hlm.h
index c1d819997a..764527962b 100644
--- a/examples/l3fwd/l3fwd_em_hlm.h
+++ b/examples/l3fwd/l3fwd_em_hlm.h
@@ -190,7 +190,7 @@ l3fwd_em_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	 */
 	int32_t n = RTE_ALIGN_FLOOR(nb_rx, EM_HASH_LOOKUP_COUNT);
 
-	for (j = 0; j < EM_HASH_LOOKUP_COUNT && j < nb_rx; j++) {
+	for (j = 0; j < prefetch_offset && j < nb_rx; j++) {
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
 					       struct rte_ether_hdr *) + 1);
 	}
@@ -207,7 +207,7 @@ l3fwd_em_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 		l3_type = pkt_type & RTE_PTYPE_L3_MASK;
 		tcp_or_udp = pkt_type & (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
 
-		for (i = 0, pos = j + EM_HASH_LOOKUP_COUNT;
+		for (i = 0, pos = j + prefetch_offset;
 		     i < EM_HASH_LOOKUP_COUNT && pos < nb_rx; i++, pos++) {
 			rte_prefetch0(rte_pktmbuf_mtod(
 					pkts_burst[pos],
@@ -277,6 +277,9 @@ l3fwd_em_process_events(int nb_rx, struct rte_event **ev,
 	for (j = 0; j < nb_rx; j++)
 		pkts_burst[j] = ev[j]->mbuf;
 
+	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
+		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], struct rte_ether_hdr *) + 1);
+
 	for (j = 0; j < n; j += EM_HASH_LOOKUP_COUNT) {
 
 		uint32_t pkt_type = RTE_PTYPE_L3_MASK |
@@ -289,7 +292,7 @@ l3fwd_em_process_events(int nb_rx, struct rte_event **ev,
 		l3_type = pkt_type & RTE_PTYPE_L3_MASK;
 		tcp_or_udp = pkt_type & (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
 
-		for (i = 0, pos = j + EM_HASH_LOOKUP_COUNT;
+		for (i = 0, pos = j + prefetch_offset;
 		     i < EM_HASH_LOOKUP_COUNT && pos < nb_rx; i++, pos++) {
 			rte_prefetch0(rte_pktmbuf_mtod(
 					pkts_burst[pos],
diff --git a/examples/l3fwd/l3fwd_em_sequential.h b/examples/l3fwd/l3fwd_em_sequential.h
index 3a40b2e434..f2c6ceb7c0 100644
--- a/examples/l3fwd/l3fwd_em_sequential.h
+++ b/examples/l3fwd/l3fwd_em_sequential.h
@@ -81,20 +81,19 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	int32_t i, j;
 	uint16_t dst_port[SENDM_PORT_OVERHEAD(MAX_PKT_BURST)];
 
-	if (nb_rx > 0) {
-		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[0],
+	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
+		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
 						struct rte_ether_hdr *) + 1);
-	}
 
-	for (i = 1, j = 0; j < nb_rx; i++, j++) {
-		if (i < nb_rx) {
-			rte_prefetch0(rte_pktmbuf_mtod(
-					pkts_burst[i],
-					struct rte_ether_hdr *) + 1);
-		}
+	for (j = 0; j < nb_rx - prefetch_offset; j++) {
+		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j + prefetch_offset],
+					       struct rte_ether_hdr *) + 1);
 		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
 	}
 
+	for (; j < nb_rx; j++)
+		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
+
 	send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
 }
 
@@ -106,20 +105,26 @@ static inline void
 l3fwd_em_process_events(int nb_rx, struct rte_event **events,
 		     struct lcore_conf *qconf)
 {
+	struct rte_mbuf *mbuf;
+	uint16_t port;
 	int32_t i, j;
 
-	rte_prefetch0(rte_pktmbuf_mtod(events[0]->mbuf,
-		      struct rte_ether_hdr *) + 1);
+	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
+		rte_prefetch0(rte_pktmbuf_mtod(events[i]->mbuf, struct rte_ether_hdr *) + 1);
 
-	for (i = 1, j = 0; j < nb_rx; i++, j++) {
-		struct rte_mbuf *mbuf = events[j]->mbuf;
-		uint16_t port;
+	for (j = 0; j < nb_rx - prefetch_offset; j++) {
+		rte_prefetch0(rte_pktmbuf_mtod(events[j + prefetch_offset]->mbuf,
+					       struct rte_ether_hdr *) + 1);
+		mbuf = events[j]->mbuf;
+		port = mbuf->port;
+		mbuf->port = em_get_dst_port(qconf, mbuf, mbuf->port);
+		process_packet(mbuf, &mbuf->port);
+		if (mbuf->port == BAD_PORT)
+			mbuf->port = port;
+	}
 
-		if (i < nb_rx) {
-			rte_prefetch0(rte_pktmbuf_mtod(
-					events[i]->mbuf,
-					struct rte_ether_hdr *) + 1);
-		}
+	for (; j < nb_rx; j++) {
+		mbuf = events[j]->mbuf;
 		port = mbuf->port;
 		mbuf->port = em_get_dst_port(qconf, mbuf, mbuf->port);
 		process_packet(mbuf, &mbuf->port);
@@ -136,17 +141,22 @@ l3fwd_em_process_event_vector(struct rte_event_vector *vec,
 	struct rte_mbuf **mbufs = vec->mbufs;
 	int32_t i, j;
 
-	rte_prefetch0(rte_pktmbuf_mtod(mbufs[0], struct rte_ether_hdr *) + 1);
+	for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
+		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], struct rte_ether_hdr *) + 1);
 
-	for (i = 0, j = 1; i < vec->nb_elem; i++, j++) {
-		if (j < vec->nb_elem)
-			rte_prefetch0(rte_pktmbuf_mtod(mbufs[j],
-						       struct rte_ether_hdr *) +
-				      1);
+	for (i = 0; i < vec->nb_elem - prefetch_offset; i++) {
+		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + prefetch_offset],
+					       struct rte_ether_hdr *) + 1);
 		dst_ports[i] = em_get_dst_port(qconf, mbufs[i],
 					       attr_valid ? vec->port :
 							    mbufs[i]->port);
 	}
+
+	for (; i < vec->nb_elem; i++)
+		dst_ports[i] = em_get_dst_port(qconf, mbufs[i],
+					       attr_valid ? vec->port :
+							    mbufs[i]->port);
+
 	j = RTE_ALIGN_FLOOR(vec->nb_elem, FWDSTEP);
 
 	for (i = 0; i != j; i += FWDSTEP)
diff --git a/examples/l3fwd/l3fwd_fib.c b/examples/l3fwd/l3fwd_fib.c
index 82f1739df7..25192611c5 100644
--- a/examples/l3fwd/l3fwd_fib.c
+++ b/examples/l3fwd/l3fwd_fib.c
@@ -24,9 +24,6 @@
 #include "l3fwd_event.h"
 #include "l3fwd_route.h"
 
-/* Configure how many packets ahead to prefetch for fib. */
-#define FIB_PREFETCH_OFFSET 4
-
 /* A non-existent portid is needed to denote a default hop for fib. */
 #define FIB_DEFAULT_HOP 999
 
@@ -130,14 +127,14 @@ fib_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	int32_t i;
 
 	/* Prefetch first packets. */
-	for (i = 0; i < FIB_PREFETCH_OFFSET && i < nb_rx; i++)
+	for (i = 0; i < prefetch_offset && i < nb_rx; i++)
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
 
 	/* Parse packet info and prefetch. */
-	for (i = 0; i < (nb_rx - FIB_PREFETCH_OFFSET); i++) {
+	for (i = 0; i < (nb_rx - prefetch_offset); i++) {
 		/* Prefetch packet. */
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
-				i + FIB_PREFETCH_OFFSET], void *));
+				i + prefetch_offset], void *));
 		fib_parse_packet(pkts_burst[i],
 				&ipv4_arr[ipv4_cnt], &ipv4_cnt,
 				&ipv6_arr[ipv6_cnt], &ipv6_cnt,
@@ -302,11 +299,11 @@ fib_event_loop(struct l3fwd_event_resources *evt_rsrc,
 		ipv6_arr_assem = 0;
 
 		/* Prefetch first packets. */
-		for (i = 0; i < FIB_PREFETCH_OFFSET && i < nb_deq; i++)
+		for (i = 0; i < prefetch_offset && i < nb_deq; i++)
 			rte_prefetch0(rte_pktmbuf_mtod(events[i].mbuf, void *));
 
 		/* Parse packet info and prefetch. */
-		for (i = 0; i < (nb_deq - FIB_PREFETCH_OFFSET); i++) {
+		for (i = 0; i < (nb_deq - prefetch_offset); i++) {
 			if (flags & L3FWD_EVENT_TX_ENQ) {
 				events[i].queue_id = tx_q_id;
 				events[i].op = RTE_EVENT_OP_FORWARD;
@@ -318,7 +315,7 @@ fib_event_loop(struct l3fwd_event_resources *evt_rsrc,
 
 			/* Prefetch packet. */
 			rte_prefetch0(rte_pktmbuf_mtod(events[
-					i + FIB_PREFETCH_OFFSET].mbuf,
+					i + prefetch_offset].mbuf,
 					void *));
 
 			fib_parse_packet(events[i].mbuf,
@@ -455,12 +452,12 @@ fib_process_event_vector(struct rte_event_vector *vec, uint8_t *type_arr,
 	ipv6_arr_assem = 0;
 
 	/* Prefetch first packets. */
-	for (i = 0; i < FIB_PREFETCH_OFFSET && i < vec->nb_elem; i++)
+	for (i = 0; i < prefetch_offset && i < vec->nb_elem; i++)
 		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], void *));
 
 	/* Parse packet info and prefetch. */
-	for (i = 0; i < (vec->nb_elem - FIB_PREFETCH_OFFSET); i++) {
-		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + FIB_PREFETCH_OFFSET],
+	for (i = 0; i < (vec->nb_elem - prefetch_offset); i++) {
+		rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + prefetch_offset],
 					       void *));
 		fib_parse_packet(mbufs[i], &ipv4_arr[ipv4_cnt], &ipv4_cnt,
 				 &ipv6_arr[ipv6_cnt], &ipv6_cnt, &type_arr[i]);
diff --git a/examples/l3fwd/l3fwd_lpm.h b/examples/l3fwd/l3fwd_lpm.h
index 4ee61e8d88..d81aa2efaf 100644
--- a/examples/l3fwd/l3fwd_lpm.h
+++ b/examples/l3fwd/l3fwd_lpm.h
@@ -82,13 +82,13 @@ l3fwd_lpm_no_opt_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	int32_t j;
 
 	/* Prefetch first packets */
-	for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
+	for (j = 0; j < prefetch_offset && j < nb_rx; j++)
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
 
 	/* Prefetch and forward already prefetched packets. */
-	for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+	for (j = 0; j < (nb_rx - prefetch_offset); j++) {
 		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
-				j + PREFETCH_OFFSET], void *));
+				j + prefetch_offset], void *));
 		l3fwd_lpm_simple_forward(pkts_burst[j], portid, qconf);
 	}
 
diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
index 3c1f827424..5570a11687 100644
--- a/examples/l3fwd/l3fwd_lpm_neon.h
+++ b/examples/l3fwd/l3fwd_lpm_neon.h
@@ -85,23 +85,20 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 			uint16_t portid, uint16_t *dst_port,
 			struct lcore_conf *qconf, const uint8_t do_step3)
 {
-	int32_t i = 0, j = 0;
+	int32_t i = 0, j = 0, pos = 0;
 	int32x4_t dip;
 	uint32_t ipv4_flag;
 	const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
 	const int32_t m = nb_rx % FWDSTEP;
 
 	if (k) {
-		for (i = 0; i < FWDSTEP; i++) {
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
-							void *));
-		}
-		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
-			for (i = 0; i < FWDSTEP; i++) {
-				rte_prefetch0(rte_pktmbuf_mtod(
-						pkts_burst[j + i + FWDSTEP],
-						void *));
-			}
+		for (i = 0; i < prefetch_offset && i < k; i++)
+			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
+
+		for (j = 0; j != k; j += FWDSTEP) {
+			for (i = 0, pos = j + prefetch_offset;
+			     i < FWDSTEP && pos < k; i++, pos++)
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos], void *));
 
 			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
 			processx4_step2(qconf, dip, ipv4_flag, portid,
@@ -109,35 +106,9 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 			if (do_step3)
 				processx4_step3(&pkts_burst[j], &dst_port[j]);
 		}
-
-		processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
-		processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
-				&dst_port[j]);
-		if (do_step3)
-			processx4_step3(&pkts_burst[j], &dst_port[j]);
-
-		j += FWDSTEP;
 	}
 
 	if (m) {
-		/* Prefetch last up to 3 packets one by one */
-		switch (m) {
-		case 3:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-			/* fallthrough */
-		case 2:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-			/* fallthrough */
-		case 1:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-		}
-		j -= m;
 		/* Classify last up to 3 packets one by one */
 		switch (m) {
 		case 3:
diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
index 994b7dd8e5..0920c0b2f6 100644
--- a/examples/l3fwd/main.c
+++ b/examples/l3fwd/main.c
@@ -59,6 +59,7 @@ uint16_t nb_rxd = RX_DESC_DEFAULT;
 uint16_t nb_txd = TX_DESC_DEFAULT;
 uint32_t nb_pkt_per_burst = DEFAULT_PKT_BURST;
 uint32_t mb_mempool_cache_size = MEMPOOL_CACHE_SIZE;
+uint16_t prefetch_offset = DEFAULT_PREFECH_OFFSET;
 
 /**< Ports set in promiscuous mode off by default. */
 static int promiscuous_on;
@@ -769,6 +770,7 @@ static const char short_options[] =
 #define CMD_LINE_OPT_ALG "alg"
 #define CMD_LINE_OPT_PKT_BURST "burst"
 #define CMD_LINE_OPT_MB_CACHE_SIZE "mbcache"
+#define CMD_PREFETCH_OFFSET "prefetch-offset"
 
 enum {
 	/* long options mapped to a short option */
@@ -800,6 +802,7 @@ enum {
 	CMD_LINE_OPT_VECTOR_TMO_NS_NUM,
 	CMD_LINE_OPT_PKT_BURST_NUM,
 	CMD_LINE_OPT_MB_CACHE_SIZE_NUM,
+	CMD_PREFETCH_OFFSET_NUM,
 };
 
 static const struct option lgopts[] = {
@@ -828,6 +831,7 @@ static const struct option lgopts[] = {
 	{CMD_LINE_OPT_ALG, 1, 0, CMD_LINE_OPT_ALG_NUM},
 	{CMD_LINE_OPT_PKT_BURST, 1, 0, CMD_LINE_OPT_PKT_BURST_NUM},
 	{CMD_LINE_OPT_MB_CACHE_SIZE, 1, 0, CMD_LINE_OPT_MB_CACHE_SIZE_NUM},
+	{CMD_PREFETCH_OFFSET, 1, 0, CMD_PREFETCH_OFFSET_NUM},
 	{NULL, 0, 0, 0}
 };
 
@@ -1017,6 +1021,9 @@ parse_args(int argc, char **argv)
 		case CMD_LINE_OPT_ALG_NUM:
 			l3fwd_set_alg(optarg);
 			break;
+		case CMD_PREFETCH_OFFSET_NUM:
+			prefetch_offset = strtol(optarg, NULL, 10);
+			break;
 		default:
 			print_usage(prgname);
 			return -1;
@@ -1054,6 +1061,13 @@ parse_args(int argc, char **argv)
 	}
 #endif
 
+	if (prefetch_offset > nb_pkt_per_burst) {
+		fprintf(stderr, "Prefetch offset (%u) cannot be greater than burst size (%u). "
+			"Using burst size %u.\n",
+			prefetch_offset, nb_pkt_per_burst, nb_pkt_per_burst);
+		prefetch_offset = nb_pkt_per_burst;
+	}
+
 	/*
 	 * Nothing is selected, pick longest-prefix match
 	 * as default match.
-- 
2.33.0

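The patch above assigns the result of strtol() directly to a uint16_t, so a negative, oversized, or non-numeric argument would be truncated or read as 0 without any diagnostic. One possible way to validate the value is sketched below. The helper `parse_prefetch_offset()` and its parameters are hypothetical and not part of the patch; the clamp to the burst size mirrors the check the patch adds at the end of parse_args().

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal prefetch offset from `arg`.
 * Returns 0 and stores the value (clamped to `burst`) in *out on
 * success, or -1 if the argument is empty, negative, not a pure
 * decimal number, or does not fit in 16 bits.
 */
static int
parse_prefetch_offset(const char *arg, uint32_t burst, uint16_t *out)
{
	char *end = NULL;
	unsigned long v;

	if (arg == NULL || *arg == '\0' || *arg == '-')
		return -1;

	errno = 0;
	v = strtoul(arg, &end, 10);
	/* Reject overflow, trailing garbage, and values above 16 bits. */
	if (errno != 0 || *end != '\0' || v > UINT16_MAX)
		return -1;

	/* An offset larger than the burst is clamped to the burst size. */
	if (v > burst)
		v = burst;

	*out = (uint16_t)v;
	return 0;
}
```

A caller in the option switch could then print the usage text and return -1 when the helper fails, in the same way the other options are handled.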
* Re: [PATCH] examples/l3fwd: add option to set refetch offset
From: Stephen Hemminger @ 2025-01-10 17:19 UTC
To: Dengdui Huang
Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong, fengchengwen, haijie1, liuyonglong

On Fri, 10 Jan 2025 17:37:15 +0800
Dengdui Huang <huangdengdui@huawei.com> wrote:

> +#define DEFAULT_PREFECH_OFFSET 4

Spelling

* Re: [PATCH] examples/l3fwd: add option to set refetch offset
From: huangdengdui @ 2025-01-14 9:23 UTC
To: Stephen Hemminger
Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong, fengchengwen, haijie1, liuyonglong

On 2025/1/11 1:19, Stephen Hemminger wrote:
> On Fri, 10 Jan 2025 17:37:15 +0800
> Dengdui Huang <huangdengdui@huawei.com> wrote:
>
>> +#define DEFAULT_PREFECH_OFFSET 4
>
> Spelling

I made a mistake. I'll fix it in the next version.

* Re: [PATCH] examples/l3fwd: add option to set refetch offset
From: Stephen Hemminger @ 2025-01-10 17:20 UTC
To: Dengdui Huang
Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong, fengchengwen, haijie1, liuyonglong

On Fri, 10 Jan 2025 17:37:15 +0800
Dengdui Huang <huangdengdui@huawei.com> wrote:

> The prefetch window depends on the HW platform. It is difficult to
> measure the prefetch window of a HW platform. Therefore, a prefetch
> offset option is added to change the prefetch window. Users can adjust
> the prefetch offset to achieve the best prefetch effect.
>
> In addition, this option is used only in the main loop.
>
> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
> ---

This will make it slower for many platforms.
GCC will unroll a loop of fixed small size, which is what we want.

* Re: [PATCH] examples/l3fwd: add option to set refetch offset
From: huangdengdui @ 2025-01-14 9:22 UTC
To: Stephen Hemminger
Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong, fengchengwen, haijie1, liuyonglong

On 2025/1/11 1:20, Stephen Hemminger wrote:
> This will make it slower for many platforms.
> GCC will unroll a loop of fixed small size, which is what we want.

Do you mean replacing the option with a macro?
But most uses of prefetch_offset are combined with nb_rx, so using a macro is the same as using an option.

	const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
	for (j = 0; j != k; j += FWDSTEP) {
		for (i = 0, pos = j + prefetch_offset;
		     i < FWDSTEP && pos < k; i++, pos++)
			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos], void *));
		processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
		processx4_step2(qconf, dip, ipv4_flag, portid,
				&pkts_burst[j], &dst_port[j]);
		if (do_step3)
			processx4_step3(&pkts_burst[j], &dst_port[j]);
	}

The option can dynamically adjust the prefetch window, which makes it easier to find the best prefetch window for a HW platform.
So I think it's better to use an option.

* Re: [PATCH] examples/l3fwd: add option to set refetch offset
  2025-01-14 9:22 ` huangdengdui
@ 2025-01-14 16:07 ` Stephen Hemminger
  2025-01-14 16:11 ` Stephen Hemminger
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2025-01-14 16:07 UTC
To: huangdengdui
Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong, fengchengwen, haijie1, liuyonglong

On Tue, 14 Jan 2025 17:22:08 +0800
huangdengdui <huangdengdui@huawei.com> wrote:

> On 2025/1/11 1:20, Stephen Hemminger wrote:
> > This will make it slower for many platforms.
> > GCC will unroll a loop of fixed small size, which is what we want.
>
> Do you mean replacing the option with a macro?
> But prefetch_offset is mostly used together with nb_rx, so using a macro is the same as using an option.
>
>     const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
>     for (j = 0; j != k; j += FWDSTEP) {
>         for (i = 0, pos = j + prefetch_offset;
>              i < FWDSTEP && pos < k; i++, pos++)
>             rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos], void *));
>         processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
>         processx4_step2(qconf, dip, ipv4_flag, portid,
>                         &pkts_burst[j], &dst_port[j]);
>         if (do_step3)
>             processx4_step3(&pkts_burst[j], &dst_port[j]);
>     }
>
> The option can dynamically adjust the prefetch window, which makes it easier to find the prefetch window for a HW platform.
> So I think it's better to use an option.

The tradeoff is that loop unrolling is most often done only on small fixed-size loops.
And the cost of a loop with small variable bounds (branch prediction) is high enough that it
could make things slower.

Prefetching is a balancing act, and more is not better, especially on real workloads.
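One way to read the fixed-size argument above is sketched below: the lookahead is a compile-time constant and the main loop is split so that the inner prefetch loop always runs exactly FWDSTEP times with no per-iteration bounds check, which is the shape compilers typically unroll. This is an illustration only, not code from l3fwd. PREFETCH_OFFSET and sum_fixed_lookahead() are hypothetical names, __builtin_prefetch (a GCC/Clang builtin) stands in for rte_prefetch0(), and summing the pointed-to values stands in for packet processing. It makes no claim about speed; that has to be measured per platform.

```c
#include <assert.h>
#include <stdint.h>

#define FWDSTEP 4
/* Hypothetical compile-time lookahead; assumed to be a multiple of FWDSTEP */
#define PREFETCH_OFFSET 8

/* Sums *items[0..nb_rx-1], prefetching PREFETCH_OFFSET entries ahead. */
static uint64_t
sum_fixed_lookahead(const uint32_t *const *items, int nb_rx)
{
	const int k = nb_rx - (nb_rx % FWDSTEP);
	/* Part of the burst where every lookahead index stays below k */
	const int steady = k > PREFETCH_OFFSET ? k - PREFETCH_OFFSET : 0;
	uint64_t sum = 0;
	int i, j;

	assert(nb_rx >= 0);

	for (j = 0; j < steady; j += FWDSTEP) {
		/* Fixed trip count, no bounds check needed */
		for (i = 0; i < FWDSTEP; i++)
			__builtin_prefetch(items[j + PREFETCH_OFFSET + i]);
		for (i = 0; i < FWDSTEP; i++)
			sum += *items[j + i];
	}

	/* Drain the rest of the burst without issuing further prefetches */
	for (; j < nb_rx; j++)
		sum += *items[j];

	return sum;
}
```

The cost of this shape is that changing the lookahead needs a rebuild, which is the convenience the runtime option was meant to provide.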
* Re: [PATCH] examples/l3fwd: add option to set refetch offset
  2025-01-14 16:07 ` Stephen Hemminger
@ 2025-01-14 16:11 ` Stephen Hemminger
  0 siblings, 0 replies; 11+ messages in thread
From: Stephen Hemminger @ 2025-01-14 16:11 UTC
To: huangdengdui
Cc: dev, konstantin.ananyev, wathsala.vithanage, lihuisong, fengchengwen, haijie1, liuyonglong

On Tue, 14 Jan 2025 08:07:52 -0800
Stephen Hemminger <stephen@networkplumber.org> wrote:

> On Tue, 14 Jan 2025 17:22:08 +0800
> huangdengdui <huangdengdui@huawei.com> wrote:
>
> > On 2025/1/11 1:20, Stephen Hemminger wrote:
> > > This will make it slower for many platforms.
> > > GCC will unroll a loop of fixed small size, which is what we want.
> >
> > Do you mean replacing the option with a macro?
> > But prefetch_offset is mostly used together with nb_rx, so using a macro is the same as using an option.
> >
> >     const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
> >     for (j = 0; j != k; j += FWDSTEP) {
> >         for (i = 0, pos = j + prefetch_offset;
> >              i < FWDSTEP && pos < k; i++, pos++)
> >             rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[pos], void *));
> >         processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
> >         processx4_step2(qconf, dip, ipv4_flag, portid,
> >                         &pkts_burst[j], &dst_port[j]);
> >         if (do_step3)
> >             processx4_step3(&pkts_burst[j], &dst_port[j]);
> >     }
> >
> > The option can dynamically adjust the prefetch window, which makes it easier to find the prefetch window for a HW platform.
> > So I think it's better to use an option.
>
> The tradeoff is that loop unrolling is most often done only on small fixed-size loops.
> And the cost of a loop with small variable bounds (branch prediction) is high enough that it
> could make things slower.
>
> Prefetching is a balancing act, and more is not better, especially on real workloads.

You might also want to look at the quad loop model used in VPP for prefetching.

https://my-vpp-docs.readthedocs.io/en/latest/gettingstarted/developers/vnet.html
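For readers who have not seen the pattern, the following is a rough standalone paraphrase of a quad loop: while at least eight items remain, prefetch items 4..7 and process items 0..3, then finish the remainder one item at a time. It is a sketch of the general shape only and does not use or reproduce VPP's API. quad_loop_sum() is a hypothetical name, __builtin_prefetch (a GCC/Clang builtin) stands in for the framework's prefetch helper, and adding up the pointed-to values stands in for per-packet work.

```c
#include <assert.h>
#include <stdint.h>

/* Sums *b[0..n_left-1] using a hand-unrolled quad loop with lookahead. */
static uint64_t
quad_loop_sum(const uint32_t *const *b, int n_left)
{
	uint64_t sum = 0;

	assert(n_left >= 0);

	/* Quad loop: needs 8 items so that b[4..7] are valid prefetch targets */
	while (n_left >= 8) {
		__builtin_prefetch(b[4]);
		__builtin_prefetch(b[5]);
		__builtin_prefetch(b[6]);
		__builtin_prefetch(b[7]);

		/* Process the four items whose data was requested last round */
		sum += *b[0];
		sum += *b[1];
		sum += *b[2];
		sum += *b[3];

		b += 4;
		n_left -= 4;
	}

	/* Single loop: handles the last (up to 7) items without prefetching */
	while (n_left > 0) {
		sum += *b[0];
		b += 1;
		n_left -= 1;
	}

	return sum;
}
```

Here both the step and the lookahead distance are fixed at four, so there is no inner loop with a variable bound at all; the price is that the prefetch distance cannot be tuned without rewriting the loop.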
end of thread, other threads:[~2025-01-14 16:11 UTC | newest]

Thread overview: 11+ messages
2024-12-25 7:53 [PATCH] examples/l3fwd: optimize packet prefetch Dengdui Huang
2024-12-25 21:21 ` Stephen Hemminger
2025-01-08 13:42 ` Konstantin Ananyev
2025-01-09 11:31 ` huangdengdui
2025-01-10 9:37 ` [PATCH] examples/l3fwd: add option to set refetch offset Dengdui Huang
2025-01-10 17:19 ` Stephen Hemminger
2025-01-14 9:23 ` huangdengdui
2025-01-10 17:20 ` Stephen Hemminger
2025-01-14 9:22 ` huangdengdui
2025-01-14 16:07 ` Stephen Hemminger
2025-01-14 16:11 ` Stephen Hemminger