DPDK patches and discussions
* [dpdk-dev] [PATCH 0/4] l3fwd improvements
@ 2021-03-18 10:25 Ruifeng Wang
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 1/4] examples/l3fwd: tune prefetch for better performance Ruifeng Wang
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Ruifeng Wang @ 2021-03-18 10:25 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang

This patch series contains changes to the l3fwd example application.
The changes make better use of CPU cycles and memory.

Ruifeng Wang (4):
  examples/l3fwd: tune prefetch for better performance
  examples/l3fwd: eliminate unnecessary calculations
  examples/l3fwd: eliminate unnecessary reloads in loop
  examples/l3fwd: make data struct to be memory efficient

 examples/l3fwd/l3fwd.h          | 12 ++++++------
 examples/l3fwd/l3fwd_common.h   |  4 ++--
 examples/l3fwd/l3fwd_em.c       |  6 +++---
 examples/l3fwd/l3fwd_lpm.c      | 16 +++++++++-------
 examples/l3fwd/l3fwd_lpm_neon.h | 20 ++++++++++----------
 5 files changed, 30 insertions(+), 28 deletions(-)

-- 
2.25.1



* [dpdk-dev] [PATCH 1/4] examples/l3fwd: tune prefetch for better performance
  2021-03-18 10:25 [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
@ 2021-03-18 10:25 ` Ruifeng Wang
  2021-04-13 18:50   ` Jerin Jacob
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 2/4] examples/l3fwd: eliminate unnecessary calculations Ruifeng Wang
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Ruifeng Wang @ 2021-03-18 10:25 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang

The packet header is prefetched before packet processing for better
memory access performance. Because l3fwd updates the L2 header,
prefetching with a store hint puts the cache line into the appropriate
state for a write and reduces cache maintenance overhead.

With this change, a 12.9% performance uplift was measured on the N1SDP
platform with an MLX5 NIC.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 examples/l3fwd/l3fwd_lpm_neon.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
index d6c0ba64a..ae8840694 100644
--- a/examples/l3fwd/l3fwd_lpm_neon.h
+++ b/examples/l3fwd/l3fwd_lpm_neon.h
@@ -97,13 +97,13 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 
 	if (k) {
 		for (i = 0; i < FWDSTEP; i++) {
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
+			rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[i],
 						struct rte_ether_hdr *) + 1);
 		}
 
 		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
 			for (i = 0; i < FWDSTEP; i++) {
-				rte_prefetch0(rte_pktmbuf_mtod(
+				rte_prefetch0_write(rte_pktmbuf_mtod(
 						pkts_burst[j + i + FWDSTEP],
 						struct rte_ether_hdr *) + 1);
 			}
@@ -124,17 +124,17 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 		/* Prefetch last up to 3 packets one by one */
 		switch (m) {
 		case 3:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
+			rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
 						struct rte_ether_hdr *) + 1);
 			j++;
 			/* fallthrough */
 		case 2:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
+			rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
 						struct rte_ether_hdr *) + 1);
 			j++;
 			/* fallthrough */
 		case 1:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
+			rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
 						struct rte_ether_hdr *) + 1);
 			j++;
 		}
-- 
2.25.1
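
The store-hint prefetch described in the 1/4 commit message can be sketched outside DPDK roughly as follows. This is only an illustration under stated assumptions: `struct pkt`, `prefetch_for_write()` and `touch_headers()` are hypothetical names, not l3fwd code, and the sketch relies on the GCC/Clang builtin `__builtin_prefetch` rather than on `rte_prefetch0_write()` itself.

```c
#include <stdint.h>

/* Hypothetical stand-in for a packet buffer; not a DPDK type. */
struct pkt {
	uint8_t data[64];
};

/*
 * Prefetch with a store hint. In __builtin_prefetch(addr, rw, locality),
 * rw = 1 signals that the line is about to be written and locality = 3
 * asks to keep it in all cache levels. It is a hint only; the hardware
 * is free to ignore it.
 */
static inline void prefetch_for_write(const void *p)
{
	__builtin_prefetch(p, 1, 3);
}

/*
 * Overwrite the first byte of every packet in the burst, prefetching
 * the next packet for store while the current one is being modified.
 * Returns the number of packets touched.
 */
static unsigned int touch_headers(struct pkt **burst, unsigned int n,
				  uint8_t tag)
{
	unsigned int i;

	if (n > 0)
		prefetch_for_write(burst[0]->data);

	for (i = 0; i < n; i++) {
		if (i + 1 < n)
			prefetch_for_write(burst[i + 1]->data);
		burst[i]->data[0] = tag;
	}
	return n;
}
```

Whether the store hint helps depends on the core: the functional result is identical to a read prefetch, so any difference shows up only in measurements on the target platform.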



* [dpdk-dev] [PATCH 2/4] examples/l3fwd: eliminate unnecessary calculations
  2021-03-18 10:25 [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 1/4] examples/l3fwd: tune prefetch for better performance Ruifeng Wang
@ 2021-03-18 10:25 ` Ruifeng Wang
  2021-04-13 18:40   ` Jerin Jacob
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 3/4] examples/l3fwd: eliminate unnecessary reloads in loop Ruifeng Wang
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Ruifeng Wang @ 2021-03-18 10:25 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang

Both the L2 and L3 headers are used in forward processing, and the
two headers are in the same cache line. Prefetching with the L2 header
address therefore has the same effect as prefetching with the L3
header address.

Changed to use the L2 header address for prefetching. The change showed
no measurable performance improvement, but it removes the
unnecessary instructions for address calculation.

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 examples/l3fwd/l3fwd_lpm_neon.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
index ae8840694..1650ae444 100644
--- a/examples/l3fwd/l3fwd_lpm_neon.h
+++ b/examples/l3fwd/l3fwd_lpm_neon.h
@@ -98,14 +98,14 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	if (k) {
 		for (i = 0; i < FWDSTEP; i++) {
 			rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[i],
-						struct rte_ether_hdr *) + 1);
+							void *));
 		}
 
 		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
 			for (i = 0; i < FWDSTEP; i++) {
 				rte_prefetch0_write(rte_pktmbuf_mtod(
 						pkts_burst[j + i + FWDSTEP],
-						struct rte_ether_hdr *) + 1);
+						void *));
 			}
 
 			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
@@ -125,17 +125,17 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 		switch (m) {
 		case 3:
 			rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
-						struct rte_ether_hdr *) + 1);
+							void *));
 			j++;
 			/* fallthrough */
 		case 2:
 			rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
-						struct rte_ether_hdr *) + 1);
+							void *));
 			j++;
 			/* fallthrough */
 		case 1:
 			rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
-						struct rte_ether_hdr *) + 1);
+							void *));
 			j++;
 		}
 
-- 
2.25.1
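
The claim in the 2/4 commit message, that the L2 and L3 header addresses fall into the same cache line, can be checked with a few lines of address arithmetic. The sketch below is an illustration only: the 64-byte line size is an assumption (some Arm SoCs use 128 bytes), and the helper names are hypothetical. It also shows that the claim depends on where the packet data starts within a line.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u	/* assumption; platform dependent */
#define ETHER_HDR_LEN   14u	/* dst MAC (6) + src MAC (6) + EtherType (2) */

/* True when both addresses fall into the same cache line. */
static bool same_cache_line(uintptr_t a, uintptr_t b)
{
	return (a / CACHE_LINE_SIZE) == (b / CACHE_LINE_SIZE);
}

/*
 * True when the start of the L3 header (L2 address plus the Ethernet
 * header length) shares a cache line with the start of the L2 header.
 */
static bool l2_l3_share_line(uintptr_t l2_addr)
{
	return same_cache_line(l2_addr, l2_addr + ETHER_HDR_LEN);
}
```

With packet data that starts on a cache-line boundary the two prefetch addresses are equivalent; an L2 header that starts within the last 14 bytes of a line would be the exception.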



* [dpdk-dev] [PATCH 3/4] examples/l3fwd: eliminate unnecessary reloads in loop
  2021-03-18 10:25 [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 1/4] examples/l3fwd: tune prefetch for better performance Ruifeng Wang
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 2/4] examples/l3fwd: eliminate unnecessary calculations Ruifeng Wang
@ 2021-03-18 10:25 ` Ruifeng Wang
  2021-04-13 17:43   ` Jerin Jacob
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 4/4] examples/l3fwd: make data struct to be memory efficient Ruifeng Wang
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Ruifeng Wang @ 2021-03-18 10:25 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang

The number of RX queues and the number of TX ports in the lcore config
are constant while the l3fwd application is running, but the compiler
does not have this information.

Copied the values from the lcore config to local variables and used the
local variables for iteration. The compiler can see that the local
variables do not change, so the qconf reloads at each iteration can be
eliminated.

The change showed a 1.8% performance uplift in a single-core,
single-port, single-queue test on the N1SDP platform with an MLX5 NIC.

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 examples/l3fwd/l3fwd_lpm.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
index 3dcf1fef1..d338590b9 100644
--- a/examples/l3fwd/l3fwd_lpm.c
+++ b/examples/l3fwd/l3fwd_lpm.c
@@ -190,14 +190,16 @@ lpm_main_loop(__rte_unused void *dummy)
 	lcore_id = rte_lcore_id();
 	qconf = &lcore_conf[lcore_id];
 
-	if (qconf->n_rx_queue == 0) {
+	uint16_t n_rx_q = qconf->n_rx_queue;
+	uint16_t n_tx_p = qconf->n_tx_port;
+	if (n_rx_q == 0) {
 		RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id);
 		return 0;
 	}
 
 	RTE_LOG(INFO, L3FWD, "entering main loop on lcore %u\n", lcore_id);
 
-	for (i = 0; i < qconf->n_rx_queue; i++) {
+	for (i = 0; i < n_rx_q; i++) {
 
 		portid = qconf->rx_queue_list[i].port_id;
 		queueid = qconf->rx_queue_list[i].queue_id;
@@ -216,7 +218,7 @@ lpm_main_loop(__rte_unused void *dummy)
 		diff_tsc = cur_tsc - prev_tsc;
 		if (unlikely(diff_tsc > drain_tsc)) {
 
-			for (i = 0; i < qconf->n_tx_port; ++i) {
+			for (i = 0; i < n_tx_p; ++i) {
 				portid = qconf->tx_port_id[i];
 				if (qconf->tx_mbufs[portid].len == 0)
 					continue;
@@ -232,7 +234,7 @@ lpm_main_loop(__rte_unused void *dummy)
 		/*
 		 * Read packet from RX queues
 		 */
-		for (i = 0; i < qconf->n_rx_queue; ++i) {
+		for (i = 0; i < n_rx_q; ++i) {
 			portid = qconf->rx_queue_list[i].port_id;
 			queueid = qconf->rx_queue_list[i].queue_id;
 			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
-- 
2.25.1
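
The reload problem addressed by 3/4 can be reproduced in a small standalone sketch. The names here (`struct conf`, `poll_queue()`, `poll_all_reload()`, `poll_all_hoisted()`) are hypothetical and only mirror the shape of the l3fwd loop. The callee receives a non-const pointer to the config, so the compiler has to assume the counter may change across the call and reload it on every iteration unless it has been copied into a local first.

```c
#include <stdint.h>

/* Hypothetical config; mirrors only the loop bound of lcore_conf. */
struct conf {
	uint16_t n_rx_queue;
};

/*
 * Opaque callee standing in for an RX burst call. It is not inlined
 * and takes a writable pointer, so the caller cannot prove that
 * c->n_rx_queue is unchanged after the call.
 */
__attribute__((noinline))
static uint32_t poll_queue(struct conf *c, uint16_t q)
{
	(void)c;
	return (uint32_t)q + 1;
}

/* Loop bound read through the pointer: reloaded on each iteration. */
static uint32_t poll_all_reload(struct conf *c)
{
	uint32_t total = 0;
	uint16_t i;

	for (i = 0; i < c->n_rx_queue; i++)
		total += poll_queue(c, i);
	return total;
}

/* Loop bound copied to a local once, as done in the patch. */
static uint32_t poll_all_hoisted(struct conf *c)
{
	uint16_t n_rx_q = c->n_rx_queue;
	uint32_t total = 0;
	uint16_t i;

	for (i = 0; i < n_rx_q; i++)
		total += poll_queue(c, i);
	return total;
}
```

Both variants return the same result as long as the counter really is constant; the difference is only in how many loads the generated code performs, which can be confirmed by comparing the disassembly of the two functions.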



* [dpdk-dev] [PATCH 4/4] examples/l3fwd: make data struct to be memory efficient
  2021-03-18 10:25 [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
                   ` (2 preceding siblings ...)
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 3/4] examples/l3fwd: eliminate unnecessary reloads in loop Ruifeng Wang
@ 2021-03-18 10:25 ` Ruifeng Wang
  2021-04-13 19:06   ` Jerin Jacob
  2021-04-13  8:24 ` [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Ruifeng Wang @ 2021-03-18 10:25 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang

There are some holes in the lcore_conf data structure. The holes are
due to alignment requirements.

For struct lcore_rx_queue, there is no need to make every element
of this type cache line aligned, because the data is not
shared between cores.

Member len of struct mbuf_table can be moved out, so that the data is
packed and there is no need to load an extra cache line when the
mbuf table is empty.

The change showed a slight performance improvement on the N1SDP platform.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 examples/l3fwd/l3fwd.h        | 12 ++++++------
 examples/l3fwd/l3fwd_common.h |  4 ++--
 examples/l3fwd/l3fwd_em.c     |  6 +++---
 examples/l3fwd/l3fwd_lpm.c    |  6 +++---
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/examples/l3fwd/l3fwd.h b/examples/l3fwd/l3fwd.h
index 2cf06099e..f3a301e12 100644
--- a/examples/l3fwd/l3fwd.h
+++ b/examples/l3fwd/l3fwd.h
@@ -57,22 +57,22 @@
 #define HASH_ENTRY_NUMBER_DEFAULT	4
 
 struct mbuf_table {
-	uint16_t len;
 	struct rte_mbuf *m_table[MAX_PKT_BURST];
 };
 
 struct lcore_rx_queue {
 	uint16_t port_id;
 	uint8_t queue_id;
-} __rte_cache_aligned;
+};
 
 struct lcore_conf {
-	uint16_t n_rx_queue;
 	struct lcore_rx_queue rx_queue_list[MAX_RX_QUEUE_PER_LCORE];
-	uint16_t n_tx_port;
 	uint16_t tx_port_id[RTE_MAX_ETHPORTS];
 	uint16_t tx_queue_id[RTE_MAX_ETHPORTS];
+	uint16_t tx_mbuf_len[RTE_MAX_ETHPORTS];
 	struct mbuf_table tx_mbufs[RTE_MAX_ETHPORTS];
+	uint16_t n_rx_queue;
+	uint16_t n_tx_port;
 	void *ipv4_lookup_struct;
 	void *ipv6_lookup_struct;
 } __rte_cache_aligned;
@@ -122,7 +122,7 @@ send_single_packet(struct lcore_conf *qconf,
 {
 	uint16_t len;
 
-	len = qconf->tx_mbufs[port].len;
+	len = qconf->tx_mbuf_len[port];
 	qconf->tx_mbufs[port].m_table[len] = m;
 	len++;
 
@@ -132,7 +132,7 @@ send_single_packet(struct lcore_conf *qconf,
 		len = 0;
 	}
 
-	qconf->tx_mbufs[port].len = len;
+	qconf->tx_mbuf_len[port] = len;
 	return 0;
 }
 
diff --git a/examples/l3fwd/l3fwd_common.h b/examples/l3fwd/l3fwd_common.h
index 7d83ff641..05e03dbfc 100644
--- a/examples/l3fwd/l3fwd_common.h
+++ b/examples/l3fwd/l3fwd_common.h
@@ -183,7 +183,7 @@ send_packetsx4(struct lcore_conf *qconf, uint16_t port, struct rte_mbuf *m[],
 {
 	uint32_t len, j, n;
 
-	len = qconf->tx_mbufs[port].len;
+	len = qconf->tx_mbuf_len[port];
 
 	/*
 	 * If TX buffer for that queue is empty, and we have enough packets,
@@ -258,7 +258,7 @@ send_packetsx4(struct lcore_conf *qconf, uint16_t port, struct rte_mbuf *m[],
 		}
 	}
 
-	qconf->tx_mbufs[port].len = len;
+	qconf->tx_mbuf_len[port] = len;
 }
 
 #endif /* _L3FWD_COMMON_H_ */
diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c
index 9996bfba3..1970e0376 100644
--- a/examples/l3fwd/l3fwd_em.c
+++ b/examples/l3fwd/l3fwd_em.c
@@ -662,12 +662,12 @@ em_main_loop(__rte_unused void *dummy)
 
 			for (i = 0; i < qconf->n_tx_port; ++i) {
 				portid = qconf->tx_port_id[i];
-				if (qconf->tx_mbufs[portid].len == 0)
+				if (qconf->tx_mbuf_len[portid] == 0)
 					continue;
 				send_burst(qconf,
-					qconf->tx_mbufs[portid].len,
+					qconf->tx_mbuf_len[portid],
 					portid);
-				qconf->tx_mbufs[portid].len = 0;
+				qconf->tx_mbuf_len[portid] = 0;
 			}
 
 			prev_tsc = cur_tsc;
diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
index d338590b9..e62139a0e 100644
--- a/examples/l3fwd/l3fwd_lpm.c
+++ b/examples/l3fwd/l3fwd_lpm.c
@@ -220,12 +220,12 @@ lpm_main_loop(__rte_unused void *dummy)
 
 			for (i = 0; i < n_tx_p; ++i) {
 				portid = qconf->tx_port_id[i];
-				if (qconf->tx_mbufs[portid].len == 0)
+				if (qconf->tx_mbuf_len[portid] == 0)
 					continue;
 				send_burst(qconf,
-					qconf->tx_mbufs[portid].len,
+					qconf->tx_mbuf_len[portid],
 					portid);
-				qconf->tx_mbufs[portid].len = 0;
+				qconf->tx_mbuf_len[portid] = 0;
 			}
 
 			prev_tsc = cur_tsc;
-- 
2.25.1
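
The layout effect described in 4/4 can be made concrete with `sizeof` and a small cache-line count. The sketch below is an illustration under assumptions: a 64-byte cache line, 32 ports and a burst size of 32 are placeholder values standing in for the real build-time constants, the struct names are hypothetical, and the arrays are assumed to start on a cache-line boundary.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64	/* assumption; platform dependent */
#define MAX_PORTS  32	/* placeholder for RTE_MAX_ETHPORTS */
#define MAX_BURST  32	/* placeholder for MAX_PKT_BURST */

/* RX queue entry padded to a full cache line, as before the patch. */
struct rxq_aligned {
	uint16_t port_id;
	uint8_t queue_id;
} __attribute__((aligned(CACHE_LINE)));

/* The same entry without the alignment attribute. */
struct rxq_plain {
	uint16_t port_id;
	uint8_t queue_id;
};

/* TX table with the length embedded, as before the patch. */
struct table_with_len {
	uint16_t len;
	void *m_table[MAX_BURST];
};

/*
 * Count the distinct cache lines touched when reading n elements of
 * size elem that are stride bytes apart, starting at a line boundary.
 */
static size_t lines_touched(size_t stride, size_t elem, size_t n)
{
	size_t i, count = 0;
	size_t next_free = 0;	/* lowest line index not yet counted */

	for (i = 0; i < n; i++) {
		size_t first = (i * stride) / CACHE_LINE;
		size_t last = (i * stride + elem - 1) / CACHE_LINE;

		if (first < next_free)
			first = next_free;
		if (last >= first) {
			count += last - first + 1;
			next_free = last + 1;
		}
	}
	return count;
}
```

Under these assumptions, `lines_touched(sizeof(struct table_with_len), sizeof(uint16_t), MAX_PORTS)` models checking every embedded `len` and touches one line per port, while `lines_touched(sizeof(uint16_t), sizeof(uint16_t), MAX_PORTS)` models the separate length array and touches a single line.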



* Re: [dpdk-dev] [PATCH 0/4] l3fwd improvements
  2021-03-18 10:25 [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
                   ` (3 preceding siblings ...)
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 4/4] examples/l3fwd: make data struct to be memory efficient Ruifeng Wang
@ 2021-04-13  8:24 ` Ruifeng Wang
  2021-04-13 17:33   ` Jerin Jacob
  2021-04-16  9:39 ` Ling, WeiX
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Ruifeng Wang @ 2021-04-13  8:24 UTC (permalink / raw)
  To: Ruifeng Wang, jerinj, hemant.agrawal, thomas,
	Ajit Khaparde (ajit.khaparde
  Cc: dev, nd, Honnappa Nagarahalli, ferruh.yigit, david.marchand, nd

Hello,

This patch series aims to improve the l3fwd example. A performance gain was observed on the N1SDP platform.

It would be good if you could run this series on your platforms and check whether there is any performance impact.

Thanks,
Ruifeng

> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Thursday, March 18, 2021 6:26 PM
> To: jerinj@marvell.com; hemant.agrawal@nxp.com; ferruh.yigit@intel.com;
> thomas@monjalon.net; david.marchand@redhat.com
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: [PATCH 0/4] l3fwd improvements
> 
> This series of patches include changes to l3fwd example application.
> Some improvements are made for better usage of CPU cycles and memory.
> 
> Ruifeng Wang (4):
>   examples/l3fwd: tune prefetch for better performance
>   examples/l3fwd: eliminate unnecessary calculations
>   examples/l3fwd: eliminate unnecessary reloads in loop
>   examples/l3fwd: make data struct to be memory efficient
> 
>  examples/l3fwd/l3fwd.h          | 12 ++++++------
>  examples/l3fwd/l3fwd_common.h   |  4 ++--
>  examples/l3fwd/l3fwd_em.c       |  6 +++---
>  examples/l3fwd/l3fwd_lpm.c      | 16 +++++++++-------
>  examples/l3fwd/l3fwd_lpm_neon.h | 20 ++++++++++----------
>  5 files changed, 30 insertions(+), 28 deletions(-)
> 
> --
> 2.25.1



* Re: [dpdk-dev] [PATCH 0/4] l3fwd improvements
  2021-04-13  8:24 ` [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
@ 2021-04-13 17:33   ` Jerin Jacob
  0 siblings, 0 replies; 27+ messages in thread
From: Jerin Jacob @ 2021-04-13 17:33 UTC (permalink / raw)
  To: Ruifeng Wang
  Cc: jerinj, hemant.agrawal, thomas, Ajit Khaparde (ajit.khaparde,
	dev, nd, Honnappa Nagarahalli, ferruh.yigit, david.marchand

On Tue, Apr 13, 2021 at 1:54 PM Ruifeng Wang <Ruifeng.Wang@arm.com> wrote:
>
> Hello,
>
> This patch series targeted to improve L3fwd example. Performance gain was observed on N1SDP platform.
>
> It would be good if you can run this series on your platforms and see if there is any performance impact.

I will test and post the results in the respective patches.

>
> Thanks,
> Ruifeng
>
> > -----Original Message-----
> > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > Sent: Thursday, March 18, 2021 6:26 PM
> > To: jerinj@marvell.com; hemant.agrawal@nxp.com; ferruh.yigit@intel.com;
> > thomas@monjalon.net; david.marchand@redhat.com
> > Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> > <Ruifeng.Wang@arm.com>
> > Subject: [PATCH 0/4] l3fwd improvements
> >
> > This series of patches include changes to l3fwd example application.
> > Some improvements are made for better usage of CPU cycles and memory.
> >
> > Ruifeng Wang (4):
> >   examples/l3fwd: tune prefetch for better performance
> >   examples/l3fwd: eliminate unnecessary calculations
> >   examples/l3fwd: eliminate unnecessary reloads in loop
> >   examples/l3fwd: make data struct to be memory efficient
> >
> >  examples/l3fwd/l3fwd.h          | 12 ++++++------
> >  examples/l3fwd/l3fwd_common.h   |  4 ++--
> >  examples/l3fwd/l3fwd_em.c       |  6 +++---
> >  examples/l3fwd/l3fwd_lpm.c      | 16 +++++++++-------
> >  examples/l3fwd/l3fwd_lpm_neon.h | 20 ++++++++++----------
> >  5 files changed, 30 insertions(+), 28 deletions(-)
> >
> > --
> > 2.25.1
>


* Re: [dpdk-dev] [PATCH 3/4] examples/l3fwd: eliminate unnecessary reloads in loop
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 3/4] examples/l3fwd: eliminate unnecessary reloads in loop Ruifeng Wang
@ 2021-04-13 17:43   ` Jerin Jacob
  2021-04-14  6:02     ` Ruifeng Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Jerin Jacob @ 2021-04-13 17:43 UTC (permalink / raw)
  To: Ruifeng Wang
  Cc: Jerin Jacob, Hemant Agrawal, Ferruh Yigit, Thomas Monjalon,
	David Marchand, dpdk-dev, nd, Honnappa Nagarahalli

On Thu, Mar 18, 2021 at 3:56 PM Ruifeng Wang <ruifeng.wang@arm.com> wrote:
>
> Number of rx queue and number of rx port in lcore config are constants
> during the period of l3 forward application running. But compiler has
> no this information.
>
> Copied values from lcore config to local variables and used the local
> variables for iteration. Compiler can see that the local variables are
> not changed, so qconf reloads at each iteration can be eliminated.
>
> The change showed 1.8% performance uplift in single core, single port,
> single queue test on N1SDP platform with MLX5 NIC.

At least on octeontx2, I don't see any performance improvement.
But the change looks good. Please find a comment below.

>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  examples/l3fwd/l3fwd_lpm.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
> index 3dcf1fef1..d338590b9 100644
> --- a/examples/l3fwd/l3fwd_lpm.c
> +++ b/examples/l3fwd/l3fwd_lpm.c
> @@ -190,14 +190,16 @@ lpm_main_loop(__rte_unused void *dummy)
>         lcore_id = rte_lcore_id();
>         qconf = &lcore_conf[lcore_id];
>
> -       if (qconf->n_rx_queue == 0) {
> +       uint16_t n_rx_q = qconf->n_rx_queue;
> +       uint16_t n_tx_p = qconf->n_tx_port;

How about adding const?


* Re: [dpdk-dev] [PATCH 2/4] examples/l3fwd: eliminate unnecessary calculations
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 2/4] examples/l3fwd: eliminate unnecessary calculations Ruifeng Wang
@ 2021-04-13 18:40   ` Jerin Jacob
  0 siblings, 0 replies; 27+ messages in thread
From: Jerin Jacob @ 2021-04-13 18:40 UTC (permalink / raw)
  To: Ruifeng Wang
  Cc: Jerin Jacob, Hemant Agrawal, Ferruh Yigit, Thomas Monjalon,
	David Marchand, dpdk-dev, nd, Honnappa Nagarahalli

On Thu, Mar 18, 2021 at 3:56 PM Ruifeng Wang <ruifeng.wang@arm.com> wrote:
>
> Both L2 and L3 headers will be used in forward processing. And these
> two headers are in the same cache line. It has the same effect for
> prefetching with L2 header address and prefetching with L3 header
> address.
>
> Changed to use L2 header address for prefetching. The change showed
> no measurable performance improvement, but it definitely removed

Same here.

> unnecessary instructions for address calculation.


>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>

Acked-by: Jerin Jacob <jerinj@marvell.com>


> ---
>  examples/l3fwd/l3fwd_lpm_neon.h | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
> index ae8840694..1650ae444 100644
> --- a/examples/l3fwd/l3fwd_lpm_neon.h
> +++ b/examples/l3fwd/l3fwd_lpm_neon.h
> @@ -98,14 +98,14 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>         if (k) {
>                 for (i = 0; i < FWDSTEP; i++) {
>                         rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[i],
> -                                               struct rte_ether_hdr *) + 1);
> +                                                       void *));
>                 }
>
>                 for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
>                         for (i = 0; i < FWDSTEP; i++) {
>                                 rte_prefetch0_write(rte_pktmbuf_mtod(
>                                                 pkts_burst[j + i + FWDSTEP],
> -                                               struct rte_ether_hdr *) + 1);
> +                                               void *));
>                         }
>
>                         processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
> @@ -125,17 +125,17 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>                 switch (m) {
>                 case 3:
>                         rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
> -                                               struct rte_ether_hdr *) + 1);
> +                                                       void *));
>                         j++;
>                         /* fallthrough */
>                 case 2:
>                         rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
> -                                               struct rte_ether_hdr *) + 1);
> +                                                       void *));
>                         j++;
>                         /* fallthrough */
>                 case 1:
>                         rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
> -                                               struct rte_ether_hdr *) + 1);
> +                                                       void *));
>                         j++;
>                 }
>
> --
> 2.25.1
>


* Re: [dpdk-dev] [PATCH 1/4] examples/l3fwd: tune prefetch for better performance
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 1/4] examples/l3fwd: tune prefetch for better performance Ruifeng Wang
@ 2021-04-13 18:50   ` Jerin Jacob
  2021-04-13 20:00     ` Honnappa Nagarahalli
  2021-05-19  7:52     ` Ruifeng Wang
  0 siblings, 2 replies; 27+ messages in thread
From: Jerin Jacob @ 2021-04-13 18:50 UTC (permalink / raw)
  To: Ruifeng Wang
  Cc: Jerin Jacob, Hemant Agrawal, Ferruh Yigit, Thomas Monjalon,
	David Marchand, dpdk-dev, nd, Honnappa Nagarahalli

On Thu, Mar 18, 2021 at 3:56 PM Ruifeng Wang <ruifeng.wang@arm.com> wrote:
>
> Packet header is prefetched before packet processing for better
> memory access performance. As L2 header will be updated by l3fwd,
> using of prefetch for store hint will set cache line to proper
> status and reduce cache maintenance overhead.

The code does read the cache line too. Right?

>
> With this change, 12.9% performance uplift was measured on N1SDP
> platform with MLX5 NIC.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>


On the octeontx2 platform, it is a 2% regression.

This looks like a microarchitecture-specific matter of how a write hint
is handled on a memory area that is both read and written.


I am testing the LPM lookup miss case.

My test command:
./build/examples/dpdk-l3fwd  -c 0x0100  -- -p 0x1 --config="(0,0,8)" -P
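
For readers unfamiliar with the options in the command above: `-c` is the EAL hexadecimal core mask, `-p` is the port mask, `-P` enables promiscuous mode, and each `--config` tuple is `(port,queue,lcore)`. The sketch below is a hypothetical helper, not code from l3fwd, that decodes one tuple and checks that its lcore is enabled in the core mask.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One (port,queue,lcore) mapping; hypothetical type for illustration. */
struct rxq_map {
	unsigned int port;
	unsigned int queue;
	unsigned int lcore;
};

/* Parse a single "(port,queue,lcore)" tuple. Returns true on success. */
static bool parse_config(const char *s, struct rxq_map *m)
{
	return sscanf(s, "(%u,%u,%u)", &m->port, &m->queue, &m->lcore) == 3;
}

/* True when bit number lcore is set in the hexadecimal core mask. */
static bool lcore_enabled(uint64_t coremask, unsigned int lcore)
{
	return lcore < 64 && ((coremask >> lcore) & 1u) != 0;
}
```

Applied to the command above, the tuple `(0,0,8)` maps queue 0 of port 0 to lcore 8, and the mask `0x0100` has exactly bit 8 set, so the test runs on a single core.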



> ---
>  examples/l3fwd/l3fwd_lpm_neon.h | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
> index d6c0ba64a..ae8840694 100644
> --- a/examples/l3fwd/l3fwd_lpm_neon.h
> +++ b/examples/l3fwd/l3fwd_lpm_neon.h
> @@ -97,13 +97,13 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>
>         if (k) {
>                 for (i = 0; i < FWDSTEP; i++) {
> -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
> +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[i],
>                                                 struct rte_ether_hdr *) + 1);
>                 }
>
>                 for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
>                         for (i = 0; i < FWDSTEP; i++) {
> -                               rte_prefetch0(rte_pktmbuf_mtod(
> +                               rte_prefetch0_write(rte_pktmbuf_mtod(
>                                                 pkts_burst[j + i + FWDSTEP],
>                                                 struct rte_ether_hdr *) + 1);
>                         }
> @@ -124,17 +124,17 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>                 /* Prefetch last up to 3 packets one by one */
>                 switch (m) {
>                 case 3:
> -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
>                                                 struct rte_ether_hdr *) + 1);
>                         j++;
>                         /* fallthrough */
>                 case 2:
> -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
>                                                 struct rte_ether_hdr *) + 1);
>                         j++;
>                         /* fallthrough */
>                 case 1:
> -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
>                                                 struct rte_ether_hdr *) + 1);
>                         j++;
>                 }
> --
> 2.25.1
>


* Re: [dpdk-dev] [PATCH 4/4] examples/l3fwd: make data struct to be memory efficient
  2021-03-18 10:25 ` [dpdk-dev] [PATCH 4/4] examples/l3fwd: make data struct to be memory efficient Ruifeng Wang
@ 2021-04-13 19:06   ` Jerin Jacob
  2021-04-21  5:22     ` Hemant Agrawal
  0 siblings, 1 reply; 27+ messages in thread
From: Jerin Jacob @ 2021-04-13 19:06 UTC (permalink / raw)
  To: Ruifeng Wang
  Cc: Jerin Jacob, Hemant Agrawal, Ferruh Yigit, Thomas Monjalon,
	David Marchand, dpdk-dev, nd, Honnappa Nagarahalli

On Thu, Mar 18, 2021 at 3:56 PM Ruifeng Wang <ruifeng.wang@arm.com> wrote:
>
> There are some holes in data struct lcore_conf. The holes are
> due to alignment requirement.
>
> For struct lcore_rx_queue, there is no need to make every element
> of this type to be cache line aligned, because the data is not
> shared between cores.
>
> Member len of struct mbuf_table can be moved out. So data can be
> packed and there will be no need to load an extra cache line when
> mbuf table is empty.
>
> The change showed slight performance improvement on N1SDP platform.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>

This change alone is OK on the octeontx2 platform (no difference in performance).
Combining it with 3/4 shows some regression, probably due to prefetch
or 128B cache line tuning specifics.


> ---
>  examples/l3fwd/l3fwd.h        | 12 ++++++------
>  examples/l3fwd/l3fwd_common.h |  4 ++--
>  examples/l3fwd/l3fwd_em.c     |  6 +++---
>  examples/l3fwd/l3fwd_lpm.c    |  6 +++---
>  4 files changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/examples/l3fwd/l3fwd.h b/examples/l3fwd/l3fwd.h
> index 2cf06099e..f3a301e12 100644
> --- a/examples/l3fwd/l3fwd.h
> +++ b/examples/l3fwd/l3fwd.h
> @@ -57,22 +57,22 @@
>  #define HASH_ENTRY_NUMBER_DEFAULT      4
>
>  struct mbuf_table {
> -       uint16_t len;
>         struct rte_mbuf *m_table[MAX_PKT_BURST];
>  };
>
>  struct lcore_rx_queue {
>         uint16_t port_id;
>         uint8_t queue_id;
> -} __rte_cache_aligned;
> +};
>
>  struct lcore_conf {
> -       uint16_t n_rx_queue;
>         struct lcore_rx_queue rx_queue_list[MAX_RX_QUEUE_PER_LCORE];
> -       uint16_t n_tx_port;
>         uint16_t tx_port_id[RTE_MAX_ETHPORTS];
>         uint16_t tx_queue_id[RTE_MAX_ETHPORTS];
> +       uint16_t tx_mbuf_len[RTE_MAX_ETHPORTS];
>         struct mbuf_table tx_mbufs[RTE_MAX_ETHPORTS];
> +       uint16_t n_rx_queue;
> +       uint16_t n_tx_port;
>         void *ipv4_lookup_struct;
>         void *ipv6_lookup_struct;
>  } __rte_cache_aligned;
> @@ -122,7 +122,7 @@ send_single_packet(struct lcore_conf *qconf,
>  {
>         uint16_t len;
>
> -       len = qconf->tx_mbufs[port].len;
> +       len = qconf->tx_mbuf_len[port];
>         qconf->tx_mbufs[port].m_table[len] = m;
>         len++;
>
> @@ -132,7 +132,7 @@ send_single_packet(struct lcore_conf *qconf,
>                 len = 0;
>         }
>
> -       qconf->tx_mbufs[port].len = len;
> +       qconf->tx_mbuf_len[port] = len;
>         return 0;
>  }
>
> diff --git a/examples/l3fwd/l3fwd_common.h b/examples/l3fwd/l3fwd_common.h
> index 7d83ff641..05e03dbfc 100644
> --- a/examples/l3fwd/l3fwd_common.h
> +++ b/examples/l3fwd/l3fwd_common.h
> @@ -183,7 +183,7 @@ send_packetsx4(struct lcore_conf *qconf, uint16_t port, struct rte_mbuf *m[],
>  {
>         uint32_t len, j, n;
>
> -       len = qconf->tx_mbufs[port].len;
> +       len = qconf->tx_mbuf_len[port];
>
>         /*
>          * If TX buffer for that queue is empty, and we have enough packets,
> @@ -258,7 +258,7 @@ send_packetsx4(struct lcore_conf *qconf, uint16_t port, struct rte_mbuf *m[],
>                 }
>         }
>
> -       qconf->tx_mbufs[port].len = len;
> +       qconf->tx_mbuf_len[port] = len;
>  }
>
>  #endif /* _L3FWD_COMMON_H_ */
> diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c
> index 9996bfba3..1970e0376 100644
> --- a/examples/l3fwd/l3fwd_em.c
> +++ b/examples/l3fwd/l3fwd_em.c
> @@ -662,12 +662,12 @@ em_main_loop(__rte_unused void *dummy)
>
>                         for (i = 0; i < qconf->n_tx_port; ++i) {
>                                 portid = qconf->tx_port_id[i];
> -                               if (qconf->tx_mbufs[portid].len == 0)
> +                               if (qconf->tx_mbuf_len[portid] == 0)
>                                         continue;
>                                 send_burst(qconf,
> -                                       qconf->tx_mbufs[portid].len,
> +                                       qconf->tx_mbuf_len[portid],
>                                         portid);
> -                               qconf->tx_mbufs[portid].len = 0;
> +                               qconf->tx_mbuf_len[portid] = 0;
>                         }
>
>                         prev_tsc = cur_tsc;
> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
> index d338590b9..e62139a0e 100644
> --- a/examples/l3fwd/l3fwd_lpm.c
> +++ b/examples/l3fwd/l3fwd_lpm.c
> @@ -220,12 +220,12 @@ lpm_main_loop(__rte_unused void *dummy)
>
>                         for (i = 0; i < n_tx_p; ++i) {
>                                 portid = qconf->tx_port_id[i];
> -                               if (qconf->tx_mbufs[portid].len == 0)
> +                               if (qconf->tx_mbuf_len[portid] == 0)
>                                         continue;
>                                 send_burst(qconf,
> -                                       qconf->tx_mbufs[portid].len,
> +                                       qconf->tx_mbuf_len[portid],
>                                         portid);
> -                               qconf->tx_mbufs[portid].len = 0;
> +                               qconf->tx_mbuf_len[portid] = 0;
>                         }
>
>                         prev_tsc = cur_tsc;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] examples/l3fwd: tune prefetch for better performance
  2021-04-13 18:50   ` Jerin Jacob
@ 2021-04-13 20:00     ` Honnappa Nagarahalli
  2021-05-19  7:52     ` Ruifeng Wang
  1 sibling, 0 replies; 27+ messages in thread
From: Honnappa Nagarahalli @ 2021-04-13 20:00 UTC (permalink / raw)
  To: Jerin Jacob, Ruifeng Wang
  Cc: jerinj, hemant.agrawal, Ferruh Yigit, thomas, David Marchand,
	dpdk-dev, nd, Honnappa Nagarahalli, nd

<snip>

> On Thu, Mar 18, 2021 at 3:56 PM Ruifeng Wang <ruifeng.wang@arm.com>
> wrote:
> >
> > Packet header is prefetched before packet processing for better memory
> > access performance. As L2 header will be updated by l3fwd, using of
> > prefetch for store hint will set cache line to proper status and
> > reduce cache maintenance overhead.
> 
> The code does read the cache line too. Right?
> 
> >
> > With this change, 12.9% performance uplift was measured on N1SDP
> > platform with MLX5 NIC.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> 
> On the octeontx2 platform, it is a 2% regression.
Thanks Jerin for testing.
It would be good to know the results from others with A72 platforms.

> 
> Looks like a microarchitecture-specific matter of how the write hint is
> handled on a memory area that is both read and written.
> 
> 
> I am testing the LPM lookup miss case.
> 
> My test command:
> ./build/examples/dpdk-l3fwd  -c 0x0100  -- -p 0x1 --config="(0,0,8)" -P
> 
> 
> 
> > ---
> >  examples/l3fwd/l3fwd_lpm_neon.h | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
> > index d6c0ba64a..ae8840694 100644
> > --- a/examples/l3fwd/l3fwd_lpm_neon.h
> > +++ b/examples/l3fwd/l3fwd_lpm_neon.h
> > @@ -97,13 +97,13 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> >
> >         if (k) {
> >                 for (i = 0; i < FWDSTEP; i++) {
> > -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
> > +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[i],
> >                                                 struct rte_ether_hdr *) + 1);
> >                 }
> >
> >                 for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
> >                         for (i = 0; i < FWDSTEP; i++) {
> > -                               rte_prefetch0(rte_pktmbuf_mtod(
> > +                               rte_prefetch0_write(rte_pktmbuf_mtod(
> >                                                 pkts_burst[j + i + FWDSTEP],
> >                                                 struct rte_ether_hdr *) + 1);
> >                         }
> > @@ -124,17 +124,17 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> >                 /* Prefetch last up to 3 packets one by one */
> >                 switch (m) {
> >                 case 3:
> > -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> > +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
> >                                                 struct rte_ether_hdr *) + 1);
> >                         j++;
> >                         /* fallthrough */
> >                 case 2:
> > -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> > +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
> >                                                 struct rte_ether_hdr *) + 1);
> >                         j++;
> >                         /* fallthrough */
> >                 case 1:
> > -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> > +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
> >                                                 struct rte_ether_hdr *) + 1);
> >                         j++;
> >                 }
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 3/4] examples/l3fwd: eliminate unnecessary reloads in loop
  2021-04-13 17:43   ` Jerin Jacob
@ 2021-04-14  6:02     ` Ruifeng Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Ruifeng Wang @ 2021-04-14  6:02 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: jerinj, hemant.agrawal, Ferruh Yigit, thomas, David Marchand,
	dpdk-dev, nd, Honnappa Nagarahalli, nd

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, April 14, 2021 1:43 AM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: jerinj@marvell.com; hemant.agrawal@nxp.com; Ferruh Yigit
> <ferruh.yigit@intel.com>; thomas@monjalon.net; David Marchand
> <david.marchand@redhat.com>; dpdk-dev <dev@dpdk.org>; nd
> <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [dpdk-dev] [PATCH 3/4] examples/l3fwd: eliminate unnecessary
> reloads in loop
> 
> On Thu, Mar 18, 2021 at 3:56 PM Ruifeng Wang <ruifeng.wang@arm.com>
> wrote:
> >
> > > The number of rx queues and the number of tx ports in lcore config are
> > > constant while the l3 forward application is running, but the compiler
> > > has no way to know this.
> >
> > Copied values from lcore config to local variables and used the local
> > variables for iteration. Compiler can see that the local variables are
> > not changed, so qconf reloads at each iteration can be eliminated.
> >
> > The change showed 1.8% performance uplift in single core, single port,
> > single queue test on N1SDP platform with MLX5 NIC.
> 
> > At least on octeontx2, I don't see any performance improvement.
> > But the change looks good. Please find a comment below.
> 
> >
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  examples/l3fwd/l3fwd_lpm.c | 10 ++++++----
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
> > index 3dcf1fef1..d338590b9 100644
> > --- a/examples/l3fwd/l3fwd_lpm.c
> > +++ b/examples/l3fwd/l3fwd_lpm.c
> > @@ -190,14 +190,16 @@ lpm_main_loop(__rte_unused void *dummy)
> >         lcore_id = rte_lcore_id();
> >         qconf = &lcore_conf[lcore_id];
> >
> > -       if (qconf->n_rx_queue == 0) {
> > +       uint16_t n_rx_q = qconf->n_rx_queue;
> > +       uint16_t n_tx_p = qconf->n_tx_port;
> 
> How about adding const?

Ack. The values are not expected to be changed.
Will update in next version.
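
For reference, the const-local pattern discussed here can be sketched in
isolation as below. This is only an illustration: the struct layout and
fake_rx_burst() are simplified stand-ins invented for the example, not the
real l3fwd definitions.

```c
#include <stdint.h>

/* Simplified, hypothetical stand-in for struct lcore_conf. */
struct lcore_conf_sketch {
	uint16_t n_rx_queue;
	uint16_t queue_id[16];
};

/* Hypothetical stand-in for an RX burst call; returns a fake packet count. */
static uint32_t fake_rx_burst(uint16_t queue_id)
{
	return (uint32_t)queue_id + 1;
}

static uint32_t poll_all_queues(const struct lcore_conf_sketch *qconf)
{
	/*
	 * Read the loop bound once into a const local. In the real loop the
	 * body calls functions the compiler cannot see through, so it would
	 * otherwise have to assume qconf->n_rx_queue may change and reload it
	 * on every iteration.
	 */
	const uint16_t n_rx_q = qconf->n_rx_queue;
	uint32_t total = 0;
	uint16_t i;

	for (i = 0; i < n_rx_q; i++)
		total += fake_rx_burst(qconf->queue_id[i]);

	return total;
}
```

The const qualifier does not change the generated code by itself; it mainly
documents, and lets the compiler enforce, that the bound is not modified
after it is read.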

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] l3fwd improvements
  2021-03-18 10:25 [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
                   ` (4 preceding siblings ...)
  2021-04-13  8:24 ` [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
@ 2021-04-16  9:39 ` Ling, WeiX
  2021-06-01  7:56 ` [dpdk-dev] [PATCH v2 0/3] " Ruifeng Wang
  2021-06-10  6:57 ` [dpdk-dev] [PATCH v3 0/2] l3fwd improvements Ruifeng Wang
  7 siblings, 0 replies; 27+ messages in thread
From: Ling, WeiX @ 2021-04-16  9:39 UTC (permalink / raw)
  To: Ruifeng Wang, jerinj, hemant.agrawal, Yigit, Ferruh, thomas,
	david.marchand
  Cc: dev, nd, honnappa.nagarahalli

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Ruifeng Wang
> Sent: Thursday, March 18, 2021 06:26 PM
> To: jerinj@marvell.com; hemant.agrawal@nxp.com; Yigit, Ferruh
> <ferruh.yigit@intel.com>; thomas@monjalon.net;
> david.marchand@redhat.com
> Cc: dev@dpdk.org; nd@arm.com; honnappa.nagarahalli@arm.com; Ruifeng
> Wang <ruifeng.wang@arm.com>
> Subject: [dpdk-dev] [PATCH 0/4] l3fwd improvements
> 
Tested-by: Wei Ling <weix.ling@intel.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 4/4] examples/l3fwd: make data struct to be memory efficient
  2021-04-13 19:06   ` Jerin Jacob
@ 2021-04-21  5:22     ` Hemant Agrawal
  2021-04-26 10:55       ` Walsh, Conor
  0 siblings, 1 reply; 27+ messages in thread
From: Hemant Agrawal @ 2021-04-21  5:22 UTC (permalink / raw)
  To: Jerin Jacob, Ruifeng Wang
  Cc: Jerin Jacob, Hemant Agrawal, Ferruh Yigit, Thomas Monjalon,
	David Marchand, dpdk-dev, nd, Honnappa Nagarahalli


On 4/14/2021 12:36 AM, Jerin Jacob wrote:
> On Thu, Mar 18, 2021 at 3:56 PM Ruifeng Wang <ruifeng.wang@arm.com> wrote:
>> There are some holes in data struct lcore_conf. The holes are
>> due to alignment requirement.
>>
>> For struct lcore_rx_queue, there is no need to make every element
>> of this type to be cache line aligned, because the data is not
>> shared between cores.
>>
>> Member len of struct mbuf_table can be moved out. So data can be
>> packed and there will be no need to load an extra cache line when
>> mbuf table is empty.
>>
>> The change showed slight performance improvement on N1SDP platform.
>>
>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>
>> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> This change alone is OK on the octeontx2 platform (no difference in performance).
> Combining it with 3/4 shows some regression, probably due to prefetch
> or 128B cache line tuning specifics.

We checked it on the Layerscape LS2088A platform. There is no difference in
the 1-2 core case, but we observe a ~2% regression with 4-8 cores.
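
To illustrate the padding that the quoted commit message refers to, here is a
minimal sketch of how per-element cache line alignment inflates an array of
small structs. The struct names and the 64-byte line size are assumptions made
for the example (some SoCs use 128 bytes); the fields only loosely mirror
lcore_rx_queue.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SKETCH 64 /* assumed line size for this sketch */

/* Every element is padded out to a full cache line. */
struct rxq_aligned_sketch {
	alignas(CACHE_LINE_SKETCH) uint16_t port_id;
	uint8_t queue_id;
};

/* Same fields with natural alignment only. */
struct rxq_packed_sketch {
	uint16_t port_id;
	uint8_t queue_id;
};

/* Bytes occupied by an array of n cache-aligned elements. */
static size_t rxq_array_bytes_aligned(size_t n)
{
	return n * sizeof(struct rxq_aligned_sketch);
}

/* Bytes occupied by an array of n naturally aligned elements. */
static size_t rxq_array_bytes_packed(size_t n)
{
	return n * sizeof(struct rxq_packed_sketch);
}
```

With 16 queues, the aligned variant spans 16 cache lines while the packed one
fits in a single line. Whether that helps or hurts in practice depends on the
platform, as the numbers in this thread show.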

Regards,

Hemant



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 4/4] examples/l3fwd: make data struct to be memory efficient
  2021-04-21  5:22     ` Hemant Agrawal
@ 2021-04-26 10:55       ` Walsh, Conor
  2021-04-27  1:19         ` Ruifeng Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Walsh, Conor @ 2021-04-26 10:55 UTC (permalink / raw)
  To: hemant.agrawal, Jerin Jacob, Ruifeng Wang
  Cc: Jerin Jacob, Yigit, Ferruh, Thomas Monjalon, David Marchand,
	dpdk-dev, nd, Honnappa Nagarahalli

<snip>
> >> There are some holes in data struct lcore_conf. The holes are
> >> due to alignment requirement.
> >>
> >> For struct lcore_rx_queue, there is no need to make every element
> >> of this type to be cache line aligned, because the data is not
> >> shared between cores.
> >>
> >> Member len of struct mbuf_table can be moved out. So data can be
> >> packed and there will be no need to load an extra cache line when
> >> mbuf table is empty.
> >>
> >> The change showed slight performance improvement on N1SDP platform.
> >>
> >> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>
> >> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > This change alone is OK in the octeontx2 platform.(No difference in
> performance)
> > combining with 3/4 shows some regression. Probably is due to prefetch
> > or 128B cache line tuning specifics.
> 
> We checked it on Layerscape LS2088A platform. No difference for 1-2 core
> case. However observing ~2% regression for 4-8 cores.
> 
> Regards,
> 
> Hemant
> 

Hi Ruifeng,

l3fwd will no longer build with this patch as you have changed a struct used by the FIB lookup method.
This patch will need to be updated to also update the FIB lookup method as you have done with EM and LPM.

Thanks,
Conor.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 4/4] examples/l3fwd: make data struct to be memory efficient
  2021-04-26 10:55       ` Walsh, Conor
@ 2021-04-27  1:19         ` Ruifeng Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Ruifeng Wang @ 2021-04-27  1:19 UTC (permalink / raw)
  To: Walsh, Conor, hemant.agrawal, Jerin Jacob
  Cc: jerinj, Yigit, Ferruh, thomas, David Marchand, dpdk-dev, nd,
	Honnappa Nagarahalli, nd

> -----Original Message-----
> From: Walsh, Conor <conor.walsh@intel.com>
> Sent: Monday, April 26, 2021 6:55 PM
> To: hemant.agrawal@nxp.com; Jerin Jacob <jerinjacobk@gmail.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: jerinj@marvell.com; Yigit, Ferruh <ferruh.yigit@intel.com>;
> thomas@monjalon.net; David Marchand <david.marchand@redhat.com>;
> dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [dpdk-dev] [PATCH 4/4] examples/l3fwd: make data struct to be
> memory efficient
> 
> <snip>
> > >> There are some holes in data struct lcore_conf. The holes are due
> > >> to alignment requirement.
> > >>
> > >> For struct lcore_rx_queue, there is no need to make every element
> > >> of this type to be cache line aligned, because the data is not
> > >> shared between cores.
> > >>
> > >> Member len of struct mbuf_table can be moved out. So data can be
> > >> packed and there will be no need to load an extra cache line when
> > >> mbuf table is empty.
> > >>
> > >> The change showed slight performance improvement on N1SDP
> platform.
> > >>
> > >> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >>
> > >> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > This change alone is OK in the octeontx2 platform.(No difference in
> > performance)
> > > combining with 3/4 shows some regression. Probably is due to
> > > prefetch or 128B cache line tuning specifics.
> >
> > We checked it on Layerscape LS2088A platform. No difference for 1-2
> > core case. However observing ~2% regression for 4-8 cores.
> >
> > Regards,
> >
> > Hemant
> >
> 
> Hi Ruifeng,

Hi Conor,
> 
> l3fwd will no longer build with this patch as you have changed a struct used
> by the FIB lookup method.
> This patch will need to be updated to also update the FIB lookup method as
> you have done with EM and LPM.

Thanks for the comments.
I will provide v2 with updates. I'm considering dropping this patch from the v2 series.
> 
> Thanks,
> Conor.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] examples/l3fwd: tune prefetch for better performance
  2021-04-13 18:50   ` Jerin Jacob
  2021-04-13 20:00     ` Honnappa Nagarahalli
@ 2021-05-19  7:52     ` Ruifeng Wang
  1 sibling, 0 replies; 27+ messages in thread
From: Ruifeng Wang @ 2021-05-19  7:52 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: jerinj, hemant.agrawal, Ferruh Yigit, thomas, David Marchand,
	dpdk-dev, nd, Honnappa Nagarahalli, nd

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, April 14, 2021 2:51 AM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: jerinj@marvell.com; hemant.agrawal@nxp.com; Ferruh Yigit
> <ferruh.yigit@intel.com>; thomas@monjalon.net; David Marchand
> <david.marchand@redhat.com>; dpdk-dev <dev@dpdk.org>; nd
> <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [dpdk-dev] [PATCH 1/4] examples/l3fwd: tune prefetch for
> better performance
> 
> On Thu, Mar 18, 2021 at 3:56 PM Ruifeng Wang <ruifeng.wang@arm.com>
> wrote:
> >
> > Packet header is prefetched before packet processing for better memory
> > access performance. As L2 header will be updated by l3fwd, using of
> > prefetch for store hint will set cache line to proper status and
> > reduce cache maintenance overhead.
> 
> The code does read the cache line too. Right?
> 
Yes, the code also reads the cache line.
Prefetching with a write hint helps the later stores: it saves snooping cost.

> >
> > With this change, 12.9% performance uplift was measured on N1SDP
> > platform with MLX5 NIC.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> 
> On the octeontx2 platform, it is a 2% regression.
> 
> Looks like a microarchitecture-specific matter of how the write hint is
> handled on a memory area that is both read and written.
> 
OK. The performance impact of the write hint may differ across
microarchitecture implementations.
The 12.9% uplift measured on N1SDP is too large to ignore. How about using a
flag to select which prefetch variant is used on each SoC?
A compile-time flag such as RTE_USE_PREFETCH_WRITE could be introduced, and
SoCs would enable it as needed.
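
A minimal sketch of what such a compile-time switch could look like is below.
It is an assumption for illustration, not part of the posted patches: the
wrapper names are hypothetical, and the GCC/Clang __builtin_prefetch builtin is
used only to keep the example self-contained, where l3fwd itself would call
rte_prefetch0() or rte_prefetch0_write().

```c
/*
 * Hypothetical build-time switch, e.g. -DRTE_USE_PREFETCH_WRITE set from a
 * per-SoC build configuration. The flag name follows the suggestion above.
 */
#ifdef RTE_USE_PREFETCH_WRITE
#define L3FWD_PREFETCH_FOR_WRITE 1
#else
#define L3FWD_PREFETCH_FOR_WRITE 0
#endif

/* Prefetch a packet header with the hint selected at build time. */
static inline void l3fwd_prefetch_hdr(const void *hdr)
{
	/*
	 * Second argument: 1 = prepare for write, 0 = prepare for read.
	 * Third argument: 3 = high temporal locality.
	 */
	__builtin_prefetch(hdr, L3FWD_PREFETCH_FOR_WRITE, 3);
}

/* Report which hint was compiled in (useful for logging or tests). */
static inline int l3fwd_prefetch_is_write(void)
{
	return L3FWD_PREFETCH_FOR_WRITE;
}
```

With the flag undefined, the wrapper behaves like the existing read prefetch,
so platforms that regress with the write hint would be unaffected.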

> 
> I am testing the LPM lookup miss case.
> 
> My test command:
> ./build/examples/dpdk-l3fwd  -c 0x0100  -- -p 0x1 --config="(0,0,8)" -P
> 
> 
> 
> > ---
> >  examples/l3fwd/l3fwd_lpm_neon.h | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
> > index d6c0ba64a..ae8840694 100644
> > --- a/examples/l3fwd/l3fwd_lpm_neon.h
> > +++ b/examples/l3fwd/l3fwd_lpm_neon.h
> > @@ -97,13 +97,13 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> >
> >         if (k) {
> >                 for (i = 0; i < FWDSTEP; i++) {
> > -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
> > +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[i],
> >                                                 struct rte_ether_hdr *) + 1);
> >                 }
> >
> >                 for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
> >                         for (i = 0; i < FWDSTEP; i++) {
> > -                               rte_prefetch0(rte_pktmbuf_mtod(
> > +                               rte_prefetch0_write(rte_pktmbuf_mtod(
> >                                                 pkts_burst[j + i + FWDSTEP],
> >                                                 struct rte_ether_hdr *) + 1);
> >                         }
> > @@ -124,17 +124,17 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> >                 /* Prefetch last up to 3 packets one by one */
> >                 switch (m) {
> >                 case 3:
> > -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> > +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
> >                                                 struct rte_ether_hdr *) + 1);
> >                         j++;
> >                         /* fallthrough */
> >                 case 2:
> > -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> > +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
> >                                                 struct rte_ether_hdr *) + 1);
> >                         j++;
> >                         /* fallthrough */
> >                 case 1:
> > -                       rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> > +                       rte_prefetch0_write(rte_pktmbuf_mtod(pkts_burst[j],
> >                                                 struct rte_ether_hdr *) + 1);
> >                         j++;
> >                 }
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v2 0/3] l3fwd improvements
  2021-03-18 10:25 [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
                   ` (5 preceding siblings ...)
  2021-04-16  9:39 ` Ling, WeiX
@ 2021-06-01  7:56 ` Ruifeng Wang
  2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 1/3] examples/l3fwd: reorganize code for better performance Ruifeng Wang
                     ` (2 more replies)
  2021-06-10  6:57 ` [dpdk-dev] [PATCH v3 0/2] l3fwd improvements Ruifeng Wang
  7 siblings, 3 replies; 27+ messages in thread
From: Ruifeng Wang @ 2021-06-01  7:56 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang

This series of patches include changes to l3fwd example application.
Some improvements are made for better usage of CPU cycles and memory.

v2:
Dropped 1/4 prefetch to write change from v1.
Dropped 4/4 data struct change from v1.
Added 1/3 code reorganize.
Updated 3/3 to add 'const'. (Jerin)

Ruifeng Wang (3):
  examples/l3fwd: reorganize code for better performance
  examples/l3fwd: eliminate unnecessary calculations
  examples/l3fwd: eliminate unnecessary reloads in loop

 examples/l3fwd/l3fwd_lpm.c      | 10 ++++++----
 examples/l3fwd/l3fwd_lpm_neon.h | 10 +++++-----
 examples/l3fwd/l3fwd_neon.h     | 10 +++++-----
 3 files changed, 16 insertions(+), 14 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v2 1/3] examples/l3fwd: reorganize code for better performance
  2021-06-01  7:56 ` [dpdk-dev] [PATCH v2 0/3] " Ruifeng Wang
@ 2021-06-01  7:56   ` Ruifeng Wang
  2021-06-06 18:34     ` Jerin Jacob
  2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 2/3] examples/l3fwd: eliminate unnecessary calculations Ruifeng Wang
  2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 3/3] examples/l3fwd: eliminate unnecessary reloads in loop Ruifeng Wang
  2 siblings, 1 reply; 27+ messages in thread
From: Ruifeng Wang @ 2021-06-01  7:56 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang

Moved the rfc1812_process() calls ahead of the NEON register stores.
On N1SDP, this reorganization mitigates CPU frontend and backend
stalls when forwarding.

On N1SDP with MLX5 40G NIC, this change showed 10.2% performance gain
in single port single core MRR test.
On ThunderX2, this change showed no performance degradation.

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 examples/l3fwd/l3fwd_neon.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
index 86ac5971d7..ea7fe22d00 100644
--- a/examples/l3fwd/l3fwd_neon.h
+++ b/examples/l3fwd/l3fwd_neon.h
@@ -43,11 +43,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
 	ve[2] = vsetq_lane_u32(vgetq_lane_u32(te[2], 3), ve[2], 3);
 	ve[3] = vsetq_lane_u32(vgetq_lane_u32(te[3], 3), ve[3], 3);
 
-	vst1q_u32(p[0], ve[0]);
-	vst1q_u32(p[1], ve[1]);
-	vst1q_u32(p[2], ve[2]);
-	vst1q_u32(p[3], ve[3]);
-
 	rfc1812_process((struct rte_ipv4_hdr *)
 			((struct rte_ether_hdr *)p[0] + 1),
 			&dst_port[0], pkt[0]->packet_type);
@@ -60,6 +55,11 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
 	rfc1812_process((struct rte_ipv4_hdr *)
 			((struct rte_ether_hdr *)p[3] + 1),
 			&dst_port[3], pkt[3]->packet_type);
+
+	vst1q_u32(p[0], ve[0]);
+	vst1q_u32(p[1], ve[1]);
+	vst1q_u32(p[2], ve[2]);
+	vst1q_u32(p[3], ve[3]);
 }
 
 /*
-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v2 2/3] examples/l3fwd: eliminate unnecessary calculations
  2021-06-01  7:56 ` [dpdk-dev] [PATCH v2 0/3] " Ruifeng Wang
  2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 1/3] examples/l3fwd: reorganize code for better performance Ruifeng Wang
@ 2021-06-01  7:56   ` Ruifeng Wang
  2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 3/3] examples/l3fwd: eliminate unnecessary reloads in loop Ruifeng Wang
  2 siblings, 0 replies; 27+ messages in thread
From: Ruifeng Wang @ 2021-06-01  7:56 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang

Both L2 and L3 headers are used in forward processing, and the two
headers are in the same cache line. Prefetching with the L2 header
address therefore has the same effect as prefetching with the L3
header address.

Changed to use L2 header address for prefetching. The change showed
no measurable performance improvement, but it removes the unnecessary
instructions for address calculation.

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 examples/l3fwd/l3fwd_lpm_neon.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
index d6c0ba64ab..78ee83b76c 100644
--- a/examples/l3fwd/l3fwd_lpm_neon.h
+++ b/examples/l3fwd/l3fwd_lpm_neon.h
@@ -98,14 +98,14 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	if (k) {
 		for (i = 0; i < FWDSTEP; i++) {
 			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
-						struct rte_ether_hdr *) + 1);
+							void *));
 		}
 
 		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
 			for (i = 0; i < FWDSTEP; i++) {
 				rte_prefetch0(rte_pktmbuf_mtod(
 						pkts_burst[j + i + FWDSTEP],
-						struct rte_ether_hdr *) + 1);
+						void *));
 			}
 
 			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
@@ -125,17 +125,17 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 		switch (m) {
 		case 3:
 			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-						struct rte_ether_hdr *) + 1);
+							void *));
 			j++;
 			/* fallthrough */
 		case 2:
 			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-						struct rte_ether_hdr *) + 1);
+							void *));
 			j++;
 			/* fallthrough */
 		case 1:
 			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-						struct rte_ether_hdr *) + 1);
+							void *));
 			j++;
 		}
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v2 3/3] examples/l3fwd: eliminate unnecessary reloads in loop
  2021-06-01  7:56 ` [dpdk-dev] [PATCH v2 0/3] " Ruifeng Wang
  2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 1/3] examples/l3fwd: reorganize code for better performance Ruifeng Wang
  2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 2/3] examples/l3fwd: eliminate unnecessary calculations Ruifeng Wang
@ 2021-06-01  7:56   ` Ruifeng Wang
  2021-06-06 18:39     ` Jerin Jacob
  2 siblings, 1 reply; 27+ messages in thread
From: Ruifeng Wang @ 2021-06-01  7:56 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang, Ola Liljedahl

The number of rx queues and the number of tx ports in lcore config are
constant while the l3 forward application is running, but the compiler
has no way to know this.

Copied values from lcore config to local variables and used the local
variables for iteration. Compiler can see that the local variables are
not changed, so qconf reloads at each iteration can be eliminated.

The change showed 1.8% performance uplift in single core, single port,
single queue test on N1SDP platform with MLX5 NIC.

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 examples/l3fwd/l3fwd_lpm.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
index 427c72b1d2..ff1c18a442 100644
--- a/examples/l3fwd/l3fwd_lpm.c
+++ b/examples/l3fwd/l3fwd_lpm.c
@@ -154,14 +154,16 @@ lpm_main_loop(__rte_unused void *dummy)
 	lcore_id = rte_lcore_id();
 	qconf = &lcore_conf[lcore_id];
 
-	if (qconf->n_rx_queue == 0) {
+	const uint16_t n_rx_q = qconf->n_rx_queue;
+	const uint16_t n_tx_p = qconf->n_tx_port;
+	if (n_rx_q == 0) {
 		RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id);
 		return 0;
 	}
 
 	RTE_LOG(INFO, L3FWD, "entering main loop on lcore %u\n", lcore_id);
 
-	for (i = 0; i < qconf->n_rx_queue; i++) {
+	for (i = 0; i < n_rx_q; i++) {
 
 		portid = qconf->rx_queue_list[i].port_id;
 		queueid = qconf->rx_queue_list[i].queue_id;
@@ -181,7 +183,7 @@ lpm_main_loop(__rte_unused void *dummy)
 		diff_tsc = cur_tsc - prev_tsc;
 		if (unlikely(diff_tsc > drain_tsc)) {
 
-			for (i = 0; i < qconf->n_tx_port; ++i) {
+			for (i = 0; i < n_tx_p; ++i) {
 				portid = qconf->tx_port_id[i];
 				if (qconf->tx_mbufs[portid].len == 0)
 					continue;
@@ -197,7 +199,7 @@ lpm_main_loop(__rte_unused void *dummy)
 		/*
 		 * Read packet from RX queues
 		 */
-		for (i = 0; i < qconf->n_rx_queue; ++i) {
+		for (i = 0; i < n_rx_q; ++i) {
 			portid = qconf->rx_queue_list[i].port_id;
 			queueid = qconf->rx_queue_list[i].queue_id;
 			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/3] examples/l3fwd: reorganize code for better performance
  2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 1/3] examples/l3fwd: reorganize code for better performance Ruifeng Wang
@ 2021-06-06 18:34     ` Jerin Jacob
  0 siblings, 0 replies; 27+ messages in thread
From: Jerin Jacob @ 2021-06-06 18:34 UTC (permalink / raw)
  To: Ruifeng Wang
  Cc: Jerin Jacob, Hemant Agrawal, Ferruh Yigit, Thomas Monjalon,
	David Marchand, dpdk-dev, nd, Honnappa Nagarahalli

On Tue, Jun 1, 2021 at 1:27 PM Ruifeng Wang <ruifeng.wang@arm.com> wrote:
>
> Moved rfc1812 process prior to NEON registers store.
> On N1SDP, this reorganization mitigates CPU frontend stall and backend
> stall when forwarding.
>
> On N1SDP with MLX5 40G NIC, this change showed 10.2% performance gain
> in single port single core MRR test.

I think it may not have anything to do with N1SDP. It could just be the
prefetch window timing with the MLX5 driver touching the Tx mbuf in
tx_burst(), combined with L1 cache pressure timing.
I think tuning the driver parameters could shift the window to some driver code.

On octeontx2, this change shows a 3.1% regression in the flow lookup miss
case, so NACK.


> On ThunderX2, this change showed no performance degradation.
>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  examples/l3fwd/l3fwd_neon.h | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> index 86ac5971d7..ea7fe22d00 100644
> --- a/examples/l3fwd/l3fwd_neon.h
> +++ b/examples/l3fwd/l3fwd_neon.h
> @@ -43,11 +43,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
>         ve[2] = vsetq_lane_u32(vgetq_lane_u32(te[2], 3), ve[2], 3);
>         ve[3] = vsetq_lane_u32(vgetq_lane_u32(te[3], 3), ve[3], 3);
>
> -       vst1q_u32(p[0], ve[0]);
> -       vst1q_u32(p[1], ve[1]);
> -       vst1q_u32(p[2], ve[2]);
> -       vst1q_u32(p[3], ve[3]);
> -
>         rfc1812_process((struct rte_ipv4_hdr *)
>                         ((struct rte_ether_hdr *)p[0] + 1),
>                         &dst_port[0], pkt[0]->packet_type);
> @@ -60,6 +55,11 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
>         rfc1812_process((struct rte_ipv4_hdr *)
>                         ((struct rte_ether_hdr *)p[3] + 1),
>                         &dst_port[3], pkt[3]->packet_type);
> +
> +       vst1q_u32(p[0], ve[0]);
> +       vst1q_u32(p[1], ve[1]);
> +       vst1q_u32(p[2], ve[2]);
> +       vst1q_u32(p[3], ve[3]);
>  }
>
>  /*
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/3] examples/l3fwd: eliminate unnecessary reloads in loop
  2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 3/3] examples/l3fwd: eliminate unnecessary reloads in loop Ruifeng Wang
@ 2021-06-06 18:39     ` Jerin Jacob
  0 siblings, 0 replies; 27+ messages in thread
From: Jerin Jacob @ 2021-06-06 18:39 UTC (permalink / raw)
  To: Ruifeng Wang
  Cc: Jerin Jacob, Hemant Agrawal, Ferruh Yigit, Thomas Monjalon,
	David Marchand, dpdk-dev, nd, Honnappa Nagarahalli,
	Ola Liljedahl

On Tue, Jun 1, 2021 at 1:27 PM Ruifeng Wang <ruifeng.wang@arm.com> wrote:
>
> The number of rx queues and the number of tx ports in the lcore config
> are constant while the l3 forward application is running, but the
> compiler does not have this information.
>
> Copied values from lcore config to local variables and used the local
> variables for iteration. Compiler can see that the local variables are
> not changed, so qconf reloads at each iteration can be eliminated.
>
> The change showed 1.8% performance uplift in single core, single port,
> single queue test on N1SDP platform with MLX5 NIC.
>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

No performance regression with octeontx2.

Acked-by: Jerin Jacob <jerinj@marvell.com>




> ---
>  examples/l3fwd/l3fwd_lpm.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
> index 427c72b1d2..ff1c18a442 100644
> --- a/examples/l3fwd/l3fwd_lpm.c
> +++ b/examples/l3fwd/l3fwd_lpm.c
> @@ -154,14 +154,16 @@ lpm_main_loop(__rte_unused void *dummy)
>         lcore_id = rte_lcore_id();
>         qconf = &lcore_conf[lcore_id];
>
> -       if (qconf->n_rx_queue == 0) {
> +       const uint16_t n_rx_q = qconf->n_rx_queue;
> +       const uint16_t n_tx_p = qconf->n_tx_port;
> +       if (n_rx_q == 0) {
>                 RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id);
>                 return 0;
>         }
>
>         RTE_LOG(INFO, L3FWD, "entering main loop on lcore %u\n", lcore_id);
>
> -       for (i = 0; i < qconf->n_rx_queue; i++) {
> +       for (i = 0; i < n_rx_q; i++) {
>
>                 portid = qconf->rx_queue_list[i].port_id;
>                 queueid = qconf->rx_queue_list[i].queue_id;
> @@ -181,7 +183,7 @@ lpm_main_loop(__rte_unused void *dummy)
>                 diff_tsc = cur_tsc - prev_tsc;
>                 if (unlikely(diff_tsc > drain_tsc)) {
>
> -                       for (i = 0; i < qconf->n_tx_port; ++i) {
> +                       for (i = 0; i < n_tx_p; ++i) {
>                                 portid = qconf->tx_port_id[i];
>                                 if (qconf->tx_mbufs[portid].len == 0)
>                                         continue;
> @@ -197,7 +199,7 @@ lpm_main_loop(__rte_unused void *dummy)
>                 /*
>                  * Read packet from RX queues
>                  */
> -               for (i = 0; i < qconf->n_rx_queue; ++i) {
> +               for (i = 0; i < n_rx_q; ++i) {
>                         portid = qconf->rx_queue_list[i].port_id;
>                         queueid = qconf->rx_queue_list[i].queue_id;
>                         nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 0/2] l3fwd improvements
  2021-03-18 10:25 [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
                   ` (6 preceding siblings ...)
  2021-06-01  7:56 ` [dpdk-dev] [PATCH v2 0/3] " Ruifeng Wang
@ 2021-06-10  6:57 ` Ruifeng Wang
  2021-06-10  6:57   ` [dpdk-dev] [PATCH v3 1/2] examples/l3fwd: eliminate unnecessary calculations Ruifeng Wang
  2021-06-10  6:57   ` [dpdk-dev] [PATCH v3 2/2] examples/l3fwd: eliminate unnecessary reloads in loop Ruifeng Wang
  7 siblings, 2 replies; 27+ messages in thread
From: Ruifeng Wang @ 2021-06-10  6:57 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang

This series of patches includes changes to the l3fwd example application.
Some improvements are made for better usage of CPU cycles and memory.

v3:
Dropped 1/3 code reorg from v2.

v2:
Dropped 1/4 prefetch to write change from v1.
Dropped 4/4 data struct change from v1.
Added 1/3 code reorganization.
Updated 3/3 to add 'const'. (Jerin)

Ruifeng Wang (2):
  examples/l3fwd: eliminate unnecessary calculations
  examples/l3fwd: eliminate unnecessary reloads in loop

 examples/l3fwd/l3fwd_lpm.c      | 10 ++++++----
 examples/l3fwd/l3fwd_lpm_neon.h | 10 +++++-----
 2 files changed, 11 insertions(+), 9 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 1/2] examples/l3fwd: eliminate unnecessary calculations
  2021-06-10  6:57 ` [dpdk-dev] [PATCH v3 0/2] l3fwd improvements Ruifeng Wang
@ 2021-06-10  6:57   ` Ruifeng Wang
  2021-06-10  6:57   ` [dpdk-dev] [PATCH v3 2/2] examples/l3fwd: eliminate unnecessary reloads in loop Ruifeng Wang
  1 sibling, 0 replies; 27+ messages in thread
From: Ruifeng Wang @ 2021-06-10  6:57 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang

Both the L2 and L3 headers are used in forward processing, and the two
headers are in the same cache line, so prefetching with the L2 header
address has the same effect as prefetching with the L3 header address.

Changed to use the L2 header address for prefetching. The change showed
no measurable performance improvement, but it does remove the
unnecessary instructions for address calculation.
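
The same-cache-line argument can be checked with a little arithmetic. The
sketch below is illustrative only and is not part of the patch: the 64-byte
line size, the 14-byte untagged Ethernet header, and the helper names
same_cache_line() and l2_l3_share_line() are all assumptions. It also shows
that the claim depends on where the packet data starts: if the L2 header sits
close to the end of a line, the L3 header falls in the next one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; real targets may differ. */
#define CACHE_LINE_SIZE 64u	/* bytes; some platforms use 128 */
#define ETHER_HDR_LEN   14u	/* untagged Ethernet (L2) header */

/* Return true when both addresses fall within the same cache line. */
static bool
same_cache_line(uintptr_t a, uintptr_t b, size_t line_size)
{
	/* line_size must be a non-zero power of two */
	assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
	return (a / line_size) == (b / line_size);
}

/*
 * Hypothetical helper: does the start of the L3 header share a cache
 * line with the start of the L2 header located at l2_addr?
 */
static bool
l2_l3_share_line(uintptr_t l2_addr)
{
	return same_cache_line(l2_addr, l2_addr + ETHER_HDR_LEN,
			       CACHE_LINE_SIZE);
}
```

With packet data starting on a cache-line boundary, l2_l3_share_line()
returns true, which is the situation the commit message describes. For an
L2 header starting 56 bytes into a line it returns false.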

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 examples/l3fwd/l3fwd_lpm_neon.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
index d6c0ba64ab..78ee83b76c 100644
--- a/examples/l3fwd/l3fwd_lpm_neon.h
+++ b/examples/l3fwd/l3fwd_lpm_neon.h
@@ -98,14 +98,14 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	if (k) {
 		for (i = 0; i < FWDSTEP; i++) {
 			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
-						struct rte_ether_hdr *) + 1);
+							void *));
 		}
 
 		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
 			for (i = 0; i < FWDSTEP; i++) {
 				rte_prefetch0(rte_pktmbuf_mtod(
 						pkts_burst[j + i + FWDSTEP],
-						struct rte_ether_hdr *) + 1);
+						void *));
 			}
 
 			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
@@ -125,17 +125,17 @@ l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 		switch (m) {
 		case 3:
 			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-						struct rte_ether_hdr *) + 1);
+							void *));
 			j++;
 			/* fallthrough */
 		case 2:
 			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-						struct rte_ether_hdr *) + 1);
+							void *));
 			j++;
 			/* fallthrough */
 		case 1:
 			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-						struct rte_ether_hdr *) + 1);
+							void *));
 			j++;
 		}
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 2/2] examples/l3fwd: eliminate unnecessary reloads in loop
  2021-06-10  6:57 ` [dpdk-dev] [PATCH v3 0/2] l3fwd improvements Ruifeng Wang
  2021-06-10  6:57   ` [dpdk-dev] [PATCH v3 1/2] examples/l3fwd: eliminate unnecessary calculations Ruifeng Wang
@ 2021-06-10  6:57   ` Ruifeng Wang
  1 sibling, 0 replies; 27+ messages in thread
From: Ruifeng Wang @ 2021-06-10  6:57 UTC (permalink / raw)
  To: jerinj, hemant.agrawal, ferruh.yigit, thomas, david.marchand
  Cc: dev, nd, honnappa.nagarahalli, Ruifeng Wang

The number of rx queues and the number of tx ports in the lcore config
are constant while the l3 forward application is running, but the
compiler does not have this information.

Copied values from lcore config to local variables and used the local
variables for iteration. Compiler can see that the local variables are
not changed, so qconf reloads at each iteration can be eliminated.

The change showed 1.8% performance uplift in single core, single port,
single queue test on N1SDP platform with MLX5 NIC.
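
A minimal sketch of the pattern, outside of l3fwd, may help reviewers who
want to inspect the generated code. Everything here is hypothetical and not
taken from the patch: struct lcore_conf_sketch, poll_all_queues() and
count_one() are stand-ins. The point is that a loop bound read through a
pointer generally has to be reloaded after each opaque call, because the
compiler cannot prove the call leaves it unchanged, while a const local
copy can stay in a register.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RX_QUEUES 16

/* Hypothetical stand-in for the per-lcore configuration. */
struct lcore_conf_sketch {
	uint16_t n_rx_queue;
	uint16_t rx_queue_list[MAX_RX_QUEUES];
};

/* Stand-in for an opaque per-queue call such as an rx burst. */
typedef uint16_t (*poll_fn)(uint16_t queue_id);

/* Trivial poll function for illustration: one packet per queue. */
uint16_t
count_one(uint16_t queue_id)
{
	(void)queue_id;
	return 1;
}

uint32_t
poll_all_queues(const struct lcore_conf_sketch *conf, poll_fn poll)
{
	/*
	 * Snapshot the bound once. Without this copy the compiler may
	 * reload conf->n_rx_queue on every iteration, since it usually
	 * cannot see whether poll() modifies the structure.
	 */
	const uint16_t n_rx_q = conf->n_rx_queue;
	uint32_t total = 0;
	uint16_t i;

	assert(n_rx_q <= MAX_RX_QUEUES);
	for (i = 0; i < n_rx_q; i++)
		total += poll(conf->rx_queue_list[i]);

	return total;
}
```

Whether the reload is actually removed depends on the compiler and the
optimization level, so comparing the disassembly with and without the local
copy is the reliable way to confirm it on a given target.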

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 examples/l3fwd/l3fwd_lpm.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
index 427c72b1d2..ff1c18a442 100644
--- a/examples/l3fwd/l3fwd_lpm.c
+++ b/examples/l3fwd/l3fwd_lpm.c
@@ -154,14 +154,16 @@ lpm_main_loop(__rte_unused void *dummy)
 	lcore_id = rte_lcore_id();
 	qconf = &lcore_conf[lcore_id];
 
-	if (qconf->n_rx_queue == 0) {
+	const uint16_t n_rx_q = qconf->n_rx_queue;
+	const uint16_t n_tx_p = qconf->n_tx_port;
+	if (n_rx_q == 0) {
 		RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id);
 		return 0;
 	}
 
 	RTE_LOG(INFO, L3FWD, "entering main loop on lcore %u\n", lcore_id);
 
-	for (i = 0; i < qconf->n_rx_queue; i++) {
+	for (i = 0; i < n_rx_q; i++) {
 
 		portid = qconf->rx_queue_list[i].port_id;
 		queueid = qconf->rx_queue_list[i].queue_id;
@@ -181,7 +183,7 @@ lpm_main_loop(__rte_unused void *dummy)
 		diff_tsc = cur_tsc - prev_tsc;
 		if (unlikely(diff_tsc > drain_tsc)) {
 
-			for (i = 0; i < qconf->n_tx_port; ++i) {
+			for (i = 0; i < n_tx_p; ++i) {
 				portid = qconf->tx_port_id[i];
 				if (qconf->tx_mbufs[portid].len == 0)
 					continue;
@@ -197,7 +199,7 @@ lpm_main_loop(__rte_unused void *dummy)
 		/*
 		 * Read packet from RX queues
 		 */
-		for (i = 0; i < qconf->n_rx_queue; ++i) {
+		for (i = 0; i < n_rx_q; ++i) {
 			portid = qconf->rx_queue_list[i].port_id;
 			queueid = qconf->rx_queue_list[i].queue_id;
 			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-06-10  6:58 UTC | newest]

Thread overview: 27+ messages
2021-03-18 10:25 [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
2021-03-18 10:25 ` [dpdk-dev] [PATCH 1/4] examples/l3fwd: tune prefetch for better performance Ruifeng Wang
2021-04-13 18:50   ` Jerin Jacob
2021-04-13 20:00     ` Honnappa Nagarahalli
2021-05-19  7:52     ` Ruifeng Wang
2021-03-18 10:25 ` [dpdk-dev] [PATCH 2/4] examples/l3fwd: eliminate unnecessary calculations Ruifeng Wang
2021-04-13 18:40   ` Jerin Jacob
2021-03-18 10:25 ` [dpdk-dev] [PATCH 3/4] examples/l3fwd: eliminate unnecessary reloads in loop Ruifeng Wang
2021-04-13 17:43   ` Jerin Jacob
2021-04-14  6:02     ` Ruifeng Wang
2021-03-18 10:25 ` [dpdk-dev] [PATCH 4/4] examples/l3fwd: make data struct to be memory efficient Ruifeng Wang
2021-04-13 19:06   ` Jerin Jacob
2021-04-21  5:22     ` Hemant Agrawal
2021-04-26 10:55       ` Walsh, Conor
2021-04-27  1:19         ` Ruifeng Wang
2021-04-13  8:24 ` [dpdk-dev] [PATCH 0/4] l3fwd improvements Ruifeng Wang
2021-04-13 17:33   ` Jerin Jacob
2021-04-16  9:39 ` Ling, WeiX
2021-06-01  7:56 ` [dpdk-dev] [PATCH v2 0/3] " Ruifeng Wang
2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 1/3] examples/l3fwd: reorganize code for better performance Ruifeng Wang
2021-06-06 18:34     ` Jerin Jacob
2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 2/3] examples/l3fwd: eliminate unnecessary calculations Ruifeng Wang
2021-06-01  7:56   ` [dpdk-dev] [PATCH v2 3/3] examples/l3fwd: eliminate unnecessary reloads in loop Ruifeng Wang
2021-06-06 18:39     ` Jerin Jacob
2021-06-10  6:57 ` [dpdk-dev] [PATCH v3 0/2] l3fwd improvements Ruifeng Wang
2021-06-10  6:57   ` [dpdk-dev] [PATCH v3 1/2] examples/l3fwd: eliminate unnecessary calculations Ruifeng Wang
2021-06-10  6:57   ` [dpdk-dev] [PATCH v3 2/2] examples/l3fwd: eliminate unnecessary reloads in loop Ruifeng Wang
