From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 9C8064601E;
	Wed,  8 Jan 2025 14:42:27 +0100 (CET)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id 398B14014F;
	Wed,  8 Jan 2025 14:42:27 +0100 (CET)
Received: from szxga04-in.huawei.com (szxga04-in.huawei.com [45.249.212.190])
 by mails.dpdk.org (Postfix) with ESMTP id C68E5400D6
 for <dev@dpdk.org>; Wed,  8 Jan 2025 14:42:25 +0100 (CET)
Received: from mail.maildlp.com (unknown [172.19.162.112])
 by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4YSpvD6Hd6z22kcP;
 Wed,  8 Jan 2025 21:40:08 +0800 (CST)
Received: from kwepemf200007.china.huawei.com (unknown [7.202.181.233])
 by mail.maildlp.com (Postfix) with ESMTPS id 14C961402C4;
 Wed,  8 Jan 2025 21:42:24 +0800 (CST)
Received: from frapeml500007.china.huawei.com (7.182.85.172) by
 kwepemf200007.china.huawei.com (7.202.181.233) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 15.2.1544.11; Wed, 8 Jan 2025 21:42:22 +0800
Received: from frapeml500007.china.huawei.com ([7.182.85.172]) by
 frapeml500007.china.huawei.com ([7.182.85.172]) with mapi id 15.01.2507.039;
 Wed, 8 Jan 2025 14:42:20 +0100
From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: huangdengdui <huangdengdui@huawei.com>, "dev@dpdk.org" <dev@dpdk.org>
CC: "wathsala.vithanage@arm.com" <wathsala.vithanage@arm.com>,
 "stephen@networkplumber.org" <stephen@networkplumber.org>, liuyonglong
 <liuyonglong@huawei.com>, Fengchengwen <fengchengwen@huawei.com>, haijie
 <haijie1@huawei.com>, "lihuisong (C)" <lihuisong@huawei.com>
Subject: RE: [PATCH] examples/l3fwd: optimize packet prefetch
Thread-Topic: [PATCH] examples/l3fwd: optimize packet prefetch
Thread-Index: AQHbVqIcSetcUR8ocUOw3vCaD3RnRLMM9pOw
Date: Wed, 8 Jan 2025 13:42:20 +0000
Message-ID: <a966e66c538946d9b22ed337c77d9b25@huawei.com>
References: <20241225075302.353013-1-huangdengdui@huawei.com>
In-Reply-To: <20241225075302.353013-1-huangdengdui@huawei.com>
Accept-Language: en-GB, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.206.138.73]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org



>
> The prefetch window depends on the hardware platform. The current prefetch
> policy may not be applicable to all platforms. In most cases, the number of
> packets received by an Rx burst is small (64 is used in most performance reports).
> In L3fwd, the maximum value cannot exceed 512. Therefore, prefetching all
> packets before processing can achieve better performance.

As you mentioned, 'prefetch' behavior differs a lot from one HW platform to another.
So it could easily be that the changes you are suggesting will cause a performance
boost on one platform and a degradation on another.
In fact, right now l3fwd 'prefetch' usage is a bit of a mess:
- l3fwd_lpm_neon.h uses FWDSTEP as a prefetch window.
- l3fwd_fib.c uses FIB_PREFETCH_OFFSET for that purpose.
- The rest of the code either uses PREFETCH_OFFSET or doesn't use 'prefetch' at all.
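For comparison, the FWDSTEP-based schedule that the patch removes from l3fwd_lpm_neon.h can be modelled outside DPDK roughly as follows. This is a simplified sketch, not the l3fwd code. The hypothetical callbacks prefetch() and process() stand in for rte_prefetch0(rte_pktmbuf_mtod(m, void *)) and the processx4_step*() calls. Packets are processed one at a time here rather than four at a time. A tracing pair records the resulting order (+id for a prefetch, -id for processing) so the schedule can be checked.

```c
#include <assert.h>

#define FWDSTEP 4 /* group size, matching the processx4_* helpers (4 packets) */

typedef void (*pkt_fn)(void *pkt);

/*
 * Simplified model of the old l3fwd_lpm_neon.h schedule: the next group of
 * FWDSTEP packets is prefetched before the current group is processed, and
 * the 0..FWDSTEP-1 leftover packets are prefetched only after all full
 * groups are done. prefetch/process are hypothetical stand-ins.
 */
static void
process_grouped(void **pkts, int nb_rx, pkt_fn prefetch, pkt_fn process)
{
	const int k = nb_rx - nb_rx % FWDSTEP; /* packets in full groups */
	int i, j;

	assert(nb_rx >= 0);

	/* Prime the pipeline with the first full group. */
	for (i = 0; i < FWDSTEP && i < k; i++)
		prefetch(pkts[i]);

	for (j = 0; j < k; j += FWDSTEP) {
		/* Prefetch the next full group (if any), then process this one. */
		for (i = j + FWDSTEP; i < j + 2 * FWDSTEP && i < k; i++)
			prefetch(pkts[i]);
		for (i = j; i < j + FWDSTEP; i++)
			process(pkts[i]);
	}

	/* Tail: prefetch the remaining packets, then process them. */
	for (i = k; i < nb_rx; i++)
		prefetch(pkts[i]);
	for (i = k; i < nb_rx; i++)
		process(pkts[i]);
}

/* Tracing stand-ins: +id for a prefetch, -id for processing (ids start at 1). */
static int trace[64];
static int trace_len;

static void trace_prefetch(void *pkt) { trace[trace_len++] = *(int *)pkt; }
static void trace_process(void *pkt) { trace[trace_len++] = -*(int *)pkt; }
```

With 9 packets, ids 1..8 are prefetched before anything is processed, and packet 9 is prefetched only right before it is used. That is one reason the effective prefetch distance of this scheme depends on nb_rx % FWDSTEP.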

Probably what we need here is some unified approach:
a run-time configurable prefetch_window_size that all code paths will obey.
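A minimal sketch of that idea follows. It assumes the window is passed in as a plain integer and leaves open how it would be set, for example from a command-line option. A single loop covers "no prefetch", "sliding window" and "prefetch everything first", and which one you get depends only on the window value. prefetch_window_size is the name suggested above, not an existing l3fwd option. prefetch() and process() are hypothetical callbacks standing in for rte_prefetch0(rte_pktmbuf_mtod(m, void *)) and the per-packet lookup. The tracing pair exists only to make the ordering observable.

```c
#include <assert.h>

typedef void (*pkt_fn)(void *pkt);

/*
 * One schedule driven by a run-time window size (hypothetical
 * prefetch_window_size knob, passed here as 'window'):
 *   window == 0      -> no prefetch at all
 *   window >= nb_rx  -> prefetch the whole burst first (the patch's policy)
 *   otherwise        -> keep 'window' packets prefetched ahead of processing
 */
static void
process_burst(void **pkts, int nb_rx, unsigned int window,
	      pkt_fn prefetch, pkt_fn process)
{
	int i, w;

	assert(nb_rx >= 0);
	w = window > (unsigned int)nb_rx ? nb_rx : (int)window;

	/* Warm-up: prefetch the first w packets. */
	for (i = 0; i < w; i++)
		prefetch(pkts[i]);

	for (i = 0; i < nb_rx; i++) {
		/* Slide the window: prefetch packet i + w before processing packet i. */
		if (w > 0 && i + w < nb_rx)
			prefetch(pkts[i + w]);
		process(pkts[i]);
	}
}

/* Tracing stand-ins: +id for a prefetch, -id for processing (ids start at 1). */
static int trace[64];
static int trace_len;

static void trace_prefetch(void *pkt) { trace[trace_len++] = *(int *)pkt; }
static void trace_process(void *pkt) { trace[trace_len++] = -*(int *)pkt; }
```

With a window of 512, the whole burst is prefetched before any packet is touched, which is what the patch hard-codes. A small window bounds the number of outstanding prefetches, much like the fixed per-file offsets listed earlier. The best value would still have to be measured on each platform.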

> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
> ---
>  examples/l3fwd/l3fwd_lpm_neon.h | 42 ++++-----------------------------
>  1 file changed, 5 insertions(+), 37 deletions(-)
>
> diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
> index 3c1f827424..0b51782b8c 100644
> --- a/examples/l3fwd/l3fwd_lpm_neon.h
> +++ b/examples/l3fwd/l3fwd_lpm_neon.h
> @@ -91,53 +91,21 @@ l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>  	const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
>  	const int32_t m = nb_rx % FWDSTEP;
>
> -	if (k) {
> -		for (i = 0; i < FWDSTEP; i++) {
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
> -							void *));
> -		}
> -		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
> -			for (i = 0; i < FWDSTEP; i++) {
> -				rte_prefetch0(rte_pktmbuf_mtod(
> -						pkts_burst[j + i + FWDSTEP],
> -						void *));
> -			}
> +	/* The number of packets is small. Prefetch all packets. */
> +	for (i = 0; i < nb_rx; i++)
> +		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
>
> +	if (k) {
> +		for (j = 0; j != k; j += FWDSTEP) {
>  			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
>  			processx4_step2(qconf, dip, ipv4_flag, portid,
>  					&pkts_burst[j], &dst_port[j]);
>  			if (do_step3)
>  				processx4_step3(&pkts_burst[j], &dst_port[j]);
>  		}
> -
> -		processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
> -		processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
> -				&dst_port[j]);
> -		if (do_step3)
> -			processx4_step3(&pkts_burst[j], &dst_port[j]);
> -
> -		j += FWDSTEP;
>  	}
>
>  	if (m) {
> -		/* Prefetch last up to 3 packets one by one */
> -		switch (m) {
> -		case 3:
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> -							void *));
> -			j++;
> -			/* fallthrough */
> -		case 2:
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> -							void *));
> -			j++;
> -			/* fallthrough */
> -		case 1:
> -			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> -							void *));
> -			j++;
> -		}
> -		j -=3D m;
>  		/* Classify last up to 3 packets one by one */
>  		switch (m) {
>  		case 3:
> --
> 2.33.0