From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Wani, Shaiq"
To: "Richardson, Bruce"
Cc: "dev@dpdk.org", "Singh, Aman Deep"
Subject: RE: [PATCH 1/2] common/idpf: enable AVX2 for single queue Rx
Date: Mon, 27 Jan 2025 08:19:57 +0000
References: <20250108121757.170494-1-shaiq.wani@intel.com>
 <20250108121757.170494-2-shaiq.wani@intel.com>
Hi,

Thanks for the review and feedback. Below I have addressed your comments inline.

-----Original Message-----
From: Richardson, Bruce
Sent: Monday, January 20, 2025 7:46 PM
To: Wani, Shaiq
Cc: dev@dpdk.org; Singh, Aman Deep
Subject: Re: [PATCH 1/2] common/idpf: enable AVX2 for single queue Rx

On Wed, Jan 08, 2025 at 05:47:56PM +0530, Shaiq Wani wrote:
> In case some CPUs don't support AVX512, enable AVX2 for them to get
> better per-core performance.
>
> Signed-off-by: Shaiq Wani

Hi, some review comments inline below.

/Bruce

> ---
>  drivers/common/idpf/idpf_common_device.h    |   1 +
>  drivers/common/idpf/idpf_common_rxtx.h      |   4 +
>  drivers/common/idpf/idpf_common_rxtx_avx2.c | 590 ++++++++++++++++++++
>  drivers/common/idpf/meson.build             |  15 +
>  drivers/common/idpf/version.map             |   1 +
>  drivers/net/idpf/idpf_rxtx.c                |  12 +
>  6 files changed, 623 insertions(+)
>  create mode 100644 drivers/common/idpf/idpf_common_rxtx_avx2.c
>
> diff --git a/drivers/common/idpf/idpf_common_device.h b/drivers/common/idpf/idpf_common_device.h
> index bfa927a5ff..734be1c88a 100644
> --- a/drivers/common/idpf/idpf_common_device.h
> +++ b/drivers/common/idpf/idpf_common_device.h
> @@ -123,6 +123,7 @@ struct idpf_vport {
>
>  	bool rx_vec_allowed;
>  	bool tx_vec_allowed;
> +	bool rx_use_avx2;
>  	bool rx_use_avx512;
>  	bool tx_use_avx512;
>
> diff --git a/drivers/common/idpf/idpf_common_rxtx.h b/drivers/common/idpf/idpf_common_rxtx.h
> index eeeeed12e2..f50cf5ef46 100644
> --- a/drivers/common/idpf/idpf_common_rxtx.h
> +++ b/drivers/common/idpf/idpf_common_rxtx.h
> @@ -302,5 +302,9 @@ uint16_t idpf_dp_splitq_xmit_pkts_avx512(void *tx_queue, struct rte_mbuf **tx_pk
>  __rte_internal
>  uint16_t idpf_dp_singleq_recv_scatter_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  					   uint16_t nb_pkts);
> +__rte_internal
> +uint16_t idpf_dp_singleq_recv_pkts_avx2(void *rx_queue,
> +					struct rte_mbuf **rx_pkts,
> +					uint16_t nb_pkts);
>

I'm a little confused by the "singleq" part of the name here, can you explain a little (in the commit message, perhaps) what the "single" is referring to? Does the driver have the ability to poll multiple queues at once or something?

[SHAIQ] - idpf supports two queue models: singleq and splitq. In the singleq model, packets are processed and stored in order within a single RX queue. This will be explicitly mentioned in v2 of the patch.

>  #endif /* _IDPF_COMMON_RXTX_H_ */
> diff --git a/drivers/common/idpf/idpf_common_rxtx_avx2.c b/drivers/common/idpf/idpf_common_rxtx_avx2.c
> new file mode 100644
> index 0000000000..a05b26c68a
> --- /dev/null
> +++ b/drivers/common/idpf/idpf_common_rxtx_avx2.c
> @@ -0,0 +1,590 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Intel Corporation
> + */
> +
> +#include <rte_vect.h>
> +
> +#include "idpf_common_rxtx.h"
> +#include "idpf_common_device.h"
> +
> +#ifndef __INTEL_COMPILER
> +#pragma GCC diagnostic ignored "-Wcast-qual"
> +#endif

There is work ongoing to stop using this warning removal [1]. This code may need to be rebased on top of that if it's applied soon.

[1] https://patches.dpdk.org/project/dpdk/list/?series=34390

[SHAIQ] - As for the warnings, we will address them once the patchset [https://patches.dpdk.org/project/dpdk/list/?series=34390] is merged.
> +
> +static __rte_always_inline void
> +idpf_singleq_rx_rearm(struct idpf_rx_queue *rxq)
> +{
> +	int i;
> +	uint16_t rx_id;
> +	volatile union virtchnl2_rx_desc *rxdp = rxq->rx_ring;
> +	struct rte_mbuf **rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> +	rxdp += rxq->rxrearm_start;
> +
> +	/* Pull 'n' more MBUFs into the software ring */
> +	if (rte_mempool_get_bulk(rxq->mp,
> +				 (void *)rxep,
> +				 IDPF_RXQ_REARM_THRESH) < 0) {
> +		if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
> +		    rxq->nb_rx_desc) {
> +			__m128i dma_addr0;
> +
> +			dma_addr0 = _mm_setzero_si128();
> +			for (i = 0; i < IDPF_VPMD_DESCS_PER_LOOP; i++) {
> +				rxep[i] = &rxq->fake_mbuf;
> +				_mm_store_si128((__m128i *)&rxdp[i].read,
> +						dma_addr0);
> +			}
> +		}
> +		rte_atomic_fetch_add_explicit(&rxq->rx_stats.mbuf_alloc_failed,
> +					      IDPF_RXQ_REARM_THRESH, rte_memory_order_relaxed);
> +
> +		return;
> +	}
> +
> +	struct rte_mbuf *mb0, *mb1;
> +	__m128i dma_addr0, dma_addr1;
> +	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
> +					  RTE_PKTMBUF_HEADROOM);
> +	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
> +	for (i = 0; i < IDPF_RXQ_REARM_THRESH; i += 2, rxep += 2) {
> +		__m128i vaddr0, vaddr1;
> +
> +		mb0 = rxep[0];
> +		mb1 = rxep[1];
> +
> +		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> +		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> +				 offsetof(struct rte_mbuf, buf_addr) + 8);
> +		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> +		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
> +
> +		/* convert pa to dma_addr hdr/data */
> +		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
> +		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
> +
> +		/* add headroom to pa values */
> +		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
> +		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
> +
> +		/* flush desc with pa dma_addr */
> +		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
> +		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
> +	}
> +
> +	rxq->rxrearm_start += IDPF_RXQ_REARM_THRESH;
> +	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
> +		rxq->rxrearm_start = 0;
> +
> +	rxq->rxrearm_nb -= IDPF_RXQ_REARM_THRESH;
> +
> +	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
> +			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
> +
> +	/* Update the tail pointer on the NIC */
> +	IDPF_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
> +}
> +
> +static inline uint16_t
> +_idpf_singleq_recv_raw_pkts_vec_avx2(struct idpf_rx_queue *rxq, struct rte_mbuf **rx_pkts,
> +				     uint16_t nb_pkts, uint8_t *split_packet)
> +{
> +#define IDPF_DESCS_PER_LOOP_AVX 8
> +
> +	const uint32_t *ptype_tbl = rxq->adapter->ptype_tbl;
> +	const __m256i mbuf_init = _mm256_set_epi64x(0, 0,
> +						    0, rxq->mbuf_initializer);
> +	struct rte_mbuf **sw_ring = &rxq->sw_ring[rxq->rx_tail];
> +	volatile union virtchnl2_rx_desc *rxdp = rxq->rx_ring;
> +	const int avx_aligned = ((rxq->rx_tail & 1) == 0);
> +
> +	rxdp += rxq->rx_tail;
> +
> +	rte_prefetch0(rxdp);
> +
> +	/* nb_pkts has to be floor-aligned to IDPF_DESCS_PER_LOOP_AVX */
> +	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, IDPF_DESCS_PER_LOOP_AVX);
> +
> +	/* See if we need to rearm the RX queue - gives the prefetch a bit
> +	 * of time to act
> +	 */
> +	if (rxq->rxrearm_nb > IDPF_RXQ_REARM_THRESH)
> +		idpf_singleq_rx_rearm(rxq);
> +
> +	/* Before we start moving massive data around, check to see if
> +	 * there is actually a packet available
> +	 */
> +	if (!(rxdp->flex_nic_wb.status_error0 &
> +	      rte_cpu_to_le_32(1 << VIRTCHNL2_RX_FLEX_DESC_STATUS0_DD_S)))
> +		return 0;
> +
> +	/* 8 packets DD mask, LSB in each 32-bit value */
> +	const __m256i dd_check = _mm256_set1_epi32(1);
> +
> +	/* 8 packets EOP mask, second-LSB in each 32-bit value */
> +	const __m256i eop_check = _mm256_slli_epi32(dd_check,
> +						    VIRTCHNL2_RX_FLEX_DESC_STATUS0_EOF_S);
> +
> +	/* mask to shuffle from desc. to mbuf (2 descriptors) */
> +	const __m256i shuf_msk =
> +		_mm256_set_epi8
> +			(/* first descriptor */
> +			 0xFF, 0xFF,
> +			 0xFF, 0xFF,	/* rss hash parsed separately */
> +			 11, 10,	/* octet 10~11, 16 bits vlan_macip */
> +			 5, 4,		/* octet 4~5, 16 bits data_len */
> +			 0xFF, 0xFF,	/* skip hi 16 bits pkt_len, zero out */
> +			 5, 4,		/* octet 4~5, 16 bits pkt_len */
> +			 0xFF, 0xFF,	/* pkt_type set as unknown */
> +			 0xFF, 0xFF,	/* pkt_type set as unknown */
> +			 /* second descriptor */
> +			 0xFF, 0xFF,
> +			 0xFF, 0xFF,	/* rss hash parsed separately */
> +			 11, 10,	/* octet 10~11, 16 bits vlan_macip */
> +			 5, 4,		/* octet 4~5, 16 bits data_len */
> +			 0xFF, 0xFF,	/* skip hi 16 bits pkt_len, zero out */
> +			 5, 4,		/* octet 4~5, 16 bits pkt_len */
> +			 0xFF, 0xFF,	/* pkt_type set as unknown */
> +			 0xFF, 0xFF	/* pkt_type set as unknown */
> +			);
> +	/**
> +	 * compile-time check the above crc and shuffle layout is correct.
> +	 * NOTE: the first field (lowest address) is given last in set_epi
> +	 * calls above.
> +	 */
> +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
> +			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
> +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
> +			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
> +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
> +			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
> +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
> +			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
> +
> +	/* Status/Error flag masks */
> +	/**
> +	 * mask everything except Checksum Reports, RSS indication
> +	 * and VLAN indication.
> +	 * bit6:4 for IP/L4 checksum errors.
> +	 * bit12 is for RSS indication.
> +	 * bit13 is for VLAN indication.
> +	 */
> +	const __m256i flags_mask =
> +		_mm256_set1_epi32((0xF << 4) | (1 << 12) | (1 << 13));
> +	/**
> +	 * data to be shuffled by the result of the flags mask shifted by 4
> +	 * bits. This gives us the l3_l4 flags.
> +	 */
> +	const __m256i l3_l4_flags_shuf =
> +		_mm256_set_epi8((RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 |
> +		 RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		/**
> +		 * second 128-bits
> +		 * shift right 20 bits to use the low two bits to indicate
> +		 * outer checksum status
> +		 * shift right 1 bit to make sure it not exceed 255
> +		 */
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1);
> +	const __m256i cksum_mask =
> +		_mm256_set1_epi32(RTE_MBUF_F_RX_IP_CKSUM_MASK |
> +				  RTE_MBUF_F_RX_L4_CKSUM_MASK |
> +				  RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +				  RTE_MBUF_F_RX_OUTER_L4_CKSUM_MASK);
> +	/**
> +	 * data to be shuffled by result of flag mask, shifted down 12.
> +	 * If RSS(bit12)/VLAN(bit13) are set,
> +	 * shuffle moves appropriate flags in place.
> +	 */
> +	const __m256i rss_vlan_flags_shuf = _mm256_set_epi8(0, 0, 0, 0,
> +			0, 0, 0, 0,
> +			0, 0, 0, 0,
> +			RTE_MBUF_F_RX_RSS_HASH | RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED,
> +			RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED,
> +			RTE_MBUF_F_RX_RSS_HASH, 0,
> +			/* end up 128-bits */
> +			0, 0, 0, 0,
> +			0, 0, 0, 0,
> +			0, 0, 0, 0,
> +			RTE_MBUF_F_RX_RSS_HASH | RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED,
> +			RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED,
> +			RTE_MBUF_F_RX_RSS_HASH, 0);
> +
> +	RTE_SET_USED(avx_aligned); /* for 32B descriptors we don't use this */
> +

Does the driver or HW support 16B descriptors? If not, just remove this variable completely. Don't keep it just for consistency with other drivers.

[SHAIQ] - The avx_aligned variable will be retained, since 16B descriptors are supported by the HW.
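For reference, other Intel AVX2 Rx paths (i40e, for example) use this flag to select an aligned 256-bit load path when 16B descriptors are in use, since two 16B descriptors then fit in one register. A rough sketch of that idea - purely illustrative and not part of this patch, as the 16B-descriptor path is not implemented here:

	/* Hypothetical 16B-descriptor path: when rx_tail is even, rxdp is
	 * 32-byte aligned, so two descriptors can be fetched with one load.
	 */
	if (avx_aligned) {
		raw_desc0_1 = _mm256_load_si256((const __m256i *)(rxdp + 0));
		raw_desc2_3 = _mm256_load_si256((const __m256i *)(rxdp + 2));
	}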
> +	uint16_t i, received;
> +
> +	for (i = 0, received = 0; i < nb_pkts;
> +	     i += IDPF_DESCS_PER_LOOP_AVX,
> +	     rxdp += IDPF_DESCS_PER_LOOP_AVX) {
> +		/* step 1, copy over 8 mbuf pointers to rx_pkts array */
> +		_mm256_storeu_si256((void *)&rx_pkts[i],
> +				    _mm256_loadu_si256((void *)&sw_ring[i]));
> +#ifdef RTE_ARCH_X86_64
> +		_mm256_storeu_si256
> +			((void *)&rx_pkts[i + 4],
> +			 _mm256_loadu_si256((void *)&sw_ring[i + 4]));
> +#endif
> +
> +		__m256i raw_desc0_1, raw_desc2_3, raw_desc4_5, raw_desc6_7;
> +
> +		const __m128i raw_desc7 =
> +			_mm_load_si128((void *)(rxdp + 7));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc6 =
> +			_mm_load_si128((void *)(rxdp + 6));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc5 =
> +			_mm_load_si128((void *)(rxdp + 5));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc4 =
> +			_mm_load_si128((void *)(rxdp + 4));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc3 =
> +			_mm_load_si128((void *)(rxdp + 3));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc2 =
> +			_mm_load_si128((void *)(rxdp + 2));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc1 =
> +			_mm_load_si128((void *)(rxdp + 1));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc0 =
> +			_mm_load_si128((void *)(rxdp + 0));
> +

Here and in a number of other places, I think you can reduce the amount of word-wrapping being done. Unlike when the first AVX2 code was being written, we can now use up to 100 characters per line.

[SHAIQ] - We will reduce excessive word wrapping in some sections to enhance readability.
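For example, the descriptor loads above fit comfortably on single lines within the 100-character limit; the v2 code will look roughly like this (style sketch only, not the final code):

	const __m128i raw_desc7 = _mm_load_si128((void *)(rxdp + 7));
	rte_compiler_barrier();
	const __m128i raw_desc6 = _mm_load_si128((void *)(rxdp + 6));
	rte_compiler_barrier();
	const __m128i raw_desc5 = _mm_load_si128((void *)(rxdp + 5));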
> + */ > + __m256i mb6_7 =3D _mm256_shuffle_epi8(raw_desc6_7, shuf_msk); > + __m256i mb4_5 =3D _mm256_shuffle_epi8(raw_desc4_5, shuf_msk); > + > + /** > + * to get packet types, ptype is located in bit16-25 > + * of each 128bits > + */ > + const __m256i ptype_mask =3D > + _mm256_set1_epi16(VIRTCHNL2_RX_FLEX_DESC_PTYPE_M); > + const __m256i ptypes6_7 =3D > + _mm256_and_si256(raw_desc6_7, ptype_mask); > + const __m256i ptypes4_5 =3D > + _mm256_and_si256(raw_desc4_5, ptype_mask); > + const uint16_t ptype7 =3D _mm256_extract_epi16(ptypes6_7, 9); > + const uint16_t ptype6 =3D _mm256_extract_epi16(ptypes6_7, 1); > + const uint16_t ptype5 =3D _mm256_extract_epi16(ptypes4_5, 9); > + const uint16_t ptype4 =3D _mm256_extract_epi16(ptypes4_5, 1); > + > + mb6_7 =3D _mm256_insert_epi32(mb6_7, ptype_tbl[ptype7], 4); > + mb6_7 =3D _mm256_insert_epi32(mb6_7, ptype_tbl[ptype6], 0); > + mb4_5 =3D _mm256_insert_epi32(mb4_5, ptype_tbl[ptype5], 4); > + mb4_5 =3D _mm256_insert_epi32(mb4_5, ptype_tbl[ptype4], 0); > + /* merge the status bits into one register */ > + const __m256i status4_7 =3D _mm256_unpackhi_epi32(raw_desc6_7, > + raw_desc4_5); > + > + /** > + * convert descriptors 0-3 into mbufs, re-arrange fields. > + * Then write into the mbuf. > + */ > + __m256i mb2_3 =3D _mm256_shuffle_epi8(raw_desc2_3, shuf_msk); > + __m256i mb0_1 =3D _mm256_shuffle_epi8(raw_desc0_1, shuf_msk); > + > + /** > + * to get packet types, ptype is located in bit16-25 > + * of each 128bits > + */ > + const __m256i ptypes2_3 =3D > + _mm256_and_si256(raw_desc2_3, ptype_mask); > + const __m256i ptypes0_1 =3D > + _mm256_and_si256(raw_desc0_1, ptype_mask); > + const uint16_t ptype3 =3D _mm256_extract_epi16(ptypes2_3, 9); > + const uint16_t ptype2 =3D _mm256_extract_epi16(ptypes2_3, 1); > + const uint16_t ptype1 =3D _mm256_extract_epi16(ptypes0_1, 9); > + const uint16_t ptype0 =3D _mm256_extract_epi16(ptypes0_1, 1); > + > + mb2_3 =3D _mm256_insert_epi32(mb2_3, ptype_tbl[ptype3], 4); > + mb2_3 =3D _mm256_insert_epi32(mb2_3, ptype_tbl[ptype2], 0); > + mb0_1 =3D _mm256_insert_epi32(mb0_1, ptype_tbl[ptype1], 4); > + mb0_1 =3D _mm256_insert_epi32(mb0_1, ptype_tbl[ptype0], 0); > + /* merge the status bits into one register */ > + const __m256i status0_3 =3D _mm256_unpackhi_epi32(raw_desc2_3, > + raw_desc0_1); > + > + /** > + * take the two sets of status bits and merge to one > + * After merge, the packets status flags are in the > + * order (hi->lo): [1, 3, 5, 7, 0, 2, 4, 6] > + */ > + __m256i status0_7 =3D _mm256_unpacklo_epi64(status4_7, > + status0_3); > + > + /* now do flag manipulation */ > + > + /* get only flag/error bits we want */ > + const __m256i flag_bits =3D > + _mm256_and_si256(status0_7, flags_mask); > + /** > + * l3_l4_error flags, shuffle, then shift to correct adjustment > + * of flags in flags_shuf, and finally mask out extra bits > + */ > + __m256i l3_l4_flags =3D _mm256_shuffle_epi8(l3_l4_flags_shuf, > + _mm256_srli_epi32(flag_bits, 4)); > + l3_l4_flags =3D _mm256_slli_epi32(l3_l4_flags, 1); > + > + __m256i l4_outer_mask =3D _mm256_set1_epi32(0x6); > + __m256i l4_outer_flags =3D > + _mm256_and_si256(l3_l4_flags, l4_outer_mask); > + l4_outer_flags =3D _mm256_slli_epi32(l4_outer_flags, 20); > + > + __m256i l3_l4_mask =3D _mm256_set1_epi32(~0x6); > + l3_l4_flags =3D _mm256_and_si256(l3_l4_flags, l3_l4_mask); > + l3_l4_flags =3D _mm256_or_si256(l3_l4_flags, l4_outer_flags); > + l3_l4_flags =3D _mm256_and_si256(l3_l4_flags, cksum_mask); > + /* set rss and vlan flags */ > + const __m256i rss_vlan_flag_bits =3D > + 
> +		const __m256i rss_vlan_flag_bits =
> +			_mm256_srli_epi32(flag_bits, 12);
> +		const __m256i rss_vlan_flags =
> +			_mm256_shuffle_epi8(rss_vlan_flags_shuf,
> +					    rss_vlan_flag_bits);
> +
> +		/* merge flags */
> +		__m256i mbuf_flags = _mm256_or_si256(l3_l4_flags,
> +						     rss_vlan_flags);
> +
> +		/**
> +		 * At this point, we have the 8 sets of flags in the low 16-bits
> +		 * of each 32-bit value in vlan0.
> +		 * We want to extract these, and merge them with the mbuf init
> +		 * data so we can do a single write to the mbuf to set the flags
> +		 * and all the other initialization fields. Extracting the
> +		 * appropriate flags means that we have to do a shift and blend
> +		 * for each mbuf before we do the write. However, we can also
> +		 * add in the previously computed rx_descriptor fields to
> +		 * make a single 256-bit write per mbuf
> +		 */
> +		/* check the structure matches expectations */
> +		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
> +				 offsetof(struct rte_mbuf, rearm_data) + 8);
> +		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
> +				 RTE_ALIGN(offsetof(struct rte_mbuf,
> +						    rearm_data),
> +					   16));
> +		/* build up data and do writes */
> +		__m256i rearm0, rearm1, rearm2, rearm3, rearm4, rearm5,
> +			rearm6, rearm7;
> +		rearm6 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_slli_si256(mbuf_flags, 8),
> +					    0x04);
> +		rearm4 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_slli_si256(mbuf_flags, 4),
> +					    0x04);
> +		rearm2 = _mm256_blend_epi32(mbuf_init, mbuf_flags, 0x04);
> +		rearm0 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_srli_si256(mbuf_flags, 4),
> +					    0x04);
> +		/* permute to add in the rx_descriptor e.g. rss fields */
> +		rearm6 = _mm256_permute2f128_si256(rearm6, mb6_7, 0x20);
> +		rearm4 = _mm256_permute2f128_si256(rearm4, mb4_5, 0x20);
> +		rearm2 = _mm256_permute2f128_si256(rearm2, mb2_3, 0x20);
> +		rearm0 = _mm256_permute2f128_si256(rearm0, mb0_1, 0x20);
> +		/* write to mbuf */
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 6]->rearm_data,
> +				    rearm6);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 4]->rearm_data,
> +				    rearm4);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 2]->rearm_data,
> +				    rearm2);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 0]->rearm_data,
> +				    rearm0);
> +
> +		/* repeat for the odd mbufs */
> +		const __m256i odd_flags =
> +			_mm256_castsi128_si256
> +				(_mm256_extracti128_si256(mbuf_flags, 1));
> +		rearm7 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_slli_si256(odd_flags, 8),
> +					    0x04);
> +		rearm5 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_slli_si256(odd_flags, 4),
> +					    0x04);
> +		rearm3 = _mm256_blend_epi32(mbuf_init, odd_flags, 0x04);
> +		rearm1 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_srli_si256(odd_flags, 4),
> +					    0x04);
> +		/* since odd mbufs are already in hi 128-bits use blend */
> +		rearm7 = _mm256_blend_epi32(rearm7, mb6_7, 0xF0);
> +		rearm5 = _mm256_blend_epi32(rearm5, mb4_5, 0xF0);
> +		rearm3 = _mm256_blend_epi32(rearm3, mb2_3, 0xF0);
> +		rearm1 = _mm256_blend_epi32(rearm1, mb0_1, 0xF0);
> +		/* again write to mbufs */
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 7]->rearm_data,
> +				    rearm7);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 5]->rearm_data,
> +				    rearm5);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 3]->rearm_data,
> +				    rearm3);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 1]->rearm_data,
> +				    rearm1);
> +
> +		/* extract and record EOP bit */
> +		if (split_packet) {
> +			const __m128i eop_mask =
> +				_mm_set1_epi16(1 << VIRTCHNL2_RX_FLEX_DESC_STATUS0_EOF_S);
> +			const __m256i eop_bits256 = _mm256_and_si256(status0_7,
> +								     eop_check);
> +			/* pack status bits into a single 128-bit register */
> +			const __m128i eop_bits =
> +				_mm_packus_epi32
> +					(_mm256_castsi256_si128(eop_bits256),
> +					 _mm256_extractf128_si256(eop_bits256,
> +								  1));
> +			/**
> +			 * flip bits, and mask out the EOP bit, which is now
> +			 * a split-packet bit i.e. !EOP, rather than EOP one.
> +			 */
> +			__m128i split_bits = _mm_andnot_si128(eop_bits,
> +							      eop_mask);
> +			/**
> +			 * eop bits are out of order, so we need to shuffle them
> +			 * back into order again. In doing so, only use low 8
> +			 * bits, which acts like another pack instruction
> +			 * The original order is (hi->lo): 1,3,5,7,0,2,4,6
> +			 * [Since we use epi8, the 16-bit positions are
> +			 * multiplied by 2 in the eop_shuffle value.]
> +			 */
> +			__m128i eop_shuffle =
> +				_mm_set_epi8(/* zero hi 64b */
> +					     0xFF, 0xFF, 0xFF, 0xFF,
> +					     0xFF, 0xFF, 0xFF, 0xFF,
> +					     /* move values to lo 64b */
> +					     8, 0, 10, 2,
> +					     12, 4, 14, 6);
> +			split_bits = _mm_shuffle_epi8(split_bits, eop_shuffle);
> +			*(uint64_t *)split_packet =
> +				_mm_cvtsi128_si64(split_bits);
> +			split_packet += IDPF_DESCS_PER_LOOP_AVX;
> +		}

As above, if there are no plans to support multi-buffer packet reassembly, drop this.

[SHAIQ] - We plan to drop this.

> +
> +		/* perform dd_check */
> +		status0_7 = _mm256_and_si256(status0_7, dd_check);
> +		status0_7 = _mm256_packs_epi32(status0_7,
> +					       _mm256_setzero_si256());
> +
> +		uint64_t burst = rte_popcount64
> +					(_mm_cvtsi128_si64
> +						(_mm256_extracti128_si256
> +							(status0_7, 1)));
> +		burst += rte_popcount64
> +				(_mm_cvtsi128_si64
> +					(_mm256_castsi256_si128(status0_7)));
> +		received += burst;
> +		if (burst != IDPF_DESCS_PER_LOOP_AVX)
> +			break;
> +	}
> +
> +	/* update tail pointers */
> +	rxq->rx_tail += received;
> +	rxq->rx_tail &= (rxq->nb_rx_desc - 1);
> +	if ((rxq->rx_tail & 1) == 1 && received > 1) { /* keep avx2 aligned */
> +		rxq->rx_tail--;
> +		received--;
> +	}
> +	rxq->rxrearm_nb += received;
> +	return received;
> +}
> +
> +/**
> + * Notice:
> + * - nb_pkts < IDPF_DESCS_PER_LOOP, just return no packet
> + */
> +uint16_t
> +idpf_dp_singleq_recv_pkts_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
> +			       uint16_t nb_pkts)
> +{
> +	return _idpf_singleq_recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
> +}
> diff --git a/drivers/common/idpf/meson.build b/drivers/common/idpf/meson.build
> index 46fd45c03b..4caa06a9b7 100644
> --- a/drivers/common/idpf/meson.build
> +++ b/drivers/common/idpf/meson.build
> @@ -16,6 +16,21 @@ sources = files(
>  )
> 
>  if arch_subdir == 'x86'
> +    # compile AVX2 version if either:
> +    # a. we have AVX supported in minimum instruction set baseline
> +    # b. it's not minimum instruction set, but supported by compiler
> +    if cc.get_define('__AVX2__', args: machine_args) != ''
> +        cflags += ['-DCC_AVX2_SUPPORT']
> +        sources += files('idpf_common_rxtx_avx2.c')
> +    elif cc.has_argument('-mavx2')
> +        cflags += ['-DCC_AVX2_SUPPORT']
> +        idpf_avx2_lib = static_library('idpf_avx2_lib',
> +                'idpf_common_rxtx_avx2.c',
> +                dependencies: [static_rte_ethdev, static_rte_kvargs, static_rte_hash],
> +                include_directories: includes,
> +                c_args: [cflags, '-mavx2'])
> +        objs += idpf_avx2_lib.extract_objects('idpf_common_rxtx_avx2.c')
> +    endif
>      if cc_has_avx512
>          cflags += ['-DCC_AVX512_SUPPORT']
>          avx512_args = cflags + cc_avx512_flags

This logic is out-of-date, since all supported compilers have AVX2 support. Suggest reworking using drivers/net/ice/meson.build as a reference.

[SHAIQ] - The meson.build script will be reworked, using drivers/net/ice/meson.build as a reference.

> diff --git a/drivers/common/idpf/version.map b/drivers/common/idpf/version.map
> index 0729f6b912..4510aae6b3 100644
> --- a/drivers/common/idpf/version.map
> +++ b/drivers/common/idpf/version.map
> @@ -14,6 +14,7 @@ INTERNAL {
>  	idpf_dp_splitq_recv_pkts_avx512;
>  	idpf_dp_splitq_xmit_pkts;
>  	idpf_dp_splitq_xmit_pkts_avx512;
> +	idpf_dp_singleq_recv_pkts_avx2;
> 

This list should be alphabetical, so singleq should go before splitq.

[SHAIQ] - Will address the issue in v2.

>  	idpf_qc_rx_thresh_check;
>  	idpf_qc_rx_queue_release;
> diff --git a/drivers/net/idpf/idpf_rxtx.c b/drivers/net/idpf/idpf_rxtx.c
> index 858bbefe3b..80c6c325e8 100644
> --- a/drivers/net/idpf/idpf_rxtx.c
> +++ b/drivers/net/idpf/idpf_rxtx.c
> @@ -776,6 +776,11 @@ idpf_set_rx_function(struct rte_eth_dev *dev)
>  	    rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128) {
>  		vport->rx_vec_allowed = true;
> 
> +		if ((rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
> +		     rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1) &&
> +		    rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_256)
> +			vport->rx_use_avx2 = true;
> +

There are no CPUs that support AVX512 without supporting AVX2 - and if there were, we probably couldn't use an AVX2 code path on them anyway. Therefore only check the AVX2 flag and the bitwidth.

[SHAIQ] - Will address in v2.
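Concretely, the v2 check should reduce to something like this (sketch of the suggested change, not the final code):

	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 &&
	    rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_256)
		vport->rx_use_avx2 = true;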
>  		if (rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_512)
>  #ifdef CC_AVX512_SUPPORT
>  			if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1 &&
> @@ -827,6 +832,13 @@ idpf_set_rx_function(struct rte_eth_dev *dev)
>  			return;
>  		}
>  #endif /* CC_AVX512_SUPPORT */
> +		if (vport->rx_use_avx2) {
> +			PMD_DRV_LOG(NOTICE,
> +				    "Using Single AVX2 Vector Rx (port %d).",
> +				    dev->data->port_id);
> +			dev->rx_pkt_burst = idpf_dp_singleq_recv_pkts_avx2;
> +			return;
> +		}
>  	}
> 
>  	if (dev->data->scattered_rx) {
> --
> 2.34.1