From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Wani, Shaiq"
To: "Richardson, Bruce"
Cc: "dev@dpdk.org", "Singh, Aman Deep"
Subject: RE: [PATCH 1/2] common/idpf: enable AVX2 for single queue Rx
Date: Mon, 27 Jan 2025 08:19:57 +0000
References: <20250108121757.170494-1-shaiq.wani@intel.com>
 <20250108121757.170494-2-shaiq.wani@intel.com>
Hi,

Thanks for the review and feedback. Below I have addressed your comments inline.

-----Original Message-----
From: Richardson, Bruce
Sent: Monday, January 20, 2025 7:46 PM
To: Wani, Shaiq
Cc: dev@dpdk.org; Singh, Aman Deep
Subject: Re: [PATCH 1/2] common/idpf: enable AVX2 for single queue Rx

On Wed, Jan 08, 2025 at 05:47:56PM +0530, Shaiq Wani wrote:
> In case some CPUs don't support AVX512, enable AVX2 for them to get
> better per-core performance.
>
> Signed-off-by: Shaiq Wani

Hi, some review comments inline below.

/Bruce

> ---
>  drivers/common/idpf/idpf_common_device.h    |   1 +
>  drivers/common/idpf/idpf_common_rxtx.h      |   4 +
>  drivers/common/idpf/idpf_common_rxtx_avx2.c | 590 ++++++++++++++++++++
>  drivers/common/idpf/meson.build             |  15 +
>  drivers/common/idpf/version.map             |   1 +
>  drivers/net/idpf/idpf_rxtx.c                |  12 +
>  6 files changed, 623 insertions(+)
>  create mode 100644 drivers/common/idpf/idpf_common_rxtx_avx2.c
>
> diff --git a/drivers/common/idpf/idpf_common_device.h b/drivers/common/idpf/idpf_common_device.h
> index bfa927a5ff..734be1c88a 100644
> --- a/drivers/common/idpf/idpf_common_device.h
> +++ b/drivers/common/idpf/idpf_common_device.h
> @@ -123,6 +123,7 @@ struct idpf_vport {
>
>  	bool rx_vec_allowed;
>  	bool tx_vec_allowed;
> +	bool rx_use_avx2;
>  	bool rx_use_avx512;
>  	bool tx_use_avx512;
>
> diff --git a/drivers/common/idpf/idpf_common_rxtx.h b/drivers/common/idpf/idpf_common_rxtx.h
> index eeeeed12e2..f50cf5ef46 100644
> --- a/drivers/common/idpf/idpf_common_rxtx.h
> +++ b/drivers/common/idpf/idpf_common_rxtx.h
> @@ -302,5 +302,9 @@ uint16_t idpf_dp_splitq_xmit_pkts_avx512(void *tx_queue, struct rte_mbuf **tx_pk
>  __rte_internal
>  uint16_t idpf_dp_singleq_recv_scatter_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  					   uint16_t nb_pkts);
> +__rte_internal
> +uint16_t idpf_dp_singleq_recv_pkts_avx2(void *rx_queue,
> +					struct rte_mbuf **rx_pkts,
> +					uint16_t nb_pkts);
>

I'm a little confused by the "singleq" part of the name here, can you explain a little (in the commit message, perhaps) what the "single" is referring to? Does the driver have the ability to poll multiple queues at once or something?

[SHAIQ] - idpf supports two queue models: singleq and splitq. In the singleq model, packets are processed and stored in order within a single RX queue. This will be explicitly mentioned in v2 of the patch.

>  #endif /* _IDPF_COMMON_RXTX_H_ */
> diff --git a/drivers/common/idpf/idpf_common_rxtx_avx2.c b/drivers/common/idpf/idpf_common_rxtx_avx2.c
> new file mode 100644
> index 0000000000..a05b26c68a
> --- /dev/null
> +++ b/drivers/common/idpf/idpf_common_rxtx_avx2.c
> @@ -0,0 +1,590 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Intel Corporation
> + */
> +
> +#include <rte_vect.h>
> +
> +#include "idpf_common_rxtx.h"
> +#include "idpf_common_device.h"
> +
> +#ifndef __INTEL_COMPILER
> +#pragma GCC diagnostic ignored "-Wcast-qual"
> +#endif

There is work ongoing to stop using this warning removal [1]. This code may need to be rebased on top of that if it's applied soon.

[1] https://patches.dpdk.org/project/dpdk/list/?series=34390

[SHAIQ] - As for the warnings, we will address them once the patchset [https://patches.dpdk.org/project/dpdk/list/?series=34390] is merged.
> +
> +static __rte_always_inline void
> +idpf_singleq_rx_rearm(struct idpf_rx_queue *rxq)
> +{
> +	int i;
> +	uint16_t rx_id;
> +	volatile union virtchnl2_rx_desc *rxdp = rxq->rx_ring;
> +	struct rte_mbuf **rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> +	rxdp += rxq->rxrearm_start;
> +
> +	/* Pull 'n' more MBUFs into the software ring */
> +	if (rte_mempool_get_bulk(rxq->mp,
> +				 (void *)rxep,
> +				 IDPF_RXQ_REARM_THRESH) < 0) {
> +		if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
> +		    rxq->nb_rx_desc) {
> +			__m128i dma_addr0;
> +
> +			dma_addr0 = _mm_setzero_si128();
> +			for (i = 0; i < IDPF_VPMD_DESCS_PER_LOOP; i++) {
> +				rxep[i] = &rxq->fake_mbuf;
> +				_mm_store_si128((__m128i *)&rxdp[i].read,
> +						dma_addr0);
> +			}
> +		}
> +		rte_atomic_fetch_add_explicit(&rxq->rx_stats.mbuf_alloc_failed,
> +					      IDPF_RXQ_REARM_THRESH, rte_memory_order_relaxed);
> +
> +		return;
> +	}
> +
> +	struct rte_mbuf *mb0, *mb1;
> +	__m128i dma_addr0, dma_addr1;
> +	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
> +					  RTE_PKTMBUF_HEADROOM);
> +	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
> +	for (i = 0; i < IDPF_RXQ_REARM_THRESH; i += 2, rxep += 2) {
> +		__m128i vaddr0, vaddr1;
> +
> +		mb0 = rxep[0];
> +		mb1 = rxep[1];
> +
> +		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> +		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> +				 offsetof(struct rte_mbuf, buf_addr) + 8);
> +		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> +		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
> +
> +		/* convert pa to dma_addr hdr/data */
> +		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
> +		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
> +
> +		/* add headroom to pa values */
> +		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
> +		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
> +
> +		/* flush desc with pa dma_addr */
> +		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
> +		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
> +	}
> +
> +	rxq->rxrearm_start += IDPF_RXQ_REARM_THRESH;
> +	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
> +		rxq->rxrearm_start = 0;
> +
> +	rxq->rxrearm_nb -= IDPF_RXQ_REARM_THRESH;
> +
> +	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
> +			   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
> +
> +	/* Update the tail pointer on the NIC */
> +	IDPF_PCI_REG_WRITE(rxq->qrx_tail, rx_id);
> +}
> +
> +static inline uint16_t
> +_idpf_singleq_recv_raw_pkts_vec_avx2(struct idpf_rx_queue *rxq, struct rte_mbuf **rx_pkts,
> +				     uint16_t nb_pkts, uint8_t *split_packet)
> +{
> +#define IDPF_DESCS_PER_LOOP_AVX 8
> +
> +	const uint32_t *ptype_tbl = rxq->adapter->ptype_tbl;
> +	const __m256i mbuf_init = _mm256_set_epi64x(0, 0,
> +						    0, rxq->mbuf_initializer);
> +	struct rte_mbuf **sw_ring = &rxq->sw_ring[rxq->rx_tail];
> +	volatile union virtchnl2_rx_desc *rxdp = rxq->rx_ring;
> +	const int avx_aligned = ((rxq->rx_tail & 1) == 0);
> +
> +	rxdp += rxq->rx_tail;
> +
> +	rte_prefetch0(rxdp);
> +
> +	/* nb_pkts has to be floor-aligned to IDPF_DESCS_PER_LOOP_AVX */
> +	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, IDPF_DESCS_PER_LOOP_AVX);
> +
> +	/* See if we need to rearm the RX queue - gives the prefetch a bit
> +	 * of time to act
> +	 */
> +	if (rxq->rxrearm_nb > IDPF_RXQ_REARM_THRESH)
> +		idpf_singleq_rx_rearm(rxq);
> +
> +	/* Before we start moving massive data around, check to see if
> +	 * there is actually a packet available
> +	 */
> +	if (!(rxdp->flex_nic_wb.status_error0 &
> +	      rte_cpu_to_le_32(1 << VIRTCHNL2_RX_FLEX_DESC_STATUS0_DD_S)))
> +		return 0;
> +
> +	/* 8 packets DD mask, LSB in each 32-bit value */
> +	const __m256i dd_check = _mm256_set1_epi32(1);
> +
> +	/* 8 packets EOP mask, second-LSB in each 32-bit value */
> +	const __m256i eop_check = _mm256_slli_epi32(dd_check,
> +						    VIRTCHNL2_RX_FLEX_DESC_STATUS0_EOF_S);
> +
> +	/* mask to shuffle from desc. to mbuf (2 descriptors) */
> +	const __m256i shuf_msk =
> +		_mm256_set_epi8
> +			(/* first descriptor */
> +			 0xFF, 0xFF,
> +			 0xFF, 0xFF,	/* rss hash parsed separately */
> +			 11, 10,	/* octet 10~11, 16 bits vlan_macip */
> +			 5, 4,		/* octet 4~5, 16 bits data_len */
> +			 0xFF, 0xFF,	/* skip hi 16 bits pkt_len, zero out */
> +			 5, 4,		/* octet 4~5, 16 bits pkt_len */
> +			 0xFF, 0xFF,	/* pkt_type set as unknown */
> +			 0xFF, 0xFF,	/* pkt_type set as unknown */
> +			 /* second descriptor */
> +			 0xFF, 0xFF,
> +			 0xFF, 0xFF,	/* rss hash parsed separately */
> +			 11, 10,	/* octet 10~11, 16 bits vlan_macip */
> +			 5, 4,		/* octet 4~5, 16 bits data_len */
> +			 0xFF, 0xFF,	/* skip hi 16 bits pkt_len, zero out */
> +			 5, 4,		/* octet 4~5, 16 bits pkt_len */
> +			 0xFF, 0xFF,	/* pkt_type set as unknown */
> +			 0xFF, 0xFF	/* pkt_type set as unknown */
> +			);
> +	/**
> +	 * compile-time check the above crc and shuffle layout is correct.
> +	 * NOTE: the first field (lowest address) is given last in set_epi
> +	 * calls above.
> +	 */
> +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
> +			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
> +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
> +			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
> +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, vlan_tci) !=
> +			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 10);
> +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
> +			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
> +
> +	/* Status/Error flag masks */
> +	/**
> +	 * mask everything except Checksum Reports, RSS indication
> +	 * and VLAN indication.
> +	 * bit6:4 for IP/L4 checksum errors.
> +	 * bit12 is for RSS indication.
> +	 * bit13 is for VLAN indication.
> +	 */
> +	const __m256i flags_mask =
> +		_mm256_set1_epi32((0xF << 4) | (1 << 12) | (1 << 13));
> +	/**
> +	 * data to be shuffled by the result of the flags mask shifted by 4
> +	 * bits. This gives us the l3_l4 flags.
> +	 */
> +	const __m256i l3_l4_flags_shuf =
> +		_mm256_set_epi8((RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 |
> +		 RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		/**
> +		 * second 128-bits
> +		 * shift right 20 bits to use the low two bits to indicate
> +		 * outer checksum status
> +		 * shift right 1 bit to make sure it not exceed 255
> +		 */
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_BAD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_L4_CKSUM_GOOD | RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_BAD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_BAD) >> 1,
> +		(RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD >> 20 | RTE_MBUF_F_RX_L4_CKSUM_GOOD |
> +		 RTE_MBUF_F_RX_IP_CKSUM_GOOD) >> 1);
> +	const __m256i cksum_mask =
> +		_mm256_set1_epi32(RTE_MBUF_F_RX_IP_CKSUM_MASK |
> +				  RTE_MBUF_F_RX_L4_CKSUM_MASK |
> +				  RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD |
> +				  RTE_MBUF_F_RX_OUTER_L4_CKSUM_MASK);
> +	/**
> +	 * data to be shuffled by result of flag mask, shifted down 12.
> +	 * If RSS(bit12)/VLAN(bit13) are set,
> +	 * shuffle moves appropriate flags in place.
> +	 */
> +	const __m256i rss_vlan_flags_shuf = _mm256_set_epi8(0, 0, 0, 0,
> +			0, 0, 0, 0,
> +			0, 0, 0, 0,
> +			RTE_MBUF_F_RX_RSS_HASH | RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED,
> +			RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED,
> +			RTE_MBUF_F_RX_RSS_HASH, 0,
> +			/* end up 128-bits */
> +			0, 0, 0, 0,
> +			0, 0, 0, 0,
> +			0, 0, 0, 0,
> +			RTE_MBUF_F_RX_RSS_HASH | RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED,
> +			RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED,
> +			RTE_MBUF_F_RX_RSS_HASH, 0);
> +
> +	RTE_SET_USED(avx_aligned); /* for 32B descriptors we don't use this */
> +

Does the driver or HW support 16B descriptors? If not, just remove this variable completely. Don't keep it just for consistency with other drivers.

[SHAIQ] - The avx_aligned variable will be retained, since 16B descriptors are supported by the HW.
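For reference, other Intel AVX2 Rx paths (i40e, for example) use this flag to select an aligned 256-bit load path when 16B descriptors are in use, since two 16B descriptors then fit in one register. A rough sketch of that idea - purely illustrative and not part of this patch, as the 16B-descriptor path is not implemented here:

	/* Hypothetical 16B-descriptor path: when rx_tail is even, rxdp is
	 * 32-byte aligned, so two descriptors can be fetched with one load.
	 */
	if (avx_aligned) {
		raw_desc0_1 = _mm256_load_si256((const __m256i *)(rxdp + 0));
		raw_desc2_3 = _mm256_load_si256((const __m256i *)(rxdp + 2));
	}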
> +	uint16_t i, received;
> +
> +	for (i = 0, received = 0; i < nb_pkts;
> +	     i += IDPF_DESCS_PER_LOOP_AVX,
> +	     rxdp += IDPF_DESCS_PER_LOOP_AVX) {
> +		/* step 1, copy over 8 mbuf pointers to rx_pkts array */
> +		_mm256_storeu_si256((void *)&rx_pkts[i],
> +				    _mm256_loadu_si256((void *)&sw_ring[i]));
> +#ifdef RTE_ARCH_X86_64
> +		_mm256_storeu_si256
> +			((void *)&rx_pkts[i + 4],
> +			 _mm256_loadu_si256((void *)&sw_ring[i + 4]));
> +#endif
> +
> +		__m256i raw_desc0_1, raw_desc2_3, raw_desc4_5, raw_desc6_7;
> +
> +		const __m128i raw_desc7 =
> +			_mm_load_si128((void *)(rxdp + 7));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc6 =
> +			_mm_load_si128((void *)(rxdp + 6));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc5 =
> +			_mm_load_si128((void *)(rxdp + 5));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc4 =
> +			_mm_load_si128((void *)(rxdp + 4));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc3 =
> +			_mm_load_si128((void *)(rxdp + 3));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc2 =
> +			_mm_load_si128((void *)(rxdp + 2));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc1 =
> +			_mm_load_si128((void *)(rxdp + 1));
> +		rte_compiler_barrier();
> +		const __m128i raw_desc0 =
> +			_mm_load_si128((void *)(rxdp + 0));
> +

Here and in a number of other places, I think you can reduce the amount of word-wrapping being done. Unlike when the first AVX2 code was being written, we can now use up to 100 characters per line.

[SHAIQ] - We will reduce excessive word wrapping in some sections to enhance readability.
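For example, the descriptor loads above fit comfortably on single lines within the 100-character limit; the v2 code will look roughly like this (style sketch only, not the final code):

	const __m128i raw_desc7 = _mm_load_si128((void *)(rxdp + 7));
	rte_compiler_barrier();
	const __m128i raw_desc6 = _mm_load_si128((void *)(rxdp + 6));
	rte_compiler_barrier();
	const __m128i raw_desc5 = _mm_load_si128((void *)(rxdp + 5));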
> + */ > + __m256i mb6_7 =3D _mm256_shuffle_epi8(raw_desc6_7, shuf_msk); > + __m256i mb4_5 =3D _mm256_shuffle_epi8(raw_desc4_5, shuf_msk); > + > + /** > + * to get packet types, ptype is located in bit16-25 > + * of each 128bits > + */ > + const __m256i ptype_mask =3D > + _mm256_set1_epi16(VIRTCHNL2_RX_FLEX_DESC_PTYPE_M); > + const __m256i ptypes6_7 =3D > + _mm256_and_si256(raw_desc6_7, ptype_mask); > + const __m256i ptypes4_5 =3D > + _mm256_and_si256(raw_desc4_5, ptype_mask); > + const uint16_t ptype7 =3D _mm256_extract_epi16(ptypes6_7, 9); > + const uint16_t ptype6 =3D _mm256_extract_epi16(ptypes6_7, 1); > + const uint16_t ptype5 =3D _mm256_extract_epi16(ptypes4_5, 9); > + const uint16_t ptype4 =3D _mm256_extract_epi16(ptypes4_5, 1); > + > + mb6_7 =3D _mm256_insert_epi32(mb6_7, ptype_tbl[ptype7], 4); > + mb6_7 =3D _mm256_insert_epi32(mb6_7, ptype_tbl[ptype6], 0); > + mb4_5 =3D _mm256_insert_epi32(mb4_5, ptype_tbl[ptype5], 4); > + mb4_5 =3D _mm256_insert_epi32(mb4_5, ptype_tbl[ptype4], 0); > + /* merge the status bits into one register */ > + const __m256i status4_7 =3D _mm256_unpackhi_epi32(raw_desc6_7, > + raw_desc4_5); > + > + /** > + * convert descriptors 0-3 into mbufs, re-arrange fields. > + * Then write into the mbuf. > + */ > + __m256i mb2_3 =3D _mm256_shuffle_epi8(raw_desc2_3, shuf_msk); > + __m256i mb0_1 =3D _mm256_shuffle_epi8(raw_desc0_1, shuf_msk); > + > + /** > + * to get packet types, ptype is located in bit16-25 > + * of each 128bits > + */ > + const __m256i ptypes2_3 =3D > + _mm256_and_si256(raw_desc2_3, ptype_mask); > + const __m256i ptypes0_1 =3D > + _mm256_and_si256(raw_desc0_1, ptype_mask); > + const uint16_t ptype3 =3D _mm256_extract_epi16(ptypes2_3, 9); > + const uint16_t ptype2 =3D _mm256_extract_epi16(ptypes2_3, 1); > + const uint16_t ptype1 =3D _mm256_extract_epi16(ptypes0_1, 9); > + const uint16_t ptype0 =3D _mm256_extract_epi16(ptypes0_1, 1); > + > + mb2_3 =3D _mm256_insert_epi32(mb2_3, ptype_tbl[ptype3], 4); > + mb2_3 =3D _mm256_insert_epi32(mb2_3, ptype_tbl[ptype2], 0); > + mb0_1 =3D _mm256_insert_epi32(mb0_1, ptype_tbl[ptype1], 4); > + mb0_1 =3D _mm256_insert_epi32(mb0_1, ptype_tbl[ptype0], 0); > + /* merge the status bits into one register */ > + const __m256i status0_3 =3D _mm256_unpackhi_epi32(raw_desc2_3, > + raw_desc0_1); > + > + /** > + * take the two sets of status bits and merge to one > + * After merge, the packets status flags are in the > + * order (hi->lo): [1, 3, 5, 7, 0, 2, 4, 6] > + */ > + __m256i status0_7 =3D _mm256_unpacklo_epi64(status4_7, > + status0_3); > + > + /* now do flag manipulation */ > + > + /* get only flag/error bits we want */ > + const __m256i flag_bits =3D > + _mm256_and_si256(status0_7, flags_mask); > + /** > + * l3_l4_error flags, shuffle, then shift to correct adjustment > + * of flags in flags_shuf, and finally mask out extra bits > + */ > + __m256i l3_l4_flags =3D _mm256_shuffle_epi8(l3_l4_flags_shuf, > + _mm256_srli_epi32(flag_bits, 4)); > + l3_l4_flags =3D _mm256_slli_epi32(l3_l4_flags, 1); > + > + __m256i l4_outer_mask =3D _mm256_set1_epi32(0x6); > + __m256i l4_outer_flags =3D > + _mm256_and_si256(l3_l4_flags, l4_outer_mask); > + l4_outer_flags =3D _mm256_slli_epi32(l4_outer_flags, 20); > + > + __m256i l3_l4_mask =3D _mm256_set1_epi32(~0x6); > + l3_l4_flags =3D _mm256_and_si256(l3_l4_flags, l3_l4_mask); > + l3_l4_flags =3D _mm256_or_si256(l3_l4_flags, l4_outer_flags); > + l3_l4_flags =3D _mm256_and_si256(l3_l4_flags, cksum_mask); > + /* set rss and vlan flags */ > + const __m256i rss_vlan_flag_bits =3D > + 
> +		const __m256i rss_vlan_flag_bits =
> +			_mm256_srli_epi32(flag_bits, 12);
> +		const __m256i rss_vlan_flags =
> +			_mm256_shuffle_epi8(rss_vlan_flags_shuf,
> +					    rss_vlan_flag_bits);
> +
> +		/* merge flags */
> +		__m256i mbuf_flags = _mm256_or_si256(l3_l4_flags,
> +						     rss_vlan_flags);
> +
> +		/**
> +		 * At this point, we have the 8 sets of flags in the low 16-bits
> +		 * of each 32-bit value in vlan0.
> +		 * We want to extract these, and merge them with the mbuf init
> +		 * data so we can do a single write to the mbuf to set the flags
> +		 * and all the other initialization fields. Extracting the
> +		 * appropriate flags means that we have to do a shift and blend
> +		 * for each mbuf before we do the write. However, we can also
> +		 * add in the previously computed rx_descriptor fields to
> +		 * make a single 256-bit write per mbuf
> +		 */
> +		/* check the structure matches expectations */
> +		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) !=
> +				 offsetof(struct rte_mbuf, rearm_data) + 8);
> +		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) !=
> +				 RTE_ALIGN(offsetof(struct rte_mbuf,
> +						    rearm_data),
> +					   16));
> +		/* build up data and do writes */
> +		__m256i rearm0, rearm1, rearm2, rearm3, rearm4, rearm5,
> +			rearm6, rearm7;
> +		rearm6 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_slli_si256(mbuf_flags, 8),
> +					    0x04);
> +		rearm4 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_slli_si256(mbuf_flags, 4),
> +					    0x04);
> +		rearm2 = _mm256_blend_epi32(mbuf_init, mbuf_flags, 0x04);
> +		rearm0 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_srli_si256(mbuf_flags, 4),
> +					    0x04);
> +		/* permute to add in the rx_descriptor e.g. rss fields */
> +		rearm6 = _mm256_permute2f128_si256(rearm6, mb6_7, 0x20);
> +		rearm4 = _mm256_permute2f128_si256(rearm4, mb4_5, 0x20);
> +		rearm2 = _mm256_permute2f128_si256(rearm2, mb2_3, 0x20);
> +		rearm0 = _mm256_permute2f128_si256(rearm0, mb0_1, 0x20);
> +		/* write to mbuf */
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 6]->rearm_data,
> +				    rearm6);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 4]->rearm_data,
> +				    rearm4);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 2]->rearm_data,
> +				    rearm2);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 0]->rearm_data,
> +				    rearm0);
> +
> +		/* repeat for the odd mbufs */
> +		const __m256i odd_flags =
> +			_mm256_castsi128_si256
> +				(_mm256_extracti128_si256(mbuf_flags, 1));
> +		rearm7 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_slli_si256(odd_flags, 8),
> +					    0x04);
> +		rearm5 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_slli_si256(odd_flags, 4),
> +					    0x04);
> +		rearm3 = _mm256_blend_epi32(mbuf_init, odd_flags, 0x04);
> +		rearm1 = _mm256_blend_epi32(mbuf_init,
> +					    _mm256_srli_si256(odd_flags, 4),
> +					    0x04);
> +		/* since odd mbufs are already in hi 128-bits use blend */
> +		rearm7 = _mm256_blend_epi32(rearm7, mb6_7, 0xF0);
> +		rearm5 = _mm256_blend_epi32(rearm5, mb4_5, 0xF0);
> +		rearm3 = _mm256_blend_epi32(rearm3, mb2_3, 0xF0);
> +		rearm1 = _mm256_blend_epi32(rearm1, mb0_1, 0xF0);
> +		/* again write to mbufs */
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 7]->rearm_data,
> +				    rearm7);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 5]->rearm_data,
> +				    rearm5);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 3]->rearm_data,
> +				    rearm3);
> +		_mm256_storeu_si256((__m256i *)&rx_pkts[i + 1]->rearm_data,
> +				    rearm1);
> +
> +		/* extract and record EOP bit */
> +		if (split_packet) {
> +			const __m128i eop_mask =
> +				_mm_set1_epi16(1 << VIRTCHNL2_RX_FLEX_DESC_STATUS0_EOF_S);
> +			const __m256i eop_bits256 = _mm256_and_si256(status0_7,
> +								     eop_check);
> +			/* pack status bits into a single 128-bit register */
> +			const __m128i eop_bits =
> +				_mm_packus_epi32
> +					(_mm256_castsi256_si128(eop_bits256),
> +					 _mm256_extractf128_si256(eop_bits256,
> +								  1));
> +			/**
> +			 * flip bits, and mask out the EOP bit, which is now
> +			 * a split-packet bit i.e. !EOP, rather than EOP one.
> +			 */
> +			__m128i split_bits = _mm_andnot_si128(eop_bits,
> +							      eop_mask);
> +			/**
> +			 * eop bits are out of order, so we need to shuffle them
> +			 * back into order again. In doing so, only use low 8
> +			 * bits, which acts like another pack instruction
> +			 * The original order is (hi->lo): 1,3,5,7,0,2,4,6
> +			 * [Since we use epi8, the 16-bit positions are
> +			 * multiplied by 2 in the eop_shuffle value.]
> +			 */
> +			__m128i eop_shuffle =
> +				_mm_set_epi8(/* zero hi 64b */
> +					     0xFF, 0xFF, 0xFF, 0xFF,
> +					     0xFF, 0xFF, 0xFF, 0xFF,
> +					     /* move values to lo 64b */
> +					     8, 0, 10, 2,
> +					     12, 4, 14, 6);
> +			split_bits = _mm_shuffle_epi8(split_bits, eop_shuffle);
> +			*(uint64_t *)split_packet =
> +				_mm_cvtsi128_si64(split_bits);
> +			split_packet += IDPF_DESCS_PER_LOOP_AVX;
> +		}

As above, if there are no plans to support multi-buffer packet reassembly, drop this.

[SHAIQ] - We plan to drop this.

> +
> +		/* perform dd_check */
> +		status0_7 = _mm256_and_si256(status0_7, dd_check);
> +		status0_7 = _mm256_packs_epi32(status0_7,
> +					       _mm256_setzero_si256());
> +
> +		uint64_t burst = rte_popcount64
> +					(_mm_cvtsi128_si64
> +						(_mm256_extracti128_si256
> +							(status0_7, 1)));
> +		burst += rte_popcount64
> +				(_mm_cvtsi128_si64
> +					(_mm256_castsi256_si128(status0_7)));
> +		received += burst;
> +		if (burst != IDPF_DESCS_PER_LOOP_AVX)
> +			break;
> +	}
> +
> +	/* update tail pointers */
> +	rxq->rx_tail += received;
> +	rxq->rx_tail &= (rxq->nb_rx_desc - 1);
> +	if ((rxq->rx_tail & 1) == 1 && received > 1) { /* keep avx2 aligned */
> +		rxq->rx_tail--;
> +		received--;
> +	}
> +	rxq->rxrearm_nb += received;
> +	return received;
> +}
> +
> +/**
> + * Notice:
> + * - nb_pkts < IDPF_DESCS_PER_LOOP, just return no packet
> + */
> +uint16_t
> +idpf_dp_singleq_recv_pkts_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
> +			       uint16_t nb_pkts)
> +{
> +	return _idpf_singleq_recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
> +}
> diff --git a/drivers/common/idpf/meson.build b/drivers/common/idpf/meson.build
> index 46fd45c03b..4caa06a9b7 100644
> --- a/drivers/common/idpf/meson.build
> +++ b/drivers/common/idpf/meson.build
> @@ -16,6 +16,21 @@ sources = files(
>  )
> 
>  if arch_subdir == 'x86'
> +    # compile AVX2 version if either:
> +    # a. we have AVX supported in minimum instruction set baseline
> +    # b. it's not minimum instruction set, but supported by compiler
> +    if cc.get_define('__AVX2__', args: machine_args) != ''
> +        cflags += ['-DCC_AVX2_SUPPORT']
> +        sources += files('idpf_common_rxtx_avx2.c')
> +    elif cc.has_argument('-mavx2')
> +        cflags += ['-DCC_AVX2_SUPPORT']
> +        idpf_avx2_lib = static_library('idpf_avx2_lib',
> +                'idpf_common_rxtx_avx2.c',
> +                dependencies: [static_rte_ethdev, static_rte_kvargs, static_rte_hash],
> +                include_directories: includes,
> +                c_args: [cflags, '-mavx2'])
> +        objs += idpf_avx2_lib.extract_objects('idpf_common_rxtx_avx2.c')
> +    endif
>      if cc_has_avx512
>          cflags += ['-DCC_AVX512_SUPPORT']
>          avx512_args = cflags + cc_avx512_flags

This logic is out-of-date, since all supported compilers have AVX2 support. Suggest reworking using drivers/net/ice/meson.build as a reference.

[SHAIQ] - The meson.build script will be reworked, using drivers/net/ice/meson.build as a reference.

> diff --git a/drivers/common/idpf/version.map b/drivers/common/idpf/version.map
> index 0729f6b912..4510aae6b3 100644
> --- a/drivers/common/idpf/version.map
> +++ b/drivers/common/idpf/version.map
> @@ -14,6 +14,7 @@ INTERNAL {
>  	idpf_dp_splitq_recv_pkts_avx512;
>  	idpf_dp_splitq_xmit_pkts;
>  	idpf_dp_splitq_xmit_pkts_avx512;
> +	idpf_dp_singleq_recv_pkts_avx2;
> 

This list should be alphabetical, so singleq should go before splitq.

[SHAIQ] - Will address the issue in v2.

>  	idpf_qc_rx_thresh_check;
>  	idpf_qc_rx_queue_release;
> diff --git a/drivers/net/idpf/idpf_rxtx.c b/drivers/net/idpf/idpf_rxtx.c
> index 858bbefe3b..80c6c325e8 100644
> --- a/drivers/net/idpf/idpf_rxtx.c
> +++ b/drivers/net/idpf/idpf_rxtx.c
> @@ -776,6 +776,11 @@ idpf_set_rx_function(struct rte_eth_dev *dev)
>  	    rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128) {
>  		vport->rx_vec_allowed = true;
> 
> +		if ((rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 ||
> +		     rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1) &&
> +		    rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_256)
> +			vport->rx_use_avx2 = true;
> +

There are no CPUs that support AVX512 without supporting AVX2 - and if there were, we probably couldn't use an AVX2 code path on them anyway. Therefore only check the AVX2 flag and the bitwidth.

[SHAIQ] - Will address in v2.
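Concretely, the v2 check should reduce to something like this (sketch of the suggested change, not the final code):

	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1 &&
	    rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_256)
		vport->rx_use_avx2 = true;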
>  		if (rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_512)
>  #ifdef CC_AVX512_SUPPORT
>  			if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) == 1 &&
> @@ -827,6 +832,13 @@ idpf_set_rx_function(struct rte_eth_dev *dev)
>  			return;
>  		}
>  #endif /* CC_AVX512_SUPPORT */
> +		if (vport->rx_use_avx2) {
> +			PMD_DRV_LOG(NOTICE,
> +				    "Using Single AVX2 Vector Rx (port %d).",
> +				    dev->data->port_id);
> +			dev->rx_pkt_burst = idpf_dp_singleq_recv_pkts_avx2;
> +			return;
> +		}
>  	}
> 
>  	if (dev->data->scattered_rx) {
> --
> 2.34.1