From: "Ji, Kai" <kai.ji@intel.com>
To: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Cc: dev@dpdk.org
Subject: Re: [PATCH 0/5] OpenSSL PMD Optimisations
Date: Mon, 24 Jun 2024 16:14:17 +0000
Message-ID: <DS0PR11MB7458170E176873D1243E1CC081D42@DS0PR11MB7458.namprd11.prod.outlook.com>
In-Reply-To: <20240603160119.1279476-1-jack.bond-preston@foss.arm.com>
List-Id: DPDK patches and discussions <dev.dpdk.org>
Series-acked-by: Kai Ji <kai.ji@intel.com>

________________________________
From: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Sent: 03 June 2024 17:01
Cc: dev@dpdk.org
Subject: [PATCH 0/5] OpenSSL PMD Optimisations

The current implementation of the OpenSSL PMD has numerous performance
issues. These revolve around certain operations being performed on a per
buffer/packet basis, when they could in fact be performed less often -
usually just during initialisation.


[1/5]: fix GCM and CCM thread unsafe ctxs
=========================================
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is
implemented in the same naive (and inefficient) way as the existing fixes
for other ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
=========================================
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR.
[5/5]: only set cipher padding once
===================================
Fixes an inefficient usage of the OpenSSL API when disabling padding for
ciphers. This behaviour was introduced in commit 6b283a03216e
("crypto/openssl: fix extra bytes written at end of data"), which fixes a
bug - however, the EVP_CIPHER_CTX_set_padding() call was placed in a
suboptimal location.

This patch fixes this, preventing the padding from being disabled for the
cipher twice per buffer (with the second call essentially being a wasteful
no-op).


[3/5] and [4/5]: per-queue-pair context clones
==============================================
[3/5] and [4/5] aim to fix the key issue identified with the performance of
the OpenSSL PMD - the cloning of OpenSSL CTX structures on a per-buffer
basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session contexts
>
>     Session contexts are used for temporary storage when processing a
>     packet.
>     If packets for the same session are to be processed simultaneously on
>     multiple cores, separate contexts must be used.
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no longer be defined as a
>     variable on the stack: it must be allocated. This in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be
used from multiple threads simultaneously, so this patch is required for
correctness (assuming the need to support using the same openssl_session
across multiple lcores). The downside is that, as the commit message notes,
this reduces performance quite significantly.
It is worth noting that while contexts were already correctly cloned for
cipher ops and auth ops, this behaviour was actually absent for combined
ops (AES-GCM and AES-CCM), due to this part of the fix being reverted in
75adf1eae44f ("crypto/openssl: update HMAC routine with 3.0 EVP API").
[1/5] addresses this correctness issue, and [3/5] implements a more
performant fix on top of it.

These two patches aim to remedy the performance loss caused by the
introduction of cipher context cloning. The approach taken is to maintain
an array of pointers, inside the OpenSSL session structure, to
per-queue-pair clones of the OpenSSL CTXs. Consequently, there is no need
to clone the context for every buffer, whilst keeping the guarantee that
one context is never used on multiple lcores simultaneously. The cloning of
the main context into the array's per-qp context entries is performed
lazily, as needed. Some trade-offs/judgement calls were made:
 - The first op on a queue pair from a given openssl_session will be
   roughly equivalent to an op from the existing implementation. However,
   all subsequent ops for the same openssl_session on the same queue pair
   will not incur this extra work. Thus, whilst the first op on a session
   on a queue pair will be slower than subsequent ones, this slower first
   op is still equivalent to *every* op without these patches. The
   alternative would be pre-populating this array when the openssl_session
   is initialised, but this would waste memory and processing time if not
   all queue pairs end up doing work from this openssl_session.
 - Each pointer inside the array of per-queue-pair pointers is not cache
   aligned, because updates only occur on the first buffer per queue pair
   per session, making the impact of false sharing negligible compared to
   the extra memory usage of the alignment.
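The lazy per-queue-pair lookup described above can be sketched as follows.
This is a minimal standalone model: the names (session, qp_ctx, get_qp_ctx,
clone_ctx) and the malloc stand-in are illustrative only, not the actual
PMD symbols - the real code clones OpenSSL EVP contexts.

```c
#include <assert.h>
#include <stdlib.h>

static int clone_count; /* counts how many clones were actually made */

/* Stand-in for cloning an OpenSSL EVP context from the session's main
 * context (e.g. via EVP_CIPHER_CTX_copy in the real PMD). */
static void *clone_ctx(void *main_ctx)
{
    (void)main_ctx;
    clone_count++;
    return malloc(1);
}

struct session {
    void *main_ctx;  /* context initialised at session creation      */
    void **qp_ctx;   /* one lazily-populated clone slot per qp       */
};

/* Return the clone for this queue pair, creating it on first use only.
 * No locking is needed because a queue pair is only ever polled by one
 * lcore at a time. */
static void *get_qp_ctx(struct session *s, unsigned int qp_id)
{
    if (s->qp_ctx[qp_id] == NULL)
        s->qp_ctx[qp_id] = clone_ctx(s->main_ctx);
    return s->qp_ctx[qp_id];
}
```

Only the first op per (session, queue pair) pays the cloning cost; every
subsequent op is a pointer load.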
[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and
[4/5] for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).

Compared to before, this approach comes with the drawback of extra memory
usage, the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, with a
  length equal to the number of qps in use multiplied by 2 (to allow auth
  and cipher contexts), per openssl_session structure.
  openssl_pmd_sym_session_get_size() is modified to return a size large
  enough to support this. At the time this function is called (before the
  user creates the session mempool), the PMD may not yet be configured with
  the requested number of queue pairs. In this case, the maximum number of
  queue pairs allowed by the PMD (current default is 8) is used, to ensure
  the allocations will be large enough. Thus, the user may be able to
  slightly reduce the memory used by OpenSSL sessions by first configuring
  the PMD's queue pair count, then requesting the size of the sessions and
  creating the session mempool. There is also a special case where the
  number of queue pairs is 1, in which case the array is not allocated or
  used at all. Overall, the memory usage by the session structure itself is
  worst-case 128 bytes per session (the default maximum number of queue
  pairs allowed by the OpenSSL PMD is 8, so 8 qps * 8 bytes * 2 ctxs), plus
  the extra space to store the length of the array and the auth context
  offset, resulting in an increase in total size from 152 bytes to 280
  bytes.
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously,
  the clones were allocated and freed per-operation, so the lifetime of the
  allocations was only the duration of the operation. Now, these
  allocations are lifted out to share the lifetime of the session. As a
  result, workloads with many long-lived sessions shared across many queue
  pairs will see an increase in total memory usage.
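The worst-case session-size arithmetic quoted above can be sanity-checked
with a few constants. The values mirror the cover letter's figures; the
names are illustrative, not the PMD's actual macros.

```c
#include <assert.h>

enum {
    MAX_QPS       = 8,   /* default max queue pairs allowed by the PMD */
    CTXS_PER_QP   = 2,   /* one cipher + one auth context pointer      */
    PTR_SIZE      = 8,   /* bytes per pointer on a 64-bit target       */
    OLD_SESS_SIZE = 152, /* session struct size before the patches     */
    NEW_SESS_SIZE = 280  /* session struct size after the patches      */
};

/* Worst-case bytes consumed by the per-qp context pointer array. */
static int ptr_array_bytes(void)
{
    return MAX_QPS * CTXS_PER_QP * PTR_SIZE; /* 8 * 2 * 8 = 128 */
}
```

The 128-byte pointer array accounts for the full growth from 152 to 280
bytes, with the array length and auth context offset fitting into the
structure's existing padding.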
Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
 - The version of OpenSSL used was 3.3.0
 - The hardware used for the benchmarks was the following two machine
   configs:
     * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
     * x86    : Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
 - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024,
   2048, 4096, 8192.
 - The worker lcore counts tested were: 1, 2, 4, 8
 - The algorithms and associated operations tested were:
     * Cipher-only AES-CBC-128 (Encrypt and Decrypt)
     * Cipher-only 3DES-CTR-128 (Encrypt only)
     * Auth-only SHA1-HMAC (Generate only)
     * Auth-only AES-CMAC (Generate only)
     * AESNI AES-GCM-128 (Encrypt and Decrypt)
     * Cipher-then-Auth AES-CBC-128-HMAC-SHA1 (Encrypt only)
 - EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.

The sets of patches applied for benchmarks were:
 - No patches applied (HEAD of upstream main)
 - [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
 - [1-2/5] applied (adds 3DES-CTR fix)
 - [1-3/5] applied (adds per-qp cipher contexts)
 - [1-4/5] applied (adds per-qp auth contexts)
 - [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm
platform, with all patches applied. Very similar results were achieved on
the Intel platform, and the full set of results, including the Intel ones,
is available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput speedup achieved between the base (main
branch HEAD) and optimised (all patches applied) versions of the PMD was
carried out, with varying worker lcore counts.
1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |        0.84 |             2.04 | 144.6% |
|            64 |        1.61 |             3.72 | 131.3% |
|           128 |        2.97 |             6.24 | 110.2% |
|           256 |        5.14 |             9.42 |  83.2% |
|           512 |        8.10 |            12.62 |  55.7% |
|          1024 |       11.37 |            15.18 |  33.5% |
|          2048 |       14.26 |            16.93 |  18.7% |
|          4096 |       16.35 |            17.97 |   9.9% |
|          8192 |       17.61 |            18.51 |   5.1% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |        1.53 |            16.49 | 974.8% |
|            64 |        3.04 |            29.85 | 881.3% |
|           128 |        5.96 |            50.07 | 739.8% |
|           256 |       10.54 |            75.53 | 616.5% |
|           512 |       21.60 |           101.14 | 368.2% |
|          1024 |       41.27 |           121.56 | 194.6% |
|          2048 |       72.99 |           135.40 |  85.5% |
|          4096 |      103.39 |           143.76 |  39.0% |
|          8192 |      125.48 |           148.06 |  18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so the existing PMD
implementation was profiled with multiple lcores. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing the context. OpenSSL holds only one instance of each
EVP_CIPHER, and uses a reference counter to track freeing it. This means
the original implementation spends a very large amount of time incrementing
and decrementing this reference counter, in EVP_CIPHER_CTX_copy and
EVP_CIPHER_CTX_free respectively. For small buffer sizes, and with more
lcores, this reference count modification happens extremely frequently -
thrashing the refcount cache line across all lcores and causing a huge
slowdown. The optimised version avoids this by not performing the copy and
free (and thus the associated refcount modifications) on every buffer.
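The scale of the refcount traffic removed can be illustrated with a simple
counting model (purely illustrative arithmetic, not OpenSSL code): the old
scheme touches the shared EVP_CIPHER refcount twice per buffer (copy +
free), while the per-qp scheme touches it roughly twice per (session, queue
pair), independent of buffer count.

```c
#include <assert.h>

/* Atomic refcount operations on the shared EVP_CIPHER under the old
 * per-buffer cloning scheme: one increment in EVP_CIPHER_CTX_copy and one
 * decrement in EVP_CIPHER_CTX_free, for every buffer processed. */
static long refcount_ops_per_buffer(long buffers)
{
    return 2 * buffers;
}

/* Under the per-qp scheme, the clone (and its eventual free) happens once
 * per queue pair using the session, regardless of buffer count. */
static long refcount_ops_per_qp(long qps)
{
    return 2 * qps;
}
```

For a million buffers across 8 queue pairs, that is two million contended
atomic operations reduced to sixteen, which is consistent with the much
larger speedups observed at high lcore counts and small buffer sizes.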
SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |        0.32 |             0.76 |  135.9% |
|            64 |        0.63 |             1.43 |  126.9% |
|           128 |        1.21 |             2.60 |  115.4% |
|           256 |        2.23 |             4.42 |   98.1% |
|           512 |        3.88 |             6.80 |   75.5% |
|          1024 |        6.13 |             9.30 |   51.8% |
|          2048 |        8.65 |            11.39 |   31.7% |
|          4096 |       10.90 |            12.85 |   17.9% |
|          8192 |       12.54 |            13.74 |    9.5% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |        0.49 |             5.99 | 1110.3% |
|            64 |        0.98 |            11.30 | 1051.8% |
|           128 |        1.95 |            20.67 |  960.3% |
|           256 |        3.90 |            35.18 |  802.4% |
|           512 |        7.83 |            54.13 |  590.9% |
|          1024 |       15.80 |            74.11 |  369.2% |
|          2048 |       31.30 |            90.97 |  190.6% |
|          4096 |       58.59 |           102.70 |   75.3% |
|          8192 |       85.93 |           109.88 |   27.9% |

We can see that the results are similar to those for AES-CBC-128 cipher
operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] on its own causes a slowdown in AES-GCM,
as the fix for the concurrency bug introduces a large overhead.

1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            64 |        2.60 |             1.31 | -49.5% |
|           256 |        7.69 |             4.45 | -42.1% |
|          1024 |       15.33 |            11.30 | -26.3% |
|          2048 |       18.74 |            15.37 | -18.0% |
|          4096 |       21.11 |            18.80 | -10.9% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            64 |       19.94 |             2.83 | -85.8% |
|           256 |       58.84 |            11.00 | -81.3% |
|          1024 |      119.71 |            42.46 | -64.5% |
|          2048 |      147.69 |            80.91 | -45.2% |
|          4096 |      167.39 |           121.25 | -27.6% |

However, applying [3/5] rectifies most of this performance drop, as shown
by the following results with it applied.
1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |        1.39 |             1.28 |  -7.8% |
|            64 |        2.60 |             2.44 |  -6.2% |
|           128 |        4.77 |             4.45 |  -6.8% |
|           256 |        7.69 |             7.22 |  -6.1% |
|           512 |       11.31 |            10.97 |  -3.0% |
|          1024 |       15.33 |            15.07 |  -1.7% |
|          2048 |       18.74 |            18.51 |  -1.2% |
|          4096 |       21.11 |            20.96 |  -0.7% |
|          8192 |       22.55 |            22.50 |  -0.2% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |       10.59 |            10.35 |  -2.3% |
|            64 |       19.94 |            19.46 |  -2.4% |
|           128 |       36.32 |            35.64 |  -1.9% |
|           256 |       58.84 |            57.80 |  -1.8% |
|           512 |       87.38 |            87.37 |  -0.0% |
|          1024 |      119.71 |           120.22 |   0.4% |
|          2048 |      147.69 |           147.93 |   0.2% |
|          4096 |      167.39 |           167.48 |   0.1% |
|          8192 |      179.80 |           179.87 |   0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small
slowdown at smaller buffer sizes. This represents the overhead required to
make AES-GCM thread safe. These patches rectify the lack of safety without
causing a significant performance impact, especially compared to naive
per-buffer cipher context cloning.
3DES-CTR Encrypt
----------------
1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |        0.12 |             0.22 |  89.7% |
|            64 |        0.16 |             0.22 |  43.6% |
|           128 |        0.18 |             0.23 |  22.3% |
|           256 |        0.20 |             0.23 |  10.8% |
|           512 |        0.21 |             0.23 |   5.1% |
|          1024 |        0.22 |             0.23 |   2.7% |
|          2048 |        0.22 |             0.23 |   1.3% |
|          4096 |        0.23 |             0.23 |   0.4% |
|          8192 |        0.23 |             0.23 |   0.4% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |        0.68 |             1.77 | 160.1% |
|            64 |        1.00 |             1.78 |  78.3% |
|           128 |        1.29 |             1.80 |  39.6% |
|           256 |        1.50 |             1.80 |  19.8% |
|           512 |        1.64 |             1.80 |  10.0% |
|          1024 |        1.72 |             1.81 |   5.1% |
|          2048 |        1.76 |             1.81 |   2.7% |
|          4096 |        1.78 |             1.81 |   1.5% |
|          8192 |        1.80 |             1.81 |   0.7% |

[2/5] yields good results - the performance increase is high for smaller
buffer sizes, where the cost of re-initialising the extra parameters is
more significant relative to the cost of the cipher operation itself.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated
speedup tables, plus additional bar charts showing the throughput
comparison across different sets of applied patches) - for both the Intel
and Arm platforms - are available. However, I'm not sure of the etiquette
regarding attachments of such files, so I haven't attached them for now.
If you are interested in reviewing them, please reach out and I will find
a way to get them to you.
Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 244 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 ++-
 4 files changed, 260 insertions(+), 71 deletions(-)

--
2.34.1
Series-acked-by: Kai Ji <kai.ji@intel.com>

From: Jack Bond-Preston <= ;jack.bond-preston@foss.arm.com>
Sent: 03 June 2024 17:01
Cc: dev@dpdk.org <dev@dpdk.org>
Subject: [PATCH 0/5] OpenSSL PMD Optimisations
 
The current implementation of the OpenSSL PMD has = numerous performance issues.
These revolve around certain operations being performed on a per buffer/pac= ket
basis, when they in fact could be performed less often - usually just durin= g
initialisation.


[1/5]: fix GCM and CCM thread unsafe ctxs
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is<= br> implemented in the same naive (and inefficient) way as existing fixes for o= ther
ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR.


[5/5]: only set cipher padding once
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Fixes an inefficient usage of the OpenSSL API when disabling padding for ciphers. This behaviour was introduced in commit 6b283a03216e ("crypto= /openssl:
fix extra bytes written at end of data"), which fixes a bug - however,= the
EVP_CIPHER_CTX_set_padding() call was placed in a suboptimal location.

This patch fixes this, preventing the padding being disabled for the cipher=
twice per buffer (with the second essentially being a wasteful no-op).


[3/5] and [4/5]: per-queue-pair context clones
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
[3/5] and [4/5] aim to fix the key issue that was identified with the
performance of the OpenSSL PMD - cloning of OpenSSL CTX structures on a
per-buffer basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session con= texts
>
>     Session contexts are used for temporary storag= e when processing a
>     packet.
>     If packets for the same session are to be proc= essed simultaneously on
>     multiple cores, separate contexts must be used= .
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no= longer be defined as a
>     variable on the stack: it must be allocated. T= his in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be = used
from multiple threads simultaneously, so this patch is required for correct= ness
(assuming the need to support using the same openssl_session across multipl= e
lcores). The downside here is that, as the commit message notes, this does<= br> reduce performance quite significantly.

It is worth noting that while ciphers were already correctly cloned for cip= her
ops and auth ops, this behaviour was actually absent for combined ops (AES-= GCM
and AES-CCM), due to this part of the fix being reverted in 75adf1eae44f ("crypto/openssl: update HMAC routine with 3.0 EVP API"). [1/5] a= ddressed this
issue of correctness, and [3/5] implements a more performant fix on top of = this.

These two patches aim to remedy the performance loss caused by the introduc= tion
of cipher context cloning. An approach of maintaining an array of pointers,=
inside the OpenSSL session structure, to per-queue-pair clones of the OpenS= SL
CTXs is used. Consequently, there is no need to perform cloning of the cont= ext
for every buffer - whilst keeping the guarantee that one context is not bei= ng
used on multiple lcores simultaneously. The cloning of the main context int= o the
array's per-qp context entries is performed lazily/as-needed. There are som= e
trade-offs/judgement calls that were made:
 - The first call for a queue pair for an op from a given openssl_sess= ion will
   be roughly equivalent to an op from the existing implementatio= n. However, all
   subsequent calls for the same openssl_session on the same queu= e pair will not
   incur this extra work. Thus, whilst the first op on a session = on a queue pair
   will be slower than subsequent ones, this slower first op is s= till equivalent
   to *every* op without these patches. The alternative would be = pre-populating
   this array when the openssl_session is initialised, but this w= ould waste
   memory and processing time if not all queue pairs end up doing= work from this
   openssl_session.
 - Each pointer inside the array of per-queue-pair pointers has not be= en cache
   aligned, because updates only occur on the first buffer per-qu= eue-pair
   per-session, making the impact of false sharing negligible com= pared to the
   extra memory usage of the alignment.

[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and [4= /5]
for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).

Compared to before, this approach comes with a drawback of extra memory
usage - the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, with a
  length equal to the number of qps in use multiplied by 2 (to allow auth
  and cipher contexts), per openssl_session structure.
  openssl_pmd_sym_session_get_size() is modified to return a size large
  enough to support this. At the time this function is called (before the
  user creates the session mempool), the PMD may not yet be configured
  with the requested number of queue pairs. In this case, the maximum
  number of queue pairs allowed by the PMD (current default is 8) is used,
  to ensure the allocations will be large enough. Thus, the user may be
  able to slightly reduce the memory used by OpenSSL sessions by first
  configuring the PMD's queue pair count, then requesting the size of the
  sessions and creating the session mempool. There is also a special case
  where the number of queue pairs is 1, in which case the array is not
  allocated or used at all. Overall, this memory usage by the session
  structure itself is worst-case 128 bytes per session (the default
  maximum number of queue pairs allowed by the OpenSSL PMD is 8, so
  8 qps * 8 bytes * 2 ctxs), plus the extra space to store the length of
  the array and the auth context offset, resulting in an increase in
  total size from 152 bytes to 280 bytes.
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously,
  the clones were allocated and freed per-operation, meaning the lifetime
  of the allocations was only the duration of the operation. Now, these
  allocations are lifted out to share the lifetime of the session. As a
  result, many long-lived sessions shared across many queue pairs will
  cause an increase in total memory usage.
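The worst-case arithmetic above can be checked mechanically. A tiny sketch
(constants taken from the figures quoted in this letter, assuming 8-byte
pointers on a 64-bit target; names are illustrative, not the PMD's):

```c
#include <assert.h>

/* Constants quoted above: default qp limit of 8, two context slots per
 * qp (one cipher + one auth), 8-byte pointers on a 64-bit target. */
enum {
	DEFAULT_MAX_QPS = 8,
	CTXS_PER_QP     = 2,
	PTR_BYTES       = 8,
};

/* Worst-case size of the per-qp pointer array added to each session. */
static unsigned int worst_case_array_bytes(void)
{
	return DEFAULT_MAX_QPS * CTXS_PER_QP * PTR_BYTES; /* 8 * 2 * 8 */
}
```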


Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
 - The version of OpenSSL used was 3.3.0
 - The hardware used for the benchmarks was the following two machine
   configs:
     * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
     * x86    : Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
 - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024,
   2048, 4096, 8192.
 - The worker lcore counts tested were: 1, 2, 4, 8
 - The algorithms and associated operations tested were:
     * Cipher-only       AES-CBC-128           (Encrypt and Decrypt)
     * Cipher-only       3DES-CTR-128          (Encrypt only)
     * Auth-only         SHA1-HMAC             (Generate only)
     * Auth-only         AES-CMAC              (Generate only)
     * AESNI             AES-GCM-128           (Encrypt and Decrypt)
     * Cipher-then-Auth  AES-CBC-128-HMAC-SHA1 (Encrypt only)
 - EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.
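For reference, a throughput run for one of the configurations above might
look like the following. The flag names follow the dpdk-test-crypto-perf
documentation; the lcore list, binary path, and op count here are
placeholders, not values taken from this letter:

```shell
# AES-CBC-128 encrypt, cipher-only, 1 worker lcore (lcore 2), OpenSSL PMD.
# --legacy-mem matches the EAL Legacy Memory Mode noted above.
./build/app/dpdk-test-crypto-perf \
    -l 1,2 --vdev crypto_openssl --legacy-mem -- \
    --ptest throughput \
    --devtype crypto_openssl \
    --optype cipher-only \
    --cipher-algo aes-cbc \
    --cipher-op encrypt \
    --cipher-key-sz 16 \
    --cipher-iv-sz 16 \
    --buffer-sz 32,64,128,256,512,1024,2048,4096,8192 \
    --burst-sz 32 \
    --total-ops 10000000
```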
The sets of patches applied for benchmarks were:
 - No patches applied (HEAD of upstream main)
 -   [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
 - [1-2/5] applied (adds 3DES-CTR fix)
 - [1-3/5] applied (adds per-qp cipher contexts)
 - [1-4/5] applied (adds per-qp auth contexts)
 - [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm
platform, with all patches applied. Very similar results were achieved on
the Intel platform, and the full set of results, including the Intel ones,
is available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput speedup achieved between the base (main
branch HEAD) and optimised (all patches applied) versions of the PMD was
carried out for each worker lcore count.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.84 |               2.04 |   144.6% |
|              64 |          1.61 |               3.72 |   131.3% |
|             128 |          2.97 |               6.24 |   110.2% |
|             256 |          5.14 |               9.42 |    83.2% |
|             512 |          8.10 |              12.62 |    55.7% |
|            1024 |         11.37 |              15.18 |    33.5% |
|            2048 |         14.26 |              16.93 |    18.7% |
|            4096 |         16.35 |              17.97 |     9.9% |
|            8192 |         17.61 |              18.51 |     5.1% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.53 |              16.49 |   974.8% |
|              64 |          3.04 |              29.85 |   881.3% |
|             128 |          5.96 |              50.07 |   739.8% |
|             256 |         10.54 |              75.53 |   616.5% |
|             512 |         21.60 |             101.14 |   368.2% |
|            1024 |         41.27 |             121.56 |   194.6% |
|            2048 |         72.99 |             135.40 |    85.5% |
|            4096 |        103.39 |             143.76 |    39.0% |
|            8192 |        125.48 |             148.06 |    18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so the existing PMD
implementation was profiled with multiple lcores. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing the context. OpenSSL holds only one instance of
each EVP_CIPHER, and uses a reference counter to track when it can be
freed. This means the original implementation spends a large amount of
time incrementing and decrementing this reference counter, in
EVP_CIPHER_CTX_copy and EVP_CIPHER_CTX_free respectively. For small
buffer sizes, and with more lcores, this refcount modification happens
extremely frequently - thrashing the refcount's cache line across all
lcores and causing a huge slowdown. The optimised version avoids this by
not performing the copy and free (and thus the associated refcount
modifications) on every buffer.
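As a toy model of this behaviour (names invented for illustration;
OpenSSL's real counter lives inside the shared EVP_CIPHER), the old
per-buffer path performs two writes to the same shared counter per op,
from every lcore:

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy model: a single shared cipher implementation with a refcount, as
 * OpenSSL keeps one EVP_CIPHER instance per algorithm. */
static atomic_int shared_cipher_refcnt = 1;

/* Per-buffer clone/free, modelling EVP_CIPHER_CTX_copy/_free: each op
 * makes two atomic writes to the same cache line, from every lcore,
 * which is what thrashes at small buffer sizes and high lcore counts. */
static void per_buffer_op(void)
{
	atomic_fetch_add(&shared_cipher_refcnt, 1); /* clone */
	/* ... process one buffer ... */
	atomic_fetch_sub(&shared_cipher_refcnt, 1); /* free  */
}
```

With the per-qp clone array, these two shared writes happen once per
(session, queue pair) instead of once per buffer.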

SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.32 |               0.76 |   135.9% |
|              64 |          0.63 |               1.43 |   126.9% |
|             128 |          1.21 |               2.60 |   115.4% |
|             256 |          2.23 |               4.42 |    98.1% |
|             512 |          3.88 |               6.80 |    75.5% |
|            1024 |          6.13 |               9.30 |    51.8% |
|            2048 |          8.65 |              11.39 |    31.7% |
|            4096 |         10.90 |              12.85 |    17.9% |
|            8192 |         12.54 |              13.74 |     9.5% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.49 |               5.99 |  1110.3% |
|              64 |          0.98 |              11.30 |  1051.8% |
|             128 |          1.95 |              20.67 |   960.3% |
|             256 |          3.90 |              35.18 |   802.4% |
|             512 |          7.83 |              54.13 |   590.9% |
|            1024 |         15.80 |              74.11 |   369.2% |
|            2048 |         31.30 |              90.97 |   190.6% |
|            4096 |         58.59 |             102.70 |    75.3% |
|            8192 |         85.93 |             109.88 |    27.9% |

The results are similar to those for the AES-CBC-128 cipher operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] causes a slowdown in AES-GCM, as the fix
for the concurrency bug introduces a large overhead.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

However, applying [3/5] rectifies most of this performance drop, as shown
by the following results with it applied.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.39 |               1.28 |    -7.8% |
|              64 |          2.60 |               2.44 |    -6.2% |
|             128 |          4.77 |               4.45 |    -6.8% |
|             256 |          7.69 |               7.22 |    -6.1% |
|             512 |         11.31 |              10.97 |    -3.0% |
|            1024 |         15.33 |              15.07 |    -1.7% |
|            2048 |         18.74 |              18.51 |    -1.2% |
|            4096 |         21.11 |              20.96 |    -0.7% |
|            8192 |         22.55 |              22.50 |    -0.2% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |         10.59 |              10.35 |    -2.3% |
|              64 |         19.94 |              19.46 |    -2.4% |
|             128 |         36.32 |              35.64 |    -1.9% |
|             256 |         58.84 |              57.80 |    -1.8% |
|             512 |         87.38 |              87.37 |    -0.0% |
|            1024 |        119.71 |             120.22 |     0.4% |
|            2048 |        147.69 |             147.93 |     0.2% |
|            4096 |        167.39 |             167.48 |     0.1% |
|            8192 |        179.80 |             179.87 |     0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small
slowdown at smaller buffer sizes. This represents the overhead required to
make AES-GCM thread-safe. These patches rectify the lack of safety without
a significant performance impact, especially compared to naive per-buffer
cipher context cloning.

3DES-CTR Encrypt
----------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.12 |               0.22 |    89.7% |
|              64 |          0.16 |               0.22 |    43.6% |
|             128 |          0.18 |               0.23 |    22.3% |
|             256 |          0.20 |               0.23 |    10.8% |
|             512 |          0.21 |               0.23 |     5.1% |
|            1024 |          0.22 |               0.23 |     2.7% |
|            2048 |          0.22 |               0.23 |     1.3% |
|            4096 |          0.23 |               0.23 |     0.4% |
|            8192 |          0.23 |               0.23 |     0.4% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.68 |               1.77 |   160.1% |
|              64 |          1.00 |               1.78 |    78.3% |
|             128 |          1.29 |               1.80 |    39.6% |
|             256 |          1.50 |               1.80 |    19.8% |
|             512 |          1.64 |               1.80 |    10.0% |
|            1024 |          1.72 |               1.81 |     5.1% |
|            2048 |          1.76 |               1.81 |     2.7% |
|            4096 |          1.78 |               1.81 |     1.5% |
|            8192 |          1.80 |               1.81 |     0.7% |

[2/5] yields good results - the performance increase is high for lower
buffer sizes, where the cost of re-initialising the extra parameters is
more significant relative to the cost of the cipher operation.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated speedup tables, plus additional bar charts showing the throughput compariso= n
across different sets of applied patches) - for both Intel and Arm platform= s -
are available. However, I'm not sure of the ettiquette regarding attachment= s of
such files, so I haven't attached them for now. If you are interested in reviewing them, please reach out and I will find a way to get them to you.<= br>
Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 244 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 ++-
 4 files changed, 260 insertions(+), 71 deletions(-)

--
2.34.1
