From: "Ji, Kai" <kai.ji@intel.com>
To: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Cc: dev@dpdk.org
Subject: Re: [PATCH 0/5] OpenSSL PMD Optimisations
Date: Mon, 24 Jun 2024 16:14:17 +0000
Message-ID: <DS0PR11MB7458170E176873D1243E1CC081D42@DS0PR11MB7458.namprd11.prod.outlook.com>
In-Reply-To: <20240603160119.1279476-1-jack.bond-preston@foss.arm.com>
List-Id: DPDK patches and discussions <dev.dpdk.org>
Series-acked-by: Kai Ji <kai.ji@intel.com>

________________________________
From: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Sent: 03 June 2024 17:01
Cc: dev@dpdk.org
Subject: [PATCH 0/5] OpenSSL PMD Optimisations

The current implementation of the OpenSSL PMD has numerous performance
issues. These revolve around certain operations being performed on a per
buffer/packet basis, when they could in fact be performed less often -
usually just during initialisation.


[1/5]: fix GCM and CCM thread unsafe ctxs
=========================================
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is
implemented in the same naive (and inefficient) way as the existing fixes
for other ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
=========================================
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR.
[5/5]: only set cipher padding once
===================================
Fixes an inefficient usage of the OpenSSL API when disabling padding for
ciphers. This behaviour was introduced in commit 6b283a03216e
("crypto/openssl: fix extra bytes written at end of data"), which fixes a
bug - however, the EVP_CIPHER_CTX_set_padding() call was placed in a
suboptimal location.

This patch fixes this, preventing the padding from being disabled for the
cipher twice per buffer (with the second call essentially being a wasteful
no-op).


[3/5] and [4/5]: per-queue-pair context clones
==============================================
[3/5] and [4/5] aim to fix the key issue identified with the performance of
the OpenSSL PMD - the cloning of OpenSSL CTX structures on a per-buffer
basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session contexts
>
>     Session contexts are used for temporary storage when processing a
>     packet.
>     If packets for the same session are to be processed simultaneously on
>     multiple cores, separate contexts must be used.
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no longer be defined as a
>     variable on the stack: it must be allocated. This in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be
used from multiple threads simultaneously, so this patch is required for
correctness (assuming the need to support using the same openssl_session
across multiple lcores). The downside is that, as the commit message notes,
this reduces performance quite significantly.
It is worth noting that while contexts were already correctly cloned for
cipher ops and auth ops, this behaviour was actually absent for combined
ops (AES-GCM and AES-CCM), due to this part of the fix being reverted in
75adf1eae44f ("crypto/openssl: update HMAC routine with 3.0 EVP API").
[1/5] addresses this correctness issue, and [3/5] implements a more
performant fix on top of it.

These two patches aim to remedy the performance loss caused by the
introduction of cipher context cloning. The approach taken is to maintain
an array of pointers, inside the OpenSSL session structure, to
per-queue-pair clones of the OpenSSL CTXs. Consequently, there is no need
to clone the context for every buffer, whilst keeping the guarantee that
one context is never used on multiple lcores simultaneously. The cloning of
the main context into the array's per-qp context entries is performed
lazily, as needed. Some trade-offs/judgement calls were made:
 - The first op on a queue pair from a given openssl_session will be
   roughly equivalent to an op from the existing implementation. However,
   all subsequent ops for the same openssl_session on the same queue pair
   will not incur this extra work. Thus, whilst the first op on a session
   on a queue pair will be slower than subsequent ones, this slower first
   op is still equivalent to *every* op without these patches. The
   alternative would be pre-populating this array when the openssl_session
   is initialised, but this would waste memory and processing time if not
   all queue pairs end up doing work from this openssl_session.
 - Each pointer inside the array of per-queue-pair pointers is not cache
   aligned, because updates only occur on the first buffer per queue pair
   per session, making the impact of false sharing negligible compared to
   the extra memory usage of the alignment.
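The lazy per-queue-pair lookup described above can be sketched as follows.
This is a minimal standalone model: the names (session, qp_ctx, get_qp_ctx,
clone_ctx) and the malloc stand-in are illustrative only, not the actual
PMD symbols - the real code clones OpenSSL EVP contexts.

```c
#include <assert.h>
#include <stdlib.h>

static int clone_count; /* counts how many clones were actually made */

/* Stand-in for cloning an OpenSSL EVP context from the session's main
 * context (e.g. via EVP_CIPHER_CTX_copy in the real PMD). */
static void *clone_ctx(void *main_ctx)
{
    (void)main_ctx;
    clone_count++;
    return malloc(1);
}

struct session {
    void *main_ctx;  /* context initialised at session creation      */
    void **qp_ctx;   /* one lazily-populated clone slot per qp       */
};

/* Return the clone for this queue pair, creating it on first use only.
 * No locking is needed because a queue pair is only ever polled by one
 * lcore at a time. */
static void *get_qp_ctx(struct session *s, unsigned int qp_id)
{
    if (s->qp_ctx[qp_id] == NULL)
        s->qp_ctx[qp_id] = clone_ctx(s->main_ctx);
    return s->qp_ctx[qp_id];
}
```

Only the first op per (session, queue pair) pays the cloning cost; every
subsequent op is a pointer load.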
[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and
[4/5] for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).

Compared to before, this approach comes with the drawback of extra memory
usage, the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, with a
  length equal to the number of qps in use multiplied by 2 (to allow auth
  and cipher contexts), per openssl_session structure.
  openssl_pmd_sym_session_get_size() is modified to return a size large
  enough to support this. At the time this function is called (before the
  user creates the session mempool), the PMD may not yet be configured with
  the requested number of queue pairs. In this case, the maximum number of
  queue pairs allowed by the PMD (current default is 8) is used, to ensure
  the allocations will be large enough. Thus, the user may be able to
  slightly reduce the memory used by OpenSSL sessions by first configuring
  the PMD's queue pair count, then requesting the size of the sessions and
  creating the session mempool. There is also a special case where the
  number of queue pairs is 1, in which case the array is not allocated or
  used at all. Overall, the memory usage by the session structure itself is
  worst-case 128 bytes per session (the default maximum number of queue
  pairs allowed by the OpenSSL PMD is 8, so 8 qps * 8 bytes * 2 ctxs), plus
  the extra space to store the length of the array and the auth context
  offset, resulting in an increase in total size from 152 bytes to 280
  bytes.
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously,
  the clones were allocated and freed per-operation, so the lifetime of the
  allocations was only the duration of the operation. Now, these
  allocations are lifted out to share the lifetime of the session. As a
  result, workloads with many long-lived sessions shared across many queue
  pairs will see an increase in total memory usage.
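The worst-case session-size arithmetic quoted above can be sanity-checked
with a few constants. The values mirror the cover letter's figures; the
names are illustrative, not the PMD's actual macros.

```c
#include <assert.h>

enum {
    MAX_QPS       = 8,   /* default max queue pairs allowed by the PMD */
    CTXS_PER_QP   = 2,   /* one cipher + one auth context pointer      */
    PTR_SIZE      = 8,   /* bytes per pointer on a 64-bit target       */
    OLD_SESS_SIZE = 152, /* session struct size before the patches     */
    NEW_SESS_SIZE = 280  /* session struct size after the patches      */
};

/* Worst-case bytes consumed by the per-qp context pointer array. */
static int ptr_array_bytes(void)
{
    return MAX_QPS * CTXS_PER_QP * PTR_SIZE; /* 8 * 2 * 8 = 128 */
}
```

The 128-byte pointer array accounts for the full growth from 152 to 280
bytes, with the array length and auth context offset fitting into the
structure's existing padding.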
Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
 - The version of OpenSSL used was 3.3.0
 - The hardware used for the benchmarks was the following two machine
   configs:
     * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
     * x86    : Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
 - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024,
   2048, 4096, 8192.
 - The worker lcore counts tested were: 1, 2, 4, 8
 - The algorithms and associated operations tested were:
     * Cipher-only AES-CBC-128 (Encrypt and Decrypt)
     * Cipher-only 3DES-CTR-128 (Encrypt only)
     * Auth-only SHA1-HMAC (Generate only)
     * Auth-only AES-CMAC (Generate only)
     * AESNI AES-GCM-128 (Encrypt and Decrypt)
     * Cipher-then-Auth AES-CBC-128-HMAC-SHA1 (Encrypt only)
 - EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.

The sets of patches applied for benchmarks were:
 - No patches applied (HEAD of upstream main)
 - [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
 - [1-2/5] applied (adds 3DES-CTR fix)
 - [1-3/5] applied (adds per-qp cipher contexts)
 - [1-4/5] applied (adds per-qp auth contexts)
 - [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm
platform, with all patches applied. Very similar results were achieved on
the Intel platform, and the full set of results, including the Intel ones,
is available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput speedup achieved between the base (main
branch HEAD) and optimised (all patches applied) versions of the PMD was
carried out, with varying worker lcore counts.
1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |        0.84 |             2.04 | 144.6% |
|            64 |        1.61 |             3.72 | 131.3% |
|           128 |        2.97 |             6.24 | 110.2% |
|           256 |        5.14 |             9.42 |  83.2% |
|           512 |        8.10 |            12.62 |  55.7% |
|          1024 |       11.37 |            15.18 |  33.5% |
|          2048 |       14.26 |            16.93 |  18.7% |
|          4096 |       16.35 |            17.97 |   9.9% |
|          8192 |       17.61 |            18.51 |   5.1% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |        1.53 |            16.49 | 974.8% |
|            64 |        3.04 |            29.85 | 881.3% |
|           128 |        5.96 |            50.07 | 739.8% |
|           256 |       10.54 |            75.53 | 616.5% |
|           512 |       21.60 |           101.14 | 368.2% |
|          1024 |       41.27 |           121.56 | 194.6% |
|          2048 |       72.99 |           135.40 |  85.5% |
|          4096 |      103.39 |           143.76 |  39.0% |
|          8192 |      125.48 |           148.06 |  18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so the existing PMD
implementation was profiled with multiple lcores. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing the context. OpenSSL holds only one instance of each
EVP_CIPHER, and uses a reference counter to track freeing it. This means
the original implementation spends a very large amount of time incrementing
and decrementing this reference counter, in EVP_CIPHER_CTX_copy and
EVP_CIPHER_CTX_free respectively. For small buffer sizes, and with more
lcores, this reference count modification happens extremely frequently -
thrashing the refcount cache line across all lcores and causing a huge
slowdown. The optimised version avoids this by not performing the copy and
free (and thus the associated refcount modifications) on every buffer.
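The scale of the refcount traffic removed can be illustrated with a simple
counting model (purely illustrative arithmetic, not OpenSSL code): the old
scheme touches the shared EVP_CIPHER refcount twice per buffer (copy +
free), while the per-qp scheme touches it roughly twice per (session, queue
pair), independent of buffer count.

```c
#include <assert.h>

/* Atomic refcount operations on the shared EVP_CIPHER under the old
 * per-buffer cloning scheme: one increment in EVP_CIPHER_CTX_copy and one
 * decrement in EVP_CIPHER_CTX_free, for every buffer processed. */
static long refcount_ops_per_buffer(long buffers)
{
    return 2 * buffers;
}

/* Under the per-qp scheme, the clone (and its eventual free) happens once
 * per queue pair using the session, regardless of buffer count. */
static long refcount_ops_per_qp(long qps)
{
    return 2 * qps;
}
```

For a million buffers across 8 queue pairs, that is two million contended
atomic operations reduced to sixteen, which is consistent with the much
larger speedups observed at high lcore counts and small buffer sizes.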
SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |        0.32 |             0.76 |  135.9% |
|            64 |        0.63 |             1.43 |  126.9% |
|           128 |        1.21 |             2.60 |  115.4% |
|           256 |        2.23 |             4.42 |   98.1% |
|           512 |        3.88 |             6.80 |   75.5% |
|          1024 |        6.13 |             9.30 |   51.8% |
|          2048 |        8.65 |            11.39 |   31.7% |
|          4096 |       10.90 |            12.85 |   17.9% |
|          8192 |       12.54 |            13.74 |    9.5% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |        0.49 |             5.99 | 1110.3% |
|            64 |        0.98 |            11.30 | 1051.8% |
|           128 |        1.95 |            20.67 |  960.3% |
|           256 |        3.90 |            35.18 |  802.4% |
|           512 |        7.83 |            54.13 |  590.9% |
|          1024 |       15.80 |            74.11 |  369.2% |
|          2048 |       31.30 |            90.97 |  190.6% |
|          4096 |       58.59 |           102.70 |   75.3% |
|          8192 |       85.93 |           109.88 |   27.9% |

We can see that the results are similar to those for AES-CBC-128 cipher
operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] on its own causes a slowdown in AES-GCM,
as the fix for the concurrency bug introduces a large overhead.

1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            64 |        2.60 |             1.31 | -49.5% |
|           256 |        7.69 |             4.45 | -42.1% |
|          1024 |       15.33 |            11.30 | -26.3% |
|          2048 |       18.74 |            15.37 | -18.0% |
|          4096 |       21.11 |            18.80 | -10.9% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            64 |       19.94 |             2.83 | -85.8% |
|           256 |       58.84 |            11.00 | -81.3% |
|          1024 |      119.71 |            42.46 | -64.5% |
|          2048 |      147.69 |            80.91 | -45.2% |
|          4096 |      167.39 |           121.25 | -27.6% |

However, applying [3/5] rectifies most of this performance drop, as shown
by the following results with it applied.
1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |        1.39 |             1.28 |  -7.8% |
|            64 |        2.60 |             2.44 |  -6.2% |
|           128 |        4.77 |             4.45 |  -6.8% |
|           256 |        7.69 |             7.22 |  -6.1% |
|           512 |       11.31 |            10.97 |  -3.0% |
|          1024 |       15.33 |            15.07 |  -1.7% |
|          2048 |       18.74 |            18.51 |  -1.2% |
|          4096 |       21.11 |            20.96 |  -0.7% |
|          8192 |       22.55 |            22.50 |  -0.2% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |       10.59 |            10.35 |  -2.3% |
|            64 |       19.94 |            19.46 |  -2.4% |
|           128 |       36.32 |            35.64 |  -1.9% |
|           256 |       58.84 |            57.80 |  -1.8% |
|           512 |       87.38 |            87.37 |  -0.0% |
|          1024 |      119.71 |           120.22 |   0.4% |
|          2048 |      147.69 |           147.93 |   0.2% |
|          4096 |      167.39 |           167.48 |   0.1% |
|          8192 |      179.80 |           179.87 |   0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small
slowdown at smaller buffer sizes. This represents the overhead required to
make AES-GCM thread safe. These patches rectify the lack of safety without
causing a significant performance impact, especially compared to naive
per-buffer cipher context cloning.
3DES-CTR Encrypt
----------------
1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |        0.12 |             0.22 |  89.7% |
|            64 |        0.16 |             0.22 |  43.6% |
|           128 |        0.18 |             0.23 |  22.3% |
|           256 |        0.20 |             0.23 |  10.8% |
|           512 |        0.21 |             0.23 |   5.1% |
|          1024 |        0.22 |             0.23 |   2.7% |
|          2048 |        0.22 |             0.23 |   1.3% |
|          4096 |        0.23 |             0.23 |   0.4% |
|          8192 |        0.23 |             0.23 |   0.4% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift |
|---------------+-------------+------------------+--------|
|            32 |        0.68 |             1.77 | 160.1% |
|            64 |        1.00 |             1.78 |  78.3% |
|           128 |        1.29 |             1.80 |  39.6% |
|           256 |        1.50 |             1.80 |  19.8% |
|           512 |        1.64 |             1.80 |  10.0% |
|          1024 |        1.72 |             1.81 |   5.1% |
|          2048 |        1.76 |             1.81 |   2.7% |
|          4096 |        1.78 |             1.81 |   1.5% |
|          8192 |        1.80 |             1.81 |   0.7% |

[2/5] yields good results - the performance increase is high for smaller
buffer sizes, where the cost of re-initialising the extra parameters is
more significant relative to the cost of the cipher operation itself.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated
speedup tables, plus additional bar charts showing the throughput
comparison across different sets of applied patches) - for both the Intel
and Arm platforms - are available. However, I'm not sure of the etiquette
regarding attachments of such files, so I haven't attached them for now.
If you are interested in reviewing them, please reach out and I will find
a way to get them to you.
Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 244 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 ++-
 4 files changed, 260 insertions(+), 71 deletions(-)

--
2.34.1
Series-acked-by: Kai Ji <kai.ji@intel.com>

From: Jack Bond-Preston <= ;jack.bond-preston@foss.arm.com>
Sent: 03 June 2024 17:01
Cc: dev@dpdk.org <dev@dpdk.org>
Subject: [PATCH 0/5] OpenSSL PMD Optimisations
 
The current implementation of the OpenSSL PMD has = numerous performance issues.
These revolve around certain operations being performed on a per buffer/pac= ket
basis, when they in fact could be performed less often - usually just durin= g
initialisation.


[1/5]: fix GCM and CCM thread unsafe ctxs
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is<= br> implemented in the same naive (and inefficient) way as existing fixes for o= ther
ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR.


[5/5]: only set cipher padding once
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Fixes an inefficient usage of the OpenSSL API when disabling padding for ciphers. This behaviour was introduced in commit 6b283a03216e ("crypto= /openssl:
fix extra bytes written at end of data"), which fixes a bug - however,= the
EVP_CIPHER_CTX_set_padding() call was placed in a suboptimal location.

This patch fixes this, preventing the padding being disabled for the cipher=
twice per buffer (with the second essentially being a wasteful no-op).


[3/5] and [4/5]: per-queue-pair context clones
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
[3/5] and [4/5] aim to fix the key issue that was identified with the
performance of the OpenSSL PMD - cloning of OpenSSL CTX structures on a
per-buffer basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session con= texts
>
>     Session contexts are used for temporary storag= e when processing a
>     packet.
>     If packets for the same session are to be proc= essed simultaneously on
>     multiple cores, separate contexts must be used= .
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no= longer be defined as a
>     variable on the stack: it must be allocated. T= his in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be = used
from multiple threads simultaneously, so this patch is required for correct= ness
(assuming the need to support using the same openssl_session across multipl= e
lcores). The downside here is that, as the commit message notes, this does<= br> reduce performance quite significantly.

It is worth noting that while ciphers were already correctly cloned for cip= her
ops and auth ops, this behaviour was actually absent for combined ops (AES-= GCM
and AES-CCM), due to this part of the fix being reverted in 75adf1eae44f ("crypto/openssl: update HMAC routine with 3.0 EVP API"). [1/5] a= ddressed this
issue of correctness, and [3/5] implements a more performant fix on top of = this.

These two patches aim to remedy the performance loss caused by the introduc= tion
of cipher context cloning. An approach of maintaining an array of pointers,=
inside the OpenSSL session structure, to per-queue-pair clones of the OpenS= SL
CTXs is used. Consequently, there is no need to perform cloning of the cont= ext
for every buffer - whilst keeping the guarantee that one context is not bei= ng
used on multiple lcores simultaneously. The cloning of the main context int= o the
array's per-qp context entries is performed lazily/as-needed. There are som= e
trade-offs/judgement calls that were made:
 - The first call for a queue pair for an op from a given openssl_sess= ion will
   be roughly equivalent to an op from the existing implementatio= n. However, all
   subsequent calls for the same openssl_session on the same queu= e pair will not
   incur this extra work. Thus, whilst the first op on a session = on a queue pair
   will be slower than subsequent ones, this slower first op is s= till equivalent
   to *every* op without these patches. The alternative would be = pre-populating
   this array when the openssl_session is initialised, but this w= ould waste
   memory and processing time if not all queue pairs end up doing= work from this
   openssl_session.
 - Each pointer inside the array of per-queue-pair pointers has not be= en cache
   aligned, because updates only occur on the first buffer per-qu= eue-pair
   per-session, making the impact of false sharing negligible com= pared to the
   extra memory usage of the alignment.

[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and [4= /5]
for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).

Compared to before, this approach comes with a drawback of extra memory
usage - the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, with a
  length equal to the number of qps in use multiplied by 2 (to allow auth
  and cipher contexts), per openssl_session structure.
  openssl_pmd_sym_session_get_size() is modified to return a size large
  enough to support this. At the time this function is called (before the
  user creates the session mempool), the PMD may not yet be configured
  with the requested number of queue pairs. In this case, the maximum
  number of queue pairs allowed by the PMD (current default is 8) is used,
  to ensure the allocations will be large enough. Thus, the user may be
  able to slightly reduce the memory used by OpenSSL sessions by first
  configuring the PMD's queue pair count, then requesting the size of the
  sessions and creating the session mempool. There is also a special case
  where the number of queue pairs is 1, in which case the array is not
  allocated or used at all. Overall, this memory usage by the session
  structure itself is worst-case 128 bytes per session (the default
  maximum number of queue pairs allowed by the OpenSSL PMD is 8, so
  8 qps * 8 bytes * 2 ctxs), plus the extra space to store the length of
  the array and the auth context offset, resulting in an increase in
  total size from 152 bytes to 280 bytes.
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously,
  the clones were allocated and freed per-operation, meaning the lifetime
  of the allocations was only the duration of the operation. Now, these
  allocations are lifted out to share the lifetime of the session. As a
  result, many long-lived sessions shared across many queue pairs will
  cause an increase in total memory usage.
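The worst-case arithmetic above can be checked mechanically. A tiny sketch
(constants taken from the figures quoted in this letter, assuming 8-byte
pointers on a 64-bit target; names are illustrative, not the PMD's):

```c
#include <assert.h>

/* Constants quoted above: default qp limit of 8, two context slots per
 * qp (one cipher + one auth), 8-byte pointers on a 64-bit target. */
enum {
	DEFAULT_MAX_QPS = 8,
	CTXS_PER_QP     = 2,
	PTR_BYTES       = 8,
};

/* Worst-case size of the per-qp pointer array added to each session. */
static unsigned int worst_case_array_bytes(void)
{
	return DEFAULT_MAX_QPS * CTXS_PER_QP * PTR_BYTES; /* 8 * 2 * 8 */
}
```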


Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
 - The version of OpenSSL used was 3.3.0
 - The hardware used for the benchmarks was the following two machine
   configs:
     * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
     * x86    : Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
 - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024,
   2048, 4096, 8192.
 - The worker lcore counts tested were: 1, 2, 4, 8
 - The algorithms and associated operations tested were:
     * Cipher-only       AES-CBC-128           (Encrypt and Decrypt)
     * Cipher-only       3DES-CTR-128          (Encrypt only)
     * Auth-only         SHA1-HMAC             (Generate only)
     * Auth-only         AES-CMAC              (Generate only)
     * AESNI             AES-GCM-128           (Encrypt and Decrypt)
     * Cipher-then-Auth  AES-CBC-128-HMAC-SHA1 (Encrypt only)
 - EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.
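For reference, a throughput run for one of the configurations above might
look like the following. The flag names follow the dpdk-test-crypto-perf
documentation; the lcore list, binary path, and op count here are
placeholders, not values taken from this letter:

```shell
# AES-CBC-128 encrypt, cipher-only, 1 worker lcore (lcore 2), OpenSSL PMD.
# --legacy-mem matches the EAL Legacy Memory Mode noted above.
./build/app/dpdk-test-crypto-perf \
    -l 1,2 --vdev crypto_openssl --legacy-mem -- \
    --ptest throughput \
    --devtype crypto_openssl \
    --optype cipher-only \
    --cipher-algo aes-cbc \
    --cipher-op encrypt \
    --cipher-key-sz 16 \
    --cipher-iv-sz 16 \
    --buffer-sz 32,64,128,256,512,1024,2048,4096,8192 \
    --burst-sz 32 \
    --total-ops 10000000
```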
The sets of patches applied for benchmarks were:
 - No patches applied (HEAD of upstream main)
 -   [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
 - [1-2/5] applied (adds 3DES-CTR fix)
 - [1-3/5] applied (adds per-qp cipher contexts)
 - [1-4/5] applied (adds per-qp auth contexts)
 - [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm
platform, with all patches applied. Very similar results were achieved on
the Intel platform, and the full set of results, including the Intel ones,
is available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput speedup achieved between the base (main
branch HEAD) and optimised (all patches applied) versions of the PMD was
carried out for each worker lcore count.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.84 |               2.04 |   144.6% |
|              64 |          1.61 |               3.72 |   131.3% |
|             128 |          2.97 |               6.24 |   110.2% |
|             256 |          5.14 |               9.42 |    83.2% |
|             512 |          8.10 |              12.62 |    55.7% |
|            1024 |         11.37 |              15.18 |    33.5% |
|            2048 |         14.26 |              16.93 |    18.7% |
|            4096 |         16.35 |              17.97 |     9.9% |
|            8192 |         17.61 |              18.51 |     5.1% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.53 |              16.49 |   974.8% |
|              64 |          3.04 |              29.85 |   881.3% |
|             128 |          5.96 |              50.07 |   739.8% |
|             256 |         10.54 |              75.53 |   616.5% |
|             512 |         21.60 |             101.14 |   368.2% |
|            1024 |         41.27 |             121.56 |   194.6% |
|            2048 |         72.99 |             135.40 |    85.5% |
|            4096 |        103.39 |             143.76 |    39.0% |
|            8192 |        125.48 |             148.06 |    18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so the existing PMD
implementation was profiled with multiple lcores. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing the context. OpenSSL holds only one instance of
each EVP_CIPHER, and uses a reference counter to track when it can be
freed. This means the original implementation spends a large amount of
time incrementing and decrementing this reference counter, in
EVP_CIPHER_CTX_copy and EVP_CIPHER_CTX_free respectively. For small
buffer sizes, and with more lcores, this refcount modification happens
extremely frequently - thrashing the refcount's cache line across all
lcores and causing a huge slowdown. The optimised version avoids this by
not performing the copy and free (and thus the associated refcount
modifications) on every buffer.
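As a toy model of this behaviour (names invented for illustration;
OpenSSL's real counter lives inside the shared EVP_CIPHER), the old
per-buffer path performs two writes to the same shared counter per op,
from every lcore:

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy model: a single shared cipher implementation with a refcount, as
 * OpenSSL keeps one EVP_CIPHER instance per algorithm. */
static atomic_int shared_cipher_refcnt = 1;

/* Per-buffer clone/free, modelling EVP_CIPHER_CTX_copy/_free: each op
 * makes two atomic writes to the same cache line, from every lcore,
 * which is what thrashes at small buffer sizes and high lcore counts. */
static void per_buffer_op(void)
{
	atomic_fetch_add(&shared_cipher_refcnt, 1); /* clone */
	/* ... process one buffer ... */
	atomic_fetch_sub(&shared_cipher_refcnt, 1); /* free  */
}
```

With the per-qp clone array, these two shared writes happen once per
(session, queue pair) instead of once per buffer.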

SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.32 |               0.76 |   135.9% |
|              64 |          0.63 |               1.43 |   126.9% |
|             128 |          1.21 |               2.60 |   115.4% |
|             256 |          2.23 |               4.42 |    98.1% |
|             512 |          3.88 |               6.80 |    75.5% |
|            1024 |          6.13 |               9.30 |    51.8% |
|            2048 |          8.65 |              11.39 |    31.7% |
|            4096 |         10.90 |              12.85 |    17.9% |
|            8192 |         12.54 |              13.74 |     9.5% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.49 |               5.99 |  1110.3% |
|              64 |          0.98 |              11.30 |  1051.8% |
|             128 |          1.95 |              20.67 |   960.3% |
|             256 |          3.90 |              35.18 |   802.4% |
|             512 |          7.83 |              54.13 |   590.9% |
|            1024 |         15.80 |              74.11 |   369.2% |
|            2048 |         31.30 |              90.97 |   190.6% |
|            4096 |         58.59 |             102.70 |    75.3% |
|            8192 |         85.93 |             109.88 |    27.9% |

The results are similar to those for the AES-CBC-128 cipher operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] causes a slowdown in AES-GCM, as the fix
for the concurrency bug introduces a large overhead.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

However, applying [3/5] rectifies most of this performance drop, as shown
by the following results with it applied.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.39 |               1.28 |    -7.8% |
|              64 |          2.60 |               2.44 |    -6.2% |
|             128 |          4.77 |               4.45 |    -6.8% |
|             256 |          7.69 |               7.22 |    -6.1% |
|             512 |         11.31 |              10.97 |    -3.0% |
|            1024 |         15.33 |              15.07 |    -1.7% |
|            2048 |         18.74 |              18.51 |    -1.2% |
|            4096 |         21.11 |              20.96 |    -0.7% |
|            8192 |         22.55 |              22.50 |    -0.2% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |         10.59 |              10.35 |    -2.3% |
|              64 |         19.94 |              19.46 |    -2.4% |
|             128 |         36.32 |              35.64 |    -1.9% |
|             256 |         58.84 |              57.80 |    -1.8% |
|             512 |         87.38 |              87.37 |    -0.0% |
|            1024 |        119.71 |             120.22 |     0.4% |
|            2048 |        147.69 |             147.93 |     0.2% |
|            4096 |        167.39 |             167.48 |     0.1% |
|            8192 |        179.80 |             179.87 |     0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small
slowdown at smaller buffer sizes. This represents the overhead required to
make AES-GCM thread-safe. These patches rectify the lack of safety without
a significant performance impact, especially compared to naive per-buffer
cipher context cloning.

3DES-CTR Encrypt
----------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.12 |               0.22 |    89.7% |
|              64 |          0.16 |               0.22 |    43.6% |
|             128 |          0.18 |               0.23 |    22.3% |
|             256 |          0.20 |               0.23 |    10.8% |
|             512 |          0.21 |               0.23 |     5.1% |
|            1024 |          0.22 |               0.23 |     2.7% |
|            2048 |          0.22 |               0.23 |     1.3% |
|            4096 |          0.23 |               0.23 |     0.4% |
|            8192 |          0.23 |               0.23 |     0.4% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.68 |               1.77 |   160.1% |
|              64 |          1.00 |               1.78 |    78.3% |
|             128 |          1.29 |               1.80 |    39.6% |
|             256 |          1.50 |               1.80 |    19.8% |
|             512 |          1.64 |               1.80 |    10.0% |
|            1024 |          1.72 |               1.81 |     5.1% |
|            2048 |          1.76 |               1.81 |     2.7% |
|            4096 |          1.78 |               1.81 |     1.5% |
|            8192 |          1.80 |               1.81 |     0.7% |

[2/5] yields good results - the performance increase is high for lower
buffer sizes, where the cost of re-initialising the extra parameters is
more significant relative to the cost of the cipher operation.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated speedup tables, plus additional bar charts showing the throughput compariso= n
across different sets of applied patches) - for both Intel and Arm platform= s -
are available. However, I'm not sure of the ettiquette regarding attachment= s of
such files, so I haven't attached them for now. If you are interested in reviewing them, please reach out and I will find a way to get them to you.<= br>
Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 244 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 ++-
 4 files changed, 260 insertions(+), 71 deletions(-)

--
2.34.1
