DPDK patches and discussions
* [PATCH 0/5] OpenSSL PMD Optimisations
@ 2024-06-03 16:01 Jack Bond-Preston
  2024-06-03 16:01 ` [PATCH 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
                   ` (9 more replies)
  0 siblings, 10 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 16:01 UTC (permalink / raw)
  Cc: dev

The current implementation of the OpenSSL PMD has numerous performance issues.
These revolve around certain operations being performed on a per buffer/packet
basis, when they in fact could be performed less often - usually just during
initialisation.


[1/5]: fix GCM and CCM thread unsafe ctxs
=========================================
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is
implemented in the same naive (and inefficient) way as existing fixes for other
ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
===========================================
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR.


[5/5]: only set cipher padding once
=====================================
Fixes an inefficient usage of the OpenSSL API when disabling padding for
ciphers. This behaviour was introduced in commit 6b283a03216e ("crypto/openssl:
fix extra bytes written at end of data"), which fixes a bug - however, the
EVP_CIPHER_CTX_set_padding() call was placed in a suboptimal location.

This patch fixes this, preventing the padding being disabled for the cipher
twice per buffer (with the second essentially being a wasteful no-op).


[3/5] and [4/5]: per-queue-pair context clones
==============================================
[3/5] and [4/5] aim to fix the key issue that was identified with the
performance of the OpenSSL PMD - cloning of OpenSSL CTX structures on a
per-buffer basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session contexts
>
>     Session contexts are used for temporary storage when processing a
>     packet.
>     If packets for the same session are to be processed simultaneously on
>     multiple cores, separate contexts must be used.
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no longer be defined as a
>     variable on the stack: it must be allocated. This in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be used
from multiple threads simultaneously, so this patch is required for correctness
(assuming the need to support using the same openssl_session across multiple
lcores). The downside here is that, as the commit message notes, this does
reduce performance quite significantly.

It is worth noting that while contexts were already correctly cloned for cipher
ops and auth ops, this cloning was absent for combined ops (AES-GCM
and AES-CCM), due to this part of the fix being reverted in 75adf1eae44f
("crypto/openssl: update HMAC routine with 3.0 EVP API"). [1/5] addressed this
issue of correctness, and [3/5] implements a more performant fix on top of this.

These two patches aim to remedy the performance loss caused by the introduction
of per-buffer context cloning. The approach taken is to maintain, inside the
OpenSSL session structure, an array of pointers to per-queue-pair clones of the
OpenSSL CTXs. This removes the need to clone the context for every buffer,
whilst preserving the guarantee that a given context is never used on multiple
lcores simultaneously. The main context is cloned into the array's per-qp
entries lazily, as needed. Some trade-offs/judgement calls were made:
 - The first op processed on a given queue pair for a given openssl_session
   costs roughly as much as an op in the existing implementation, because that
   is when the clone is created. All subsequent ops for the same
   openssl_session on the same queue pair avoid this extra work. Thus, whilst
   the first op on a session on a queue pair is slower than subsequent ones, it
   is still no slower than *every* op without these patches. The alternative
   would be to pre-populate the array when the openssl_session is initialised,
   but this would waste memory and processing time if not all queue pairs end
   up doing work from this openssl_session.
 - The entries of the per-queue-pair array are not cache aligned, because they
   are only written on the first buffer per queue pair per session, making the
   impact of false sharing negligible compared to the extra memory that
   alignment would cost.

[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and [4/5]
for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).
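
For illustration, here is a minimal sketch of the lazy per-queue-pair lookup
(an illustrative, simplified version of the get_local_cipher_ctx() helper that
[3/5] adds; the real code below also handles the single-queue-pair case and
uses EVP_CIPHER_CTX_dup() where OpenSSL >= 3.2 provides it):

static inline EVP_CIPHER_CTX *
get_qp_cipher_ctx(struct openssl_session *sess, uint16_t qp_id)
{
	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp_id];

	if (unlikely(*lctx == NULL)) {
		/* First op from this session on this queue pair: clone the
		 * main context once; later ops on this qp reuse the clone.
		 */
		*lctx = EVP_CIPHER_CTX_new();
		if (EVP_CIPHER_CTX_copy(*lctx, sess->cipher.ctx) != 1) {
			EVP_CIPHER_CTX_free(*lctx);
			*lctx = NULL;
		}
	}

	return *lctx;
}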

Compared to before, this approach comes with a drawback of extra memory usage -
the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, whose length
  is the number of qps in use multiplied by 2 (to hold both auth and cipher
  contexts). openssl_pmd_sym_session_get_size()
  is modified to return a size large enough to support this. At the time this
  function is called (before the user creates the session mempool), the PMD may
  not yet be configured with the requested number of queue pairs. In this case,
  the maximum number of queue pairs allowed by the PMD (current default is 8) is
  used, to ensure the allocations will be large enough. Thus, the user may be
  able to slightly reduce the memory used by OpenSSL sessions by first
  configuring the PMD's queue pair count, then requesting the size of the
  sessions and creating the session mempool. There is also a special case where
  the number of queue pairs is 1, in which case the array is not allocated or
  used at all. Overall, the extra memory used by the session structure itself
  is at worst 128 bytes per session (the default maximum number of queue pairs
  allowed by the OpenSSL PMD is 8, so 8 qps * 8 bytes * 2 ctxs), plus the space
  to store the length of the array and the auth context offset, increasing the
  total size from 152 bytes to 280 bytes (see the worked example after this
  list).
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously, the
  clones were allocated and freed per-operation, meaning the lifetime of the
  allocations was only the duration of the operation. Now, these allocations are
  lifted out to share the lifetime of the session. This results in situations
  with many long-lived sessions shared across many queue pairs causing an
  increase in total memory usage.
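
As a worked example of the size figures quoted above (a hypothetical,
standalone illustration only; the real calculation is done by
openssl_pmd_sym_session_get_size(), shown in [3/5] and [4/5] below):

#include <stdio.h>

int main(void)
{
	/* Hypothetical illustration of the worst-case growth described in
	 * this cover letter, assuming the PMD's default maximum of 8 queue
	 * pairs and a 64-bit platform. The 152-byte baseline is the figure
	 * quoted above, not something computed here.
	 */
	size_t max_qps = 8;
	size_t ctxs_per_qp = 2;		/* one cipher + one auth context */
	size_t array_bytes = max_qps * ctxs_per_qp * sizeof(void *);

	printf("per-qp context array:    %zu bytes\n", array_bytes);	/* 128 */
	printf("worst-case session size: %zu bytes\n", 152 + array_bytes);	/* 280 */
	return 0;
}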


Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
 - The version of OpenSSL used was 3.3.0
 - The hardware used for the benchmarks was the following two machine configs:
     * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
     * x86    : Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
 - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024, 2048,
   4096, 8192.
 - The worker lcore counts tested were: 1, 2, 4, 8
 - The algorithms and associated operations tested were:
     * Cipher-only       AES-CBC-128           (Encrypt and Decrypt)
     * Cipher-only       3DES-CTR-128          (Encrypt only)
     * Auth-only         SHA1-HMAC             (Generate only)
     * Auth-only         AES-CMAC              (Generate only)
     * AEAD              AES-GCM-128           (Encrypt and Decrypt)
     * Cipher-then-Auth  AES-CBC-128-HMAC-SHA1 (Encrypt only)
 - EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.

The sets of patches applied for benchmarks were:
 - No patches applied (HEAD of upstream main)
 -   [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
 - [1-2/5] applied (adds 3DES-CTR fix)
 - [1-3/5] applied (adds per-qp cipher contexts)
 - [1-4/5] applied (adds per-qp auth contexts)
 - [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm platform,
with all patches applied. Very similar results were achieved on the Intel
platform, and the full set of results, including the Intel ones, is available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput achieved by the base (main branch HEAD) and
optimised (all patches applied) versions of the PMD was carried out for the
different worker lcore counts.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.84 |               2.04 |   144.6% |
|              64 |          1.61 |               3.72 |   131.3% |
|             128 |          2.97 |               6.24 |   110.2% |
|             256 |          5.14 |               9.42 |    83.2% |
|             512 |          8.10 |              12.62 |    55.7% |
|            1024 |         11.37 |              15.18 |    33.5% |
|            2048 |         14.26 |              16.93 |    18.7% |
|            4096 |         16.35 |              17.97 |     9.9% |
|            8192 |         17.61 |              18.51 |     5.1% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.53 |              16.49 |   974.8% |
|              64 |          3.04 |              29.85 |   881.3% |
|             128 |          5.96 |              50.07 |   739.8% |
|             256 |         10.54 |              75.53 |   616.5% |
|             512 |         21.60 |             101.14 |   368.2% |
|            1024 |         41.27 |             121.56 |   194.6% |
|            2048 |         72.99 |             135.40 |    85.5% |
|            4096 |        103.39 |             143.76 |    39.0% |
|            8192 |        125.48 |             148.06 |    18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so profiling of the existing
PMD implementation with multiple lcores was performed. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing this context. OpenSSL holds only one instance of each
EVP_CIPHER, and uses a reference counter to track when it can be freed. This
means that the original implementation spends a large proportion of its time
incrementing and decrementing this reference counter, in EVP_CIPHER_CTX_copy
and EVP_CIPHER_CTX_free respectively. For small buffer sizes, and with more
lcores,
this reference count modification happens extremely frequently - thrashing this
refcount on all lcores and causing a huge slowdown. The optimised version avoids
this by not performing the copy and free (and thus associated refcount
modifications) on every buffer.
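
For illustration, the per-buffer pattern that the optimised version removes
looks roughly like this (a simplified sketch of the combined-op data path
after [1/5] but before [3/5]; process_one_buffer is an illustrative name - see
the diffs below for the real code):

static void
process_one_buffer(struct openssl_session *sess)
{
	/* Allocate and fill a private context for this buffer only. The copy
	 * takes a reference on the single shared EVP_CIPHER behind the
	 * session's context...
	 */
	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
	EVP_CIPHER_CTX_copy(ctx, sess->cipher.ctx);

	/* ... perform the GCM/CCM operation using ctx ... */

	/* ... and the free drops that reference again, so with many lcores
	 * and small buffers the shared refcount is modified on every packet
	 * from every core.
	 */
	EVP_CIPHER_CTX_free(ctx);
}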

SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.32 |               0.76 |   135.9% |
|              64 |          0.63 |               1.43 |   126.9% |
|             128 |          1.21 |               2.60 |   115.4% |
|             256 |          2.23 |               4.42 |    98.1% |
|             512 |          3.88 |               6.80 |    75.5% |
|            1024 |          6.13 |               9.30 |    51.8% |
|            2048 |          8.65 |              11.39 |    31.7% |
|            4096 |         10.90 |              12.85 |    17.9% |
|            8192 |         12.54 |              13.74 |     9.5% |
8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.49 |               5.99 |  1110.3% |
|              64 |          0.98 |              11.30 |  1051.8% |
|             128 |          1.95 |              20.67 |   960.3% |
|             256 |          3.90 |              35.18 |   802.4% |
|             512 |          7.83 |              54.13 |   590.9% |
|            1024 |         15.80 |              74.11 |   369.2% |
|            2048 |         31.30 |              90.97 |   190.6% |
|            4096 |         58.59 |             102.70 |    75.3% |
|            8192 |         85.93 |             109.88 |    27.9% |

We can see that the results are similar to those for the AES-CBC-128 cipher
operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] causes a slowdown in AES-GCM, as the fix for
the concurrency bug introduces a large overhead.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

However, applying [3/5] rectifies most of this performance drop, as the
following results (with it applied) show.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.39 |               1.28 |    -7.8% |
|              64 |          2.60 |               2.44 |    -6.2% |
|             128 |          4.77 |               4.45 |    -6.8% |
|             256 |          7.69 |               7.22 |    -6.1% |
|             512 |         11.31 |              10.97 |    -3.0% |
|            1024 |         15.33 |              15.07 |    -1.7% |
|            2048 |         18.74 |              18.51 |    -1.2% |
|            4096 |         21.11 |              20.96 |    -0.7% |
|            8192 |         22.55 |              22.50 |    -0.2% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |         10.59 |              10.35 |    -2.3% |
|              64 |         19.94 |              19.46 |    -2.4% |
|             128 |         36.32 |              35.64 |    -1.9% |
|             256 |         58.84 |              57.80 |    -1.8% |
|             512 |         87.38 |              87.37 |    -0.0% |
|            1024 |        119.71 |             120.22 |     0.4% |
|            2048 |        147.69 |             147.93 |     0.2% |
|            4096 |        167.39 |             167.48 |     0.1% |
|            8192 |        179.80 |             179.87 |     0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small slowdown
at smaller buffer sizes. This represents the overhead required to make AES-GCM
thread safe. These patches have rectified this lack of safety without causing a
significant performance impact, especially compared to naive per-buffer cipher
context cloning.

3DES-CTR Encrypt
----------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.12 |               0.22 |    89.7% |
|              64 |          0.16 |               0.22 |    43.6% |
|             128 |          0.18 |               0.23 |    22.3% |
|             256 |          0.20 |               0.23 |    10.8% |
|             512 |          0.21 |               0.23 |     5.1% |
|            1024 |          0.22 |               0.23 |     2.7% |
|            2048 |          0.22 |               0.23 |     1.3% |
|            4096 |          0.23 |               0.23 |     0.4% |
|            8192 |          0.23 |               0.23 |     0.4% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.68 |               1.77 |   160.1% |
|              64 |          1.00 |               1.78 |    78.3% |
|             128 |          1.29 |               1.80 |    39.6% |
|             256 |          1.50 |               1.80 |    19.8% |
|             512 |          1.64 |               1.80 |    10.0% |
|            1024 |          1.72 |               1.81 |     5.1% |
|            2048 |          1.76 |               1.81 |     2.7% |
|            4096 |          1.78 |               1.81 |     1.5% |
|            8192 |          1.80 |               1.81 |     0.7% |

[2/5] yields good results - the performance increase is largest for small
buffer sizes, where the cost of re-initialising the extra parameters is more
significant relative to the cost of the cipher operation itself.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated
speedup tables, plus additional bar charts showing the throughput comparison
across different sets of applied patches) - for both Intel and Arm platforms -
are available. However, I'm not sure of the etiquette regarding attachments of
such files, so I haven't attached them for now. If you are interested in
reviewing them, please reach out and I will find a way to get them to you.

Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 244 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 ++-
 4 files changed, 260 insertions(+), 71 deletions(-)

-- 
2.34.1



* [PATCH 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs
  2024-06-03 16:01 [PATCH 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
@ 2024-06-03 16:01 ` Jack Bond-Preston
  2024-06-03 16:12   ` Jack Bond-Preston
  2024-06-03 16:01 ` [PATCH 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 16:01 UTC (permalink / raw)
  To: Kai Ji, Fan Zhang, Akhil Goyal; +Cc: dev, stable, Wathsala Vithanage

Commit 67ab783b5d70 ("crypto/openssl: use local copy for session
contexts") introduced a fix for concurrency bugs which could occur when
using one OpenSSL PMD session across multiple cores simultaneously. The
solution was to clone the EVP contexts per-buffer to avoid them being
used concurrently.

However, part of commit 75adf1eae44f ("crypto/openssl: update HMAC
routine with 3.0 EVP API") reverted this fix, only for combined ops
(AES-GCM and AES-CCM), with no explanation. This commit fixes the issue
again, essentially reverting this part of the commit.

Throughput performance uplift measurements for AES-GCM-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

Fixes: 75adf1eae44f ("crypto/openssl: update HMAC routine with 3.0 EVP API")
Cc: stable@dpdk.org
Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index e8cb09defc..ca7ed30ec4 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -1590,6 +1590,9 @@ process_openssl_combined_op
 		return;
 	}
 
+	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
+	EVP_CIPHER_CTX_copy(ctx, sess->cipher.ctx);
+
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
 	if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC) {
@@ -1623,12 +1626,12 @@ process_openssl_combined_op
 			status = process_openssl_auth_encryption_gcm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, sess->cipher.ctx);
+					dst, tag, ctx);
 		else
 			status = process_openssl_auth_encryption_ccm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, taglen, sess->cipher.ctx);
+					dst, tag, taglen, ctx);
 
 	} else {
 		if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC ||
@@ -1636,14 +1639,16 @@ process_openssl_combined_op
 			status = process_openssl_auth_decryption_gcm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, sess->cipher.ctx);
+					dst, tag, ctx);
 		else
 			status = process_openssl_auth_decryption_ccm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, taglen, sess->cipher.ctx);
+					dst, tag, taglen, ctx);
 	}
 
+	EVP_CIPHER_CTX_free(ctx);
+
 	if (status != 0) {
 		if (status == (-EFAULT) &&
 				sess->auth.operation ==
-- 
2.34.1



* [PATCH 2/5] crypto/openssl: only init 3DES-CTR key + impl once
  2024-06-03 16:01 [PATCH 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
  2024-06-03 16:01 ` [PATCH 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
@ 2024-06-03 16:01 ` Jack Bond-Preston
  2024-06-03 16:01 ` [PATCH 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 16:01 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently the 3DES-CTR cipher context is initialised for every buffer,
setting the cipher implementation and key - even though for every
buffer in the session these values will be the same.

Change to initialising the cipher context once, before any buffers are
processed, instead.

Throughput performance uplift measurements for 3DES-CTR encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.16 |               0.21 |    35.3% |
|             256 |          0.20 |               0.22 |     9.4% |
|            1024 |          0.22 |               0.23 |     2.3% |
|            2048 |          0.22 |               0.23 |     0.9% |
|            4096 |          0.22 |               0.23 |     0.9% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          1.01 |               1.34 |    32.9% |
|             256 |          1.51 |               1.66 |     9.9% |
|            1024 |          1.72 |               1.77 |     2.6% |
|            2048 |          1.76 |               1.78 |     1.1% |
|            4096 |          1.79 |               1.80 |     0.6% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index ca7ed30ec4..175ffda2b9 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -521,6 +521,15 @@ openssl_set_session_cipher_parameters(struct openssl_session *sess,
 				sess->cipher.key.length,
 				sess->cipher.key.data) != 0)
 			return -EINVAL;
+
+
+		/* We use 3DES encryption also for decryption.
+		 * IV is not important for 3DES ECB.
+		 */
+		if (EVP_EncryptInit_ex(sess->cipher.ctx, EVP_des_ede3_ecb(),
+				NULL, sess->cipher.key.data,  NULL) != 1)
+			return -EINVAL;
+
 		break;
 
 	case RTE_CRYPTO_CIPHER_DES_CBC:
@@ -1136,8 +1145,7 @@ process_openssl_cipher_decrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 /** Process cipher des 3 ctr encryption, decryption algorithm */
 static int
 process_openssl_cipher_des3ctr(struct rte_mbuf *mbuf_src, uint8_t *dst,
-		int offset, uint8_t *iv, uint8_t *key, int srclen,
-		EVP_CIPHER_CTX *ctx)
+		int offset, uint8_t *iv, int srclen, EVP_CIPHER_CTX *ctx)
 {
 	uint8_t ebuf[8], ctr[8];
 	int unused, n;
@@ -1155,12 +1163,6 @@ process_openssl_cipher_des3ctr(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	src = rte_pktmbuf_mtod_offset(m, uint8_t *, offset);
 	l = rte_pktmbuf_data_len(m) - offset;
 
-	/* We use 3DES encryption also for decryption.
-	 * IV is not important for 3DES ecb
-	 */
-	if (EVP_EncryptInit_ex(ctx, EVP_des_ede3_ecb(), NULL, key, NULL) <= 0)
-		goto process_cipher_des3ctr_err;
-
 	memcpy(ctr, iv, 8);
 
 	for (n = 0; n < srclen; n++) {
@@ -1701,8 +1703,7 @@ process_openssl_cipher_op
 					srclen, ctx_copy, inplace);
 	else
 		status = process_openssl_cipher_des3ctr(mbuf_src, dst,
-				op->sym->cipher.data.offset, iv,
-				sess->cipher.key.data, srclen,
+				op->sym->cipher.data.offset, iv, srclen,
 				ctx_copy);
 
 	EVP_CIPHER_CTX_free(ctx_copy);
-- 
2.34.1



* [PATCH 3/5] crypto/openssl: per-qp cipher context clones
  2024-06-03 16:01 [PATCH 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
  2024-06-03 16:01 ` [PATCH 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
  2024-06-03 16:01 ` [PATCH 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
@ 2024-06-03 16:01 ` Jack Bond-Preston
  2024-06-03 16:01 ` [PATCH 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 16:01 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently EVP_CIPHER_CTXs are allocated, copied to (from
openssl_session), and then freed for every cipher operation (ie. per
packet). This is very inefficient, and avoidable.

Make each openssl_session hold an array of pointers to per-queue-pair
cipher context copies. These are populated on first use by allocating a
new context and copying from the main context. These copies can then be
used in a thread-safe manner by different worker lcores simultaneously.
Consequently the cipher context allocation and copy only has to happen
once - the first time a given qp uses an openssl_session. This brings
about a large performance boost.

Throughput performance uplift measurements for AES-CBC-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          1.51 |               2.94 |    94.4% |
|             256 |          4.90 |               8.05 |    64.3% |
|            1024 |         11.07 |              14.21 |    28.3% |
|            2048 |         14.03 |              16.28 |    16.0% |
|            4096 |         16.20 |              17.59 |     8.6% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          3.05 |              23.74 |   678.8% |
|             256 |         10.46 |              64.86 |   520.3% |
|            1024 |         40.97 |             113.80 |   177.7% |
|            2048 |         73.25 |             130.21 |    77.8% |
|            4096 |        103.89 |             140.62 |    35.4% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/openssl_pmd_private.h | 11 ++-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 78 ++++++++++++++------
 drivers/crypto/openssl/rte_openssl_pmd_ops.c | 34 ++++++++-
 3 files changed, 94 insertions(+), 29 deletions(-)

diff --git a/drivers/crypto/openssl/openssl_pmd_private.h b/drivers/crypto/openssl/openssl_pmd_private.h
index 0f038b218c..bad7dcf2f5 100644
--- a/drivers/crypto/openssl/openssl_pmd_private.h
+++ b/drivers/crypto/openssl/openssl_pmd_private.h
@@ -166,6 +166,14 @@ struct __rte_cache_aligned openssl_session {
 		/**< digest length */
 	} auth;
 
+	uint16_t ctx_copies_len;
+	/* < number of entries in ctx_copies */
+	EVP_CIPHER_CTX *qp_ctx[];
+	/**< Flexible array member of per-queue-pair pointers to copies of EVP
+	 * context structure. Cipher contexts are not safe to use from multiple
+	 * cores simultaneously, so maintaining these copies allows avoiding
+	 * per-buffer copying into a temporary context.
+	 */
 };
 
 /** OPENSSL crypto private asymmetric session structure */
@@ -217,7 +225,8 @@ struct __rte_cache_aligned openssl_asym_session {
 /** Set and validate OPENSSL crypto session parameters */
 extern int
 openssl_set_session_parameters(struct openssl_session *sess,
-		const struct rte_crypto_sym_xform *xform);
+		const struct rte_crypto_sym_xform *xform,
+		uint16_t nb_queue_pairs);
 
 /** Reset OPENSSL crypto session parameters */
 extern void
diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 175ffda2b9..ebd1cab667 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -788,7 +788,8 @@ openssl_set_session_aead_parameters(struct openssl_session *sess,
 /** Parse crypto xform chain and set private session parameters */
 int
 openssl_set_session_parameters(struct openssl_session *sess,
-		const struct rte_crypto_sym_xform *xform)
+		const struct rte_crypto_sym_xform *xform,
+		uint16_t nb_queue_pairs)
 {
 	const struct rte_crypto_sym_xform *cipher_xform = NULL;
 	const struct rte_crypto_sym_xform *auth_xform = NULL;
@@ -850,6 +851,12 @@ openssl_set_session_parameters(struct openssl_session *sess,
 		}
 	}
 
+	/*
+	 * With only one queue pair, the array of copies is not needed.
+	 * Otherwise, one entry per queue pair is required.
+	 */
+	sess->ctx_copies_len = nb_queue_pairs > 1 ? nb_queue_pairs : 0;
+
 	return 0;
 }
 
@@ -857,6 +864,13 @@ openssl_set_session_parameters(struct openssl_session *sess,
 void
 openssl_reset_session(struct openssl_session *sess)
 {
+	for (uint16_t i = 0; i < sess->ctx_copies_len; i++) {
+		if (sess->qp_ctx[i] != NULL) {
+			EVP_CIPHER_CTX_free(sess->qp_ctx[i]);
+			sess->qp_ctx[i] = NULL;
+		}
+	}
+
 	EVP_CIPHER_CTX_free(sess->cipher.ctx);
 
 	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
@@ -923,7 +937,7 @@ get_session(struct openssl_qp *qp, struct rte_crypto_op *op)
 		sess = (struct openssl_session *)_sess->driver_priv_data;
 
 		if (unlikely(openssl_set_session_parameters(sess,
-				op->sym->xform) != 0)) {
+				op->sym->xform, 1) != 0)) {
 			rte_mempool_put(qp->sess_mp, _sess);
 			sess = NULL;
 		}
@@ -1571,11 +1585,33 @@ process_openssl_auth_cmac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 # endif
 /*----------------------------------------------------------------------------*/
 
+static inline EVP_CIPHER_CTX *
+get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	/* If the array is not being used, just return the main context. */
+	if (sess->ctx_copies_len == 0)
+		return sess->cipher.ctx;
+
+	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id];
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30200000L
+		/* EVP_CIPHER_CTX_dup() added in OSSL 3.2 */
+		*lctx = EVP_CIPHER_CTX_dup(sess->cipher.ctx);
+#else
+		*lctx = EVP_CIPHER_CTX_new();
+		EVP_CIPHER_CTX_copy(*lctx, sess->cipher.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
 /** Process auth/cipher combined operation */
 static void
-process_openssl_combined_op
-		(struct rte_crypto_op *op, struct openssl_session *sess,
-		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+process_openssl_combined_op(struct openssl_qp *qp, struct rte_crypto_op *op,
+		struct openssl_session *sess, struct rte_mbuf *mbuf_src,
+		struct rte_mbuf *mbuf_dst)
 {
 	/* cipher */
 	uint8_t *dst = NULL, *iv, *tag, *aad;
@@ -1592,8 +1628,7 @@ process_openssl_combined_op
 		return;
 	}
 
-	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
-	EVP_CIPHER_CTX_copy(ctx, sess->cipher.ctx);
+	EVP_CIPHER_CTX *ctx = get_local_cipher_ctx(sess, qp);
 
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
@@ -1649,8 +1684,6 @@ process_openssl_combined_op
 					dst, tag, taglen, ctx);
 	}
 
-	EVP_CIPHER_CTX_free(ctx);
-
 	if (status != 0) {
 		if (status == (-EFAULT) &&
 				sess->auth.operation ==
@@ -1663,14 +1696,13 @@ process_openssl_combined_op
 
 /** Process cipher operation */
 static void
-process_openssl_cipher_op
-		(struct rte_crypto_op *op, struct openssl_session *sess,
-		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+process_openssl_cipher_op(struct openssl_qp *qp, struct rte_crypto_op *op,
+		struct openssl_session *sess, struct rte_mbuf *mbuf_src,
+		struct rte_mbuf *mbuf_dst)
 {
 	uint8_t *dst, *iv;
 	int srclen, status;
 	uint8_t inplace = (mbuf_src == mbuf_dst) ? 1 : 0;
-	EVP_CIPHER_CTX *ctx_copy;
 
 	/*
 	 * Segmented OOP destination buffer is not supported for encryption/
@@ -1689,24 +1721,22 @@ process_openssl_cipher_op
 
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
-	ctx_copy = EVP_CIPHER_CTX_new();
-	EVP_CIPHER_CTX_copy(ctx_copy, sess->cipher.ctx);
+
+	EVP_CIPHER_CTX *ctx = get_local_cipher_ctx(sess, qp);
 
 	if (sess->cipher.mode == OPENSSL_CIPHER_LIB)
 		if (sess->cipher.direction == RTE_CRYPTO_CIPHER_OP_ENCRYPT)
 			status = process_openssl_cipher_encrypt(mbuf_src, dst,
 					op->sym->cipher.data.offset, iv,
-					srclen, ctx_copy, inplace);
+					srclen, ctx, inplace);
 		else
 			status = process_openssl_cipher_decrypt(mbuf_src, dst,
 					op->sym->cipher.data.offset, iv,
-					srclen, ctx_copy, inplace);
+					srclen, ctx, inplace);
 	else
 		status = process_openssl_cipher_des3ctr(mbuf_src, dst,
-				op->sym->cipher.data.offset, iv, srclen,
-				ctx_copy);
+				op->sym->cipher.data.offset, iv, srclen, ctx);
 
-	EVP_CIPHER_CTX_free(ctx_copy);
 	if (status != 0)
 		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
 }
@@ -3111,13 +3141,13 @@ process_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 
 	switch (sess->chain_order) {
 	case OPENSSL_CHAIN_ONLY_CIPHER:
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_ONLY_AUTH:
 		process_openssl_auth_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_CIPHER_AUTH:
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		/* OOP */
 		if (msrc != mdst)
 			copy_plaintext(msrc, mdst, op);
@@ -3125,10 +3155,10 @@ process_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 		break;
 	case OPENSSL_CHAIN_AUTH_CIPHER:
 		process_openssl_auth_op(qp, op, sess, msrc, mdst);
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_COMBINED:
-		process_openssl_combined_op(op, sess, msrc, mdst);
+		process_openssl_combined_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_CIPHER_BPI:
 		process_openssl_docsis_bpi_op(op, sess, msrc, mdst);
diff --git a/drivers/crypto/openssl/rte_openssl_pmd_ops.c b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
index b16baaa08f..4209c6ab6f 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd_ops.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
@@ -794,9 +794,34 @@ openssl_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
 
 /** Returns the size of the symmetric session structure */
 static unsigned
-openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev __rte_unused)
+openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 {
-	return sizeof(struct openssl_session);
+	/*
+	 * For 0 qps, return the max size of the session - this is necessary if
+	 * the user calls into this function to create the session mempool,
+	 * without first configuring the number of qps for the cryptodev.
+	 */
+	if (dev->data->nb_queue_pairs == 0) {
+		unsigned int max_nb_qps = ((struct openssl_private *)
+				dev->data->dev_private)->max_nb_qpairs;
+		return sizeof(struct openssl_session) +
+				(sizeof(void *) * max_nb_qps);
+	}
+
+	/*
+	 * With only one queue pair, the thread safety of multiple context
+	 * copies is not necessary, so don't allocate extra memory for the
+	 * array.
+	 */
+	if (dev->data->nb_queue_pairs == 1)
+		return sizeof(struct openssl_session);
+
+	/*
+	 * Otherwise, the size of the flexible array member should be enough to
+	 * fit pointers to per-qp contexts.
+	 */
+	return sizeof(struct openssl_session) +
+		(sizeof(void *) * dev->data->nb_queue_pairs);
 }
 
 /** Returns the size of the asymmetric session structure */
@@ -808,7 +833,7 @@ openssl_pmd_asym_session_get_size(struct rte_cryptodev *dev __rte_unused)
 
 /** Configure the session from a crypto xform chain */
 static int
-openssl_pmd_sym_session_configure(struct rte_cryptodev *dev __rte_unused,
+openssl_pmd_sym_session_configure(struct rte_cryptodev *dev,
 		struct rte_crypto_sym_xform *xform,
 		struct rte_cryptodev_sym_session *sess)
 {
@@ -820,7 +845,8 @@ openssl_pmd_sym_session_configure(struct rte_cryptodev *dev __rte_unused,
 		return -EINVAL;
 	}
 
-	ret = openssl_set_session_parameters(sess_private_data, xform);
+	ret = openssl_set_session_parameters(sess_private_data, xform,
+			dev->data->nb_queue_pairs);
 	if (ret != 0) {
 		OPENSSL_LOG(ERR, "failed configure session parameters");
 
-- 
2.34.1



* [PATCH 4/5] crypto/openssl: per-qp auth context clones
  2024-06-03 16:01 [PATCH 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                   ` (2 preceding siblings ...)
  2024-06-03 16:01 ` [PATCH 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
@ 2024-06-03 16:01 ` Jack Bond-Preston
  2024-06-03 16:30   ` Jack Bond-Preston
  2024-06-03 16:01 ` [PATCH 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 16:01 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently EVP auth ctxs (e.g. EVP_MD_CTX, EVP_MAC_CTX) are allocated,
copied to (from openssl_session), and then freed for every auth
operation (ie. per packet). This is very inefficient, and avoidable.

Make each openssl_session hold an array of structures, containing
pointers to per-queue-pair cipher and auth context copies. These are
populated on first use by allocating a new context and copying from the
main context. These copies can then be used in a thread-safe manner by
different worker lcores simultaneously. Consequently the auth context
allocation and copy only has to happen once - the first time a given qp
uses an openssl_session. This brings about a large performance boost.

Throughput performance uplift measurements for HMAC-SHA1 generate on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.63 |               1.42 |   123.5% |
|             256 |          2.24 |               4.40 |    96.4% |
|            1024 |          6.15 |               9.26 |    50.6% |
|            2048 |          8.68 |              11.38 |    31.1% |
|            4096 |         10.92 |              12.84 |    17.6% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.93 |              11.35 |  1122.5% |
|             256 |          3.70 |              35.30 |   853.7% |
|            1024 |         15.22 |              74.27 |   387.8% |
|            2048 |         30.20 |              91.08 |   201.6% |
|            4096 |         56.92 |             102.76 |    80.5% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/compat.h              |  26 ++++
 drivers/crypto/openssl/openssl_pmd_private.h |  25 +++-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 144 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |   7 +-
 4 files changed, 161 insertions(+), 41 deletions(-)

diff --git a/drivers/crypto/openssl/compat.h b/drivers/crypto/openssl/compat.h
index 9f9167c4f1..4c5ddfbf3a 100644
--- a/drivers/crypto/openssl/compat.h
+++ b/drivers/crypto/openssl/compat.h
@@ -5,6 +5,32 @@
 #ifndef __RTA_COMPAT_H__
 #define __RTA_COMPAT_H__
 
+#if OPENSSL_VERSION_NUMBER > 0x30000000L
+static __rte_always_inline void
+free_hmac_ctx(EVP_MAC_CTX *ctx)
+{
+	EVP_MAC_CTX_free(ctx);
+}
+
+static __rte_always_inline void
+free_cmac_ctx(EVP_MAC_CTX *ctx)
+{
+	EVP_MAC_CTX_free(ctx);
+}
+#else
+static __rte_always_inline void
+free_hmac_ctx(HMAC_CTX *ctx)
+{
+	HMAC_CTX_free(ctx);
+}
+
+static __rte_always_inline void
+free_cmac_ctx(CMAC_CTX *ctx)
+{
+	CMAC_CTX_free(ctx);
+}
+#endif
+
 #if (OPENSSL_VERSION_NUMBER < 0x10100000L)
 
 static __rte_always_inline int
diff --git a/drivers/crypto/openssl/openssl_pmd_private.h b/drivers/crypto/openssl/openssl_pmd_private.h
index bad7dcf2f5..c3740ccc62 100644
--- a/drivers/crypto/openssl/openssl_pmd_private.h
+++ b/drivers/crypto/openssl/openssl_pmd_private.h
@@ -80,6 +80,20 @@ struct __rte_cache_aligned openssl_qp {
 	 */
 };
 
+struct evp_ctx_pair {
+	EVP_CIPHER_CTX *cipher;
+	union {
+		EVP_MD_CTX *auth;
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		EVP_MAC_CTX *hmac;
+		EVP_MAC_CTX *cmac;
+#else
+		HMAC_CTX hmac;
+		CMAC_CTX cmac;
+#endif
+	};
+};
+
 /** OPENSSL crypto private session structure */
 struct __rte_cache_aligned openssl_session {
 	enum openssl_chain_order chain_order;
@@ -168,11 +182,12 @@ struct __rte_cache_aligned openssl_session {
 
 	uint16_t ctx_copies_len;
 	/* < number of entries in ctx_copies */
-	EVP_CIPHER_CTX *qp_ctx[];
-	/**< Flexible array member of per-queue-pair pointers to copies of EVP
-	 * context structure. Cipher contexts are not safe to use from multiple
-	 * cores simultaneously, so maintaining these copies allows avoiding
-	 * per-buffer copying into a temporary context.
+	struct evp_ctx_pair qp_ctx[];
+	/**< Flexible array member of per-queue-pair structures, each containing
+	 * pointers to copies of the cipher and auth EVP contexts. Cipher
+	 * contexts are not safe to use from multiple cores simultaneously, so
+	 * maintaining these copies allows avoiding per-buffer copying into a
+	 * temporary context.
 	 */
 };
 
diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index ebd1cab667..743b20c5b0 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -864,40 +864,45 @@ openssl_set_session_parameters(struct openssl_session *sess,
 void
 openssl_reset_session(struct openssl_session *sess)
 {
+	/* Free all the qp_ctx entries. */
 	for (uint16_t i = 0; i < sess->ctx_copies_len; i++) {
-		if (sess->qp_ctx[i] != NULL) {
-			EVP_CIPHER_CTX_free(sess->qp_ctx[i]);
-			sess->qp_ctx[i] = NULL;
+		if (sess->qp_ctx[i].cipher != NULL) {
+			EVP_CIPHER_CTX_free(sess->qp_ctx[i].cipher);
+			sess->qp_ctx[i].cipher = NULL;
+		}
+
+		switch (sess->auth.mode) {
+		case OPENSSL_AUTH_AS_AUTH:
+			EVP_MD_CTX_destroy(sess->qp_ctx[i].auth);
+			sess->qp_ctx[i].auth = NULL;
+			break;
+		case OPENSSL_AUTH_AS_HMAC:
+			free_hmac_ctx(sess->qp_ctx[i].hmac);
+			sess->qp_ctx[i].hmac = NULL;
+			break;
+		case OPENSSL_AUTH_AS_CMAC:
+			free_cmac_ctx(sess->qp_ctx[i].cmac);
+			sess->qp_ctx[i].cmac = NULL;
+			break;
 		}
 	}
 
 	EVP_CIPHER_CTX_free(sess->cipher.ctx);
 
-	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
-		EVP_CIPHER_CTX_free(sess->cipher.bpi_ctx);
-
 	switch (sess->auth.mode) {
 	case OPENSSL_AUTH_AS_AUTH:
 		EVP_MD_CTX_destroy(sess->auth.auth.ctx);
 		break;
 	case OPENSSL_AUTH_AS_HMAC:
-		EVP_PKEY_free(sess->auth.hmac.pkey);
-# if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		EVP_MAC_CTX_free(sess->auth.hmac.ctx);
-# else
-		HMAC_CTX_free(sess->auth.hmac.ctx);
-# endif
+		free_hmac_ctx(sess->auth.hmac.ctx);
 		break;
 	case OPENSSL_AUTH_AS_CMAC:
-# if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		EVP_MAC_CTX_free(sess->auth.cmac.ctx);
-# else
-		CMAC_CTX_free(sess->auth.cmac.ctx);
-# endif
-		break;
-	default:
+		free_cmac_ctx(sess->auth.cmac.ctx);
 		break;
 	}
+
+	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
+		EVP_CIPHER_CTX_free(sess->cipher.bpi_ctx);
 }
 
 /** Provide session for operation */
@@ -1443,6 +1448,9 @@ process_openssl_auth_mac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 	if (m == 0)
 		goto process_auth_err;
 
+	if (EVP_MAC_init(ctx, NULL, 0, NULL) <= 0)
+		goto process_auth_err;
+
 	src = rte_pktmbuf_mtod_offset(m, uint8_t *, offset);
 
 	l = rte_pktmbuf_data_len(m) - offset;
@@ -1469,11 +1477,9 @@ process_openssl_auth_mac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 	if (EVP_MAC_final(ctx, dst, &dstlen, DIGEST_LENGTH_MAX) != 1)
 		goto process_auth_err;
 
-	EVP_MAC_CTX_free(ctx);
 	return 0;
 
 process_auth_err:
-	EVP_MAC_CTX_free(ctx);
 	OPENSSL_LOG(ERR, "Process openssl auth failed");
 	return -EINVAL;
 }
@@ -1592,7 +1598,7 @@ get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
 	if (sess->ctx_copies_len == 0)
 		return sess->cipher.ctx;
 
-	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id];
+	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id].cipher;
 
 	if (unlikely(*lctx == NULL)) {
 #if OPENSSL_VERSION_NUMBER >= 0x30200000L
@@ -1607,6 +1613,86 @@ get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
 	return *lctx;
 }
 
+static inline EVP_MD_CTX *
+get_local_auth_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	/* If the array is not being used, just return the main context. */
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.auth.ctx;
+
+	EVP_MD_CTX **lctx = &sess->qp_ctx[qp->id].auth;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30100000L
+		/* EVP_MD_CTX_dup() added in OSSL 3.1 */
+		*lctx = EVP_MD_CTX_dup(sess->auth.auth.ctx);
+#else
+		*lctx = EVP_MD_CTX_new();
+		EVP_MD_CTX_copy(*lctx, sess->auth.auth.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+static inline EVP_MAC_CTX *
+#else
+static inline HMAC_CTX *
+#endif
+get_local_hmac_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.hmac.ctx;
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+	EVP_MAC_CTX **lctx =
+#else
+	HMAC_CTX **lctx =
+#endif
+		&sess->qp_ctx[qp->id].hmac;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		*lctx = EVP_MAC_CTX_dup(sess->auth.hmac.ctx);
+#else
+		*lctx = HMAC_CTX_new();
+		HMAC_CTX_copy(*lctx, sess->auth.hmac.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+static inline EVP_MAC_CTX *
+#else
+static inline CMAC_CTX *
+#endif
+get_local_cmac_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.cmac.ctx;
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+	EVP_MAC_CTX **lctx =
+#else
+	CMAC_CTX **lctx =
+#endif
+		&sess->qp_ctx[qp->id].cmac;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		*lctx = EVP_MAC_CTX_dup(sess->auth.cmac.ctx);
+#else
+		*lctx = CMAC_CTX_new();
+		CMAC_CTX_copy(*lctx, sess->auth.cmac.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
 /** Process auth/cipher combined operation */
 static void
 process_openssl_combined_op(struct openssl_qp *qp, struct rte_crypto_op *op,
@@ -1855,41 +1941,33 @@ process_openssl_auth_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 
 	switch (sess->auth.mode) {
 	case OPENSSL_AUTH_AS_AUTH:
-		ctx_a = EVP_MD_CTX_create();
-		EVP_MD_CTX_copy_ex(ctx_a, sess->auth.auth.ctx);
+		ctx_a = get_local_auth_ctx(sess, qp);
 		status = process_openssl_auth(mbuf_src, dst,
 				op->sym->auth.data.offset, NULL, NULL, srclen,
 				ctx_a, sess->auth.auth.evp_algo);
-		EVP_MD_CTX_destroy(ctx_a);
 		break;
 	case OPENSSL_AUTH_AS_HMAC:
+		ctx_h = get_local_hmac_ctx(sess, qp);
 # if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		ctx_h = EVP_MAC_CTX_dup(sess->auth.hmac.ctx);
 		status = process_openssl_auth_mac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_h);
 # else
-		ctx_h = HMAC_CTX_new();
-		HMAC_CTX_copy(ctx_h, sess->auth.hmac.ctx);
 		status = process_openssl_auth_hmac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_h);
-		HMAC_CTX_free(ctx_h);
 # endif
 		break;
 	case OPENSSL_AUTH_AS_CMAC:
+		ctx_c = get_local_cmac_ctx(sess, qp);
 # if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		ctx_c = EVP_MAC_CTX_dup(sess->auth.cmac.ctx);
 		status = process_openssl_auth_mac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_c);
 # else
-		ctx_c = CMAC_CTX_new();
-		CMAC_CTX_copy(ctx_c, sess->auth.cmac.ctx);
 		status = process_openssl_auth_cmac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_c);
-		CMAC_CTX_free(ctx_c);
 # endif
 		break;
 	default:
diff --git a/drivers/crypto/openssl/rte_openssl_pmd_ops.c b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
index 4209c6ab6f..1bbb855a59 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd_ops.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
@@ -805,7 +805,7 @@ openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 		unsigned int max_nb_qps = ((struct openssl_private *)
 				dev->data->dev_private)->max_nb_qpairs;
 		return sizeof(struct openssl_session) +
-				(sizeof(void *) * max_nb_qps);
+				(sizeof(struct evp_ctx_pair) * max_nb_qps);
 	}
 
 	/*
@@ -818,10 +818,11 @@ openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 
 	/*
 	 * Otherwise, the size of the flexible array member should be enough to
-	 * fit pointers to per-qp contexts.
+	 * fit pointers to per-qp contexts. This is twice the number of queue
+	 * pairs, to allow for auth and cipher contexts.
 	 */
 	return sizeof(struct openssl_session) +
-		(sizeof(void *) * dev->data->nb_queue_pairs);
+		(sizeof(struct evp_ctx_pair) * dev->data->nb_queue_pairs);
 }
 
 /** Returns the size of the asymmetric session structure */
-- 
2.34.1



* [PATCH 5/5] crypto/openssl: only set cipher padding once
  2024-06-03 16:01 [PATCH 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                   ` (3 preceding siblings ...)
  2024-06-03 16:01 ` [PATCH 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
@ 2024-06-03 16:01 ` Jack Bond-Preston
  2024-06-03 18:43 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 16:01 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Setting the cipher padding has a noticeable performance footprint,
and it doesn't need to be done for every call to
process_openssl_cipher_{en,de}crypt(). Setting it causes OpenSSL to set
it on every future context re-init. Thus, for every buffer after the
first one, the padding is being set twice.

Instead, just set the cipher padding once - when configuring the session
parameters - avoiding the unnecessary double setting behaviour. This is
skipped for AEAD ciphers, where disabling padding is not necessary.

Throughput performance uplift measurements for AES-CBC-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.97 |               3.72 |    25.2% |
|             256 |          8.10 |               9.42 |    16.3% |
|            1024 |         14.22 |              15.18 |     6.8% |
|            2048 |         16.28 |              16.93 |     4.0% |
|            4096 |         17.58 |              17.97 |     2.2% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         21.27 |              29.85 |    40.3% |
|             256 |         60.05 |              75.53 |    25.8% |
|            1024 |        110.11 |             121.56 |    10.4% |
|            2048 |        128.05 |             135.40 |     5.7% |
|            4096 |        139.45 |             143.76 |     3.1% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 743b20c5b0..f0f5082769 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -595,6 +595,8 @@ openssl_set_session_cipher_parameters(struct openssl_session *sess,
 		return -ENOTSUP;
 	}
 
+	EVP_CIPHER_CTX_set_padding(sess->cipher.ctx, 0);
+
 	return 0;
 }
 
@@ -1096,8 +1098,6 @@ process_openssl_cipher_encrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	if (EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, iv) <= 0)
 		goto process_cipher_encrypt_err;
 
-	EVP_CIPHER_CTX_set_padding(ctx, 0);
-
 	if (process_openssl_encryption_update(mbuf_src, offset, &dst,
 			srclen, ctx, inplace))
 		goto process_cipher_encrypt_err;
@@ -1146,8 +1146,6 @@ process_openssl_cipher_decrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	if (EVP_DecryptInit_ex(ctx, NULL, NULL, NULL, iv) <= 0)
 		goto process_cipher_decrypt_err;
 
-	EVP_CIPHER_CTX_set_padding(ctx, 0);
-
 	if (process_openssl_decryption_update(mbuf_src, offset, &dst,
 			srclen, ctx, inplace))
 		goto process_cipher_decrypt_err;
-- 
2.34.1



* Re: [PATCH 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs
  2024-06-03 16:01 ` [PATCH 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
@ 2024-06-03 16:12   ` Jack Bond-Preston
  0 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 16:12 UTC (permalink / raw)
  To: dev

On 03/06/2024 17:01, Jack Bond-Preston wrote:
> <snip>
> +	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
> +	EVP_CIPHER_CTX_copy(ctx, sess->cipher.ctx);
> +
> <snip>
This, and other patches in the set, are throwing a checkpatch error:
> _coding style issues_
> 
> 
> ERROR:SPACING: need consistent spacing around '*' (ctx:WxV)
> #99: FILE: drivers/crypto/openssl/rte_openssl_pmd.c:1593:
> +	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
>  	               ^
> 
> total: 1 errors, 0 warnings, 41 lines checked
I believe this is a false positive - could it be that checkpatch doesn't 
recognise the type EVP_CIPHER_CTX and thinks the '*' is multiplication?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/5] crypto/openssl: per-qp auth context clones
  2024-06-03 16:01 ` [PATCH 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
@ 2024-06-03 16:30   ` Jack Bond-Preston
  0 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 16:30 UTC (permalink / raw)
  To: dev

On 03/06/2024 17:01, Jack Bond-Preston wrote:
> diff --git a/drivers/crypto/openssl/openssl_pmd_private.h b/drivers/crypto/openssl/openssl_pmd_private.h
> index bad7dcf2f5..c3740ccc62 100644
> --- a/drivers/crypto/openssl/openssl_pmd_private.h
> +++ b/drivers/crypto/openssl/openssl_pmd_private.h
> @@ -80,6 +80,20 @@ struct __rte_cache_aligned openssl_qp {
>   	 */
>   };
>   
> +struct evp_ctx_pair {
> +	EVP_CIPHER_CTX *cipher;
> +	union {
> +		EVP_MD_CTX *auth;
> +#if OPENSSL_VERSION_NUMBER >= 0x30000000L
> +		EVP_MAC_CTX *hmac;
> +		EVP_MAC_CTX *cmac;
> +#else
> +		HMAC_CTX hmac;
> +		CMAC_CTX cmac;
> +#endif
> +	};
> +};
> +

HMAC_CTX and CMAC_CTX should be pointers; this is causing CI failures 
for older OpenSSL versions.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 0/5] OpenSSL PMD Optimisations
  2024-06-03 16:01 [PATCH 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                   ` (4 preceding siblings ...)
  2024-06-03 16:01 ` [PATCH 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
@ 2024-06-03 18:43 ` Jack Bond-Preston
  2024-06-03 18:43   ` [PATCH v2 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
                     ` (4 more replies)
  2024-06-03 18:59 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                   ` (3 subsequent siblings)
  9 siblings, 5 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:43 UTC (permalink / raw)
  Cc: dev

v2:
* Fixed missing * in patch 4 causing compilation failures.

---

The current implementation of the OpenSSL PMD has numerous performance issues.
These revolve around certain operations being performed on a per buffer/packet
basis, when they in fact could be performed less often - usually just during
initialisation.


[1/5]: fix GCM and CCM thread unsafe ctxs
=========================================
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is
implemented in the same naive (and inefficient) way as existing fixes for other
ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
===========================================
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR.


[5/5]: only set cipher padding once
=====================================
Fixes an inefficient usage of the OpenSSL API when disabling padding for
ciphers. This behaviour was introduced in commit 6b283a03216e ("crypto/openssl:
fix extra bytes written at end of data"), which fixes a bug - however, the
EVP_CIPHER_CTX_set_padding() call was placed in a suboptimal location.

This patch fixes this, preventing the padding being disabled for the cipher
twice per buffer (with the second essentially being a wasteful no-op).


[3/5] and [4/5]: per-queue-pair context clones
==============================================
[3/5] and [4/5] aim to fix the key issue that was identified with the
performance of the OpenSSL PMD - cloning of OpenSSL CTX structures on a
per-buffer basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session contexts
>
>     Session contexts are used for temporary storage when processing a
>     packet.
>     If packets for the same session are to be processed simultaneously on
>     multiple cores, separate contexts must be used.
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no longer be defined as a
>     variable on the stack: it must be allocated. This in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be used
from multiple threads simultaneously, so this patch is required for correctness
(assuming the need to support using the same openssl_session across multiple
lcores). The downside here is that, as the commit message notes, this does
reduce performance quite significantly.

It is worth noting that while ciphers were already correctly cloned for cipher
ops and auth ops, this behaviour was actually absent for combined ops (AES-GCM
and AES-CCM), due to this part of the fix being reverted in 75adf1eae44f
("crypto/openssl: update HMAC routine with 3.0 EVP API"). [1/5] addressed this
issue of correctness, and [3/5] implements a more performant fix on top of this.

These two patches aim to remedy the performance loss caused by the introduction
of cipher context cloning. An approach of maintaining an array of pointers,
inside the OpenSSL session structure, to per-queue-pair clones of the OpenSSL
CTXs is used. Consequently, there is no need to perform cloning of the context
for every buffer - whilst keeping the guarantee that one context is not being
used on multiple lcores simultaneously. The cloning of the main context into the
array's per-qp context entries is performed lazily/as-needed. There are some
trade-offs/judgement calls that were made:
 - The first call for a queue pair for an op from a given openssl_session will
   be roughly equivalent to an op from the existing implementation. However, all
   subsequent calls for the same openssl_session on the same queue pair will not
   incur this extra work. Thus, whilst the first op on a session on a queue pair
   will be slower than subsequent ones, this slower first op is still equivalent
   to *every* op without these patches. The alternative would be pre-populating
   this array when the openssl_session is initialised, but this would waste
   memory and processing time if not all queue pairs end up doing work from this
   openssl_session.
 - Each pointer inside the array of per-queue-pair pointers has not been cache
   aligned, because updates only occur on the first buffer per-queue-pair
   per-session, making the impact of false sharing negligible compared to the
   extra memory usage of the alignment.

[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and [4/5]
for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).
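
As a rough illustration of this lazy per-qp lookup, the core of the helper
added in [3/5] is shaped like the sketch below (simplified: the real patch
also uses EVP_CIPHER_CTX_dup() on OpenSSL >= 3.2, and [4/5] later turns
qp_ctx into an array of cipher/auth context pairs):

  static inline EVP_CIPHER_CTX *
  get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
  {
      /* Single-qp case: the main context is never shared, use it directly. */
      if (sess->ctx_copies_len == 0)
          return sess->cipher.ctx;

      EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id];

      /* First op for this session on this qp: clone the main context once. */
      if (unlikely(*lctx == NULL)) {
          *lctx = EVP_CIPHER_CTX_new();
          EVP_CIPHER_CTX_copy(*lctx, sess->cipher.ctx);
      }

      /* Every subsequent op on this qp reuses the same clone. */
      return *lctx;
  }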

Compared to before, this approach comes with a drawback of extra memory usage -
the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, with a length
  equal to the number of qps in use multiplied by 2 (to allow auth and cipher
  contexts), per openssl_session structure. openssl_pmd_sym_session_get_size()
  is modified to return a size large enough to support this. At the time this
  function is called (before the user creates the session mempool), the PMD may
  not yet be configured with the requested number of queue pairs. In this case,
  the maximum number of queue pairs allowed by the PMD (current default is 8) is
  used, to ensure the allocations will be large enough. Thus, the user may be
  able to slightly reduce the memory used by OpenSSL sessions by first
  configuring the PMD's queue pair count, then requesting the size of the
  sessions and creating the session mempool. There is also a special case where
  the number of queue pairs is 1, in which case the array is not allocated or
  used at all. Overall, this memory usage by the session structure itself is
  worst-case 128 bytes per session (the default maximum number of queue pairs
  allowed by the OpenSSL PMD is 8, so 8qps * 8bytes * 2ctxs), plus the extra
  space to store the length of the array and auth context offset, resulting in
  an increase in total size from 152 bytes to 280 bytes.
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously, the
  clones were allocated and freed per-operation, meaning the lifetime of the
  allocations was only the duration of the operation. Now, these allocations are
  lifted out to share the lifetime of the session. This results in situations
  with many long-lived sessions shared across many queue pairs causing an
  increase in total memory usage.


Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
 - The version of OpenSSL used was 3.3.0
 - The hardware used for the benchmarks was the following two machine configs:
     * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
     * x86    : Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
 - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024, 2048,
   4096, 8192.
 - The worker lcore counts tested were: 1, 2, 4, 8
 - The algorithms and associated operations tested were:
     * Cipher-only       AES-CBC-128           (Encrypt and Decrypt)
     * Cipher-only       3DES-CTR-128          (Encrypt only)
     * Auth-only         SHA1-HMAC             (Generate only)
     * Auth-only         AES-CMAC              (Generate only)
     * AESNI             AES-GCM-128           (Encrypt and Decrypt)
     * Cipher-then-Auth  AES-CBC-128-HMAC-SHA1 (Encrypt only)
 - EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.

The sets of patches applied for benchmarks were:
 - No patches applied (HEAD of upstream main)
 -   [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
 - [1-2/5] applied (adds 3DES-CTR fix)
 - [1-3/5] applied (adds per-qp cipher contexts)
 - [1-4/5] applied (adds per-qp auth contexts)
 - [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm platform,
with all patches applied. Very similar results were achieved on the Intel
platform, and the full set of results, including the Intel ones, is available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput speedup achieved between the base (main branch
HEAD) and optimised (all patches applied) versions of the PMD was carried out,
with the varying worker lcore counts.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.84 |               2.04 |   144.6% |
|              64 |          1.61 |               3.72 |   131.3% |
|             128 |          2.97 |               6.24 |   110.2% |
|             256 |          5.14 |               9.42 |    83.2% |
|             512 |          8.10 |              12.62 |    55.7% |
|            1024 |         11.37 |              15.18 |    33.5% |
|            2048 |         14.26 |              16.93 |    18.7% |
|            4096 |         16.35 |              17.97 |     9.9% |
|            8192 |         17.61 |              18.51 |     5.1% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.53 |              16.49 |   974.8% |
|              64 |          3.04 |              29.85 |   881.3% |
|             128 |          5.96 |              50.07 |   739.8% |
|             256 |         10.54 |              75.53 |   616.5% |
|             512 |         21.60 |             101.14 |   368.2% |
|            1024 |         41.27 |             121.56 |   194.6% |
|            2048 |         72.99 |             135.40 |    85.5% |
|            4096 |        103.39 |             143.76 |    39.0% |
|            8192 |        125.48 |             148.06 |    18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so profiling of the existing
PMD implementation with multiple lcores was performed. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing this context. OpenSSL holds only one instance of each
EVP_CIPHER, and uses a reference counter to track freeing them. This means that
the original implementation spends a very high amount of time incrementing and
decrementing this reference counter in EVP_CIPHER_CTX_copy and
EVP_CIPHER_CTX_free, respectively. For small buffer sizes, and with more lcores,
this reference count modification happens extremely frequently - thrashing this
refcount on all lcores and causing a huge slowdown. The optimised version avoids
this by not performing the copy and free (and thus associated refcount
modifications) on every buffer.
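
To make this concrete, the per-buffer pattern in the original code path is
essentially the following (abridged from the pre-patch
process_openssl_cipher_op(); error handling omitted). It is this copy/free
pair - and the refcount traffic it generates on the shared EVP_CIPHER - that
the per-qp clones remove for every buffer after the first:

  /* Pre-patch: executed for every buffer, on every lcore. */
  EVP_CIPHER_CTX *ctx_copy = EVP_CIPHER_CTX_new();
  EVP_CIPHER_CTX_copy(ctx_copy, sess->cipher.ctx); /* takes a reference on the shared EVP_CIPHER */
  /* ... perform the cipher operation using ctx_copy ... */
  EVP_CIPHER_CTX_free(ctx_copy);                   /* drops that reference again */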

SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.32 |               0.76 |   135.9% |
|              64 |          0.63 |               1.43 |   126.9% |
|             128 |          1.21 |               2.60 |   115.4% |
|             256 |          2.23 |               4.42 |    98.1% |
|             512 |          3.88 |               6.80 |    75.5% |
|            1024 |          6.13 |               9.30 |    51.8% |
|            2048 |          8.65 |              11.39 |    31.7% |
|            4096 |         10.90 |              12.85 |    17.9% |
|            8192 |         12.54 |              13.74 |     9.5% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.49 |               5.99 |  1110.3% |
|              64 |          0.98 |              11.30 |  1051.8% |
|             128 |          1.95 |              20.67 |   960.3% |
|             256 |          3.90 |              35.18 |   802.4% |
|             512 |          7.83 |              54.13 |   590.9% |
|            1024 |         15.80 |              74.11 |   369.2% |
|            2048 |         31.30 |              90.97 |   190.6% |
|            4096 |         58.59 |             102.70 |    75.3% |
|            8192 |         85.93 |             109.88 |    27.9% |

We can see that the results are similar to those for AES-CBC-128 cipher
operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] causes a slowdown in AES-GCM, as the fix for
the concurrency bug introduces a large overhead.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

However, applying [3/5] rectifies most of this performance drop, as shown by the
following results with it applied.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.39 |               1.28 |    -7.8% |
|              64 |          2.60 |               2.44 |    -6.2% |
|             128 |          4.77 |               4.45 |    -6.8% |
|             256 |          7.69 |               7.22 |    -6.1% |
|             512 |         11.31 |              10.97 |    -3.0% |
|            1024 |         15.33 |              15.07 |    -1.7% |
|            2048 |         18.74 |              18.51 |    -1.2% |
|            4096 |         21.11 |              20.96 |    -0.7% |
|            8192 |         22.55 |              22.50 |    -0.2% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |         10.59 |              10.35 |    -2.3% |
|              64 |         19.94 |              19.46 |    -2.4% |
|             128 |         36.32 |              35.64 |    -1.9% |
|             256 |         58.84 |              57.80 |    -1.8% |
|             512 |         87.38 |              87.37 |    -0.0% |
|            1024 |        119.71 |             120.22 |     0.4% |
|            2048 |        147.69 |             147.93 |     0.2% |
|            4096 |        167.39 |             167.48 |     0.1% |
|            8192 |        179.80 |             179.87 |     0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small slowdown
at smaller buffer sizes. This represents the overhead required to make AES-GCM
thread safe. These patches have rectified this lack of safety without causing a
significant performance impact, especially compared to naive per-buffer cipher
context cloning.

3DES-CTR Encrypt
----------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.12 |               0.22 |    89.7% |
|              64 |          0.16 |               0.22 |    43.6% |
|             128 |          0.18 |               0.23 |    22.3% |
|             256 |          0.20 |               0.23 |    10.8% |
|             512 |          0.21 |               0.23 |     5.1% |
|            1024 |          0.22 |               0.23 |     2.7% |
|            2048 |          0.22 |               0.23 |     1.3% |
|            4096 |          0.23 |               0.23 |     0.4% |
|            8192 |          0.23 |               0.23 |     0.4% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.68 |               1.77 |   160.1% |
|              64 |          1.00 |               1.78 |    78.3% |
|             128 |          1.29 |               1.80 |    39.6% |
|             256 |          1.50 |               1.80 |    19.8% |
|             512 |          1.64 |               1.80 |    10.0% |
|            1024 |          1.72 |               1.81 |     5.1% |
|            2048 |          1.76 |               1.81 |     2.7% |
|            4096 |          1.78 |               1.81 |     1.5% |
|            8192 |          1.80 |               1.81 |     0.7% |

[2/5] yields good results - the performance increase is high for smaller buffer
sizes, where the cost of re-initialising the extra parameters is more
significant relative to the cost of the cipher operation itself.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated
speedup tables, plus additional bar charts showing the throughput comparison
across different sets of applied patches) - for both Intel and Arm platforms -
are available. However, I'm not sure of the etiquette regarding attachments of
such files, so I haven't attached them for now. If you are interested in
reviewing them, please reach out and I will find a way to get them to you.

Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 244 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 ++-
 4 files changed, 260 insertions(+), 71 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs
  2024-06-03 18:43 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
@ 2024-06-03 18:43   ` Jack Bond-Preston
  2024-06-03 18:43   ` [PATCH v2 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:43 UTC (permalink / raw)
  To: Kai Ji, Fan Zhang, Akhil Goyal; +Cc: dev, stable, Wathsala Vithanage

Commit 67ab783b5d70 ("crypto/openssl: use local copy for session
contexts") introduced a fix for concurrency bugs which could occur when
using one OpenSSL PMD session across multiple cores simultaneously. The
solution was to clone the EVP contexts per-buffer to avoid them being
used concurrently.

However, part of commit 75adf1eae44f ("crypto/openssl: update HMAC
routine with 3.0 EVP API") reverted this fix, only for combined ops
(AES-GCM and AES-CCM), with no explanation. This commit fixes the issue
again, essentially reverting this part of the commit.

Throughput performance uplift measurements for AES-GCM-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

Fixes: 75adf1eae44f ("crypto/openssl: update HMAC routine with 3.0 EVP API")
Cc: stable@dpdk.org
Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index e8cb09defc..ca7ed30ec4 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -1590,6 +1590,9 @@ process_openssl_combined_op
 		return;
 	}
 
+	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
+	EVP_CIPHER_CTX_copy(ctx, sess->cipher.ctx);
+
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
 	if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC) {
@@ -1623,12 +1626,12 @@ process_openssl_combined_op
 			status = process_openssl_auth_encryption_gcm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, sess->cipher.ctx);
+					dst, tag, ctx);
 		else
 			status = process_openssl_auth_encryption_ccm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, taglen, sess->cipher.ctx);
+					dst, tag, taglen, ctx);
 
 	} else {
 		if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC ||
@@ -1636,14 +1639,16 @@ process_openssl_combined_op
 			status = process_openssl_auth_decryption_gcm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, sess->cipher.ctx);
+					dst, tag, ctx);
 		else
 			status = process_openssl_auth_decryption_ccm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, taglen, sess->cipher.ctx);
+					dst, tag, taglen, ctx);
 	}
 
+	EVP_CIPHER_CTX_free(ctx);
+
 	if (status != 0) {
 		if (status == (-EFAULT) &&
 				sess->auth.operation ==
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 2/5] crypto/openssl: only init 3DES-CTR key + impl once
  2024-06-03 18:43 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
  2024-06-03 18:43   ` [PATCH v2 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
@ 2024-06-03 18:43   ` Jack Bond-Preston
  2024-06-03 18:43   ` [PATCH v2 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:43 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently the 3DES-CTR cipher context is initialised for every buffer,
setting the cipher implementation and key - even though for every
buffer in the session these values will be the same.

Change to initialising the cipher context once, before any buffers are
processed, instead.

Throughput performance uplift measurements for 3DES-CTR encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.16 |               0.21 |    35.3% |
|             256 |          0.20 |               0.22 |     9.4% |
|            1024 |          0.22 |               0.23 |     2.3% |
|            2048 |          0.22 |               0.23 |     0.9% |
|            4096 |          0.22 |               0.23 |     0.9% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          1.01 |               1.34 |    32.9% |
|             256 |          1.51 |               1.66 |     9.9% |
|            1024 |          1.72 |               1.77 |     2.6% |
|            2048 |          1.76 |               1.78 |     1.1% |
|            4096 |          1.79 |               1.80 |     0.6% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index ca7ed30ec4..175ffda2b9 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -521,6 +521,15 @@ openssl_set_session_cipher_parameters(struct openssl_session *sess,
 				sess->cipher.key.length,
 				sess->cipher.key.data) != 0)
 			return -EINVAL;
+
+
+		/* We use 3DES encryption also for decryption.
+		 * IV is not important for 3DES ECB.
+		 */
+		if (EVP_EncryptInit_ex(sess->cipher.ctx, EVP_des_ede3_ecb(),
+				NULL, sess->cipher.key.data,  NULL) != 1)
+			return -EINVAL;
+
 		break;
 
 	case RTE_CRYPTO_CIPHER_DES_CBC:
@@ -1136,8 +1145,7 @@ process_openssl_cipher_decrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 /** Process cipher des 3 ctr encryption, decryption algorithm */
 static int
 process_openssl_cipher_des3ctr(struct rte_mbuf *mbuf_src, uint8_t *dst,
-		int offset, uint8_t *iv, uint8_t *key, int srclen,
-		EVP_CIPHER_CTX *ctx)
+		int offset, uint8_t *iv, int srclen, EVP_CIPHER_CTX *ctx)
 {
 	uint8_t ebuf[8], ctr[8];
 	int unused, n;
@@ -1155,12 +1163,6 @@ process_openssl_cipher_des3ctr(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	src = rte_pktmbuf_mtod_offset(m, uint8_t *, offset);
 	l = rte_pktmbuf_data_len(m) - offset;
 
-	/* We use 3DES encryption also for decryption.
-	 * IV is not important for 3DES ecb
-	 */
-	if (EVP_EncryptInit_ex(ctx, EVP_des_ede3_ecb(), NULL, key, NULL) <= 0)
-		goto process_cipher_des3ctr_err;
-
 	memcpy(ctr, iv, 8);
 
 	for (n = 0; n < srclen; n++) {
@@ -1701,8 +1703,7 @@ process_openssl_cipher_op
 					srclen, ctx_copy, inplace);
 	else
 		status = process_openssl_cipher_des3ctr(mbuf_src, dst,
-				op->sym->cipher.data.offset, iv,
-				sess->cipher.key.data, srclen,
+				op->sym->cipher.data.offset, iv, srclen,
 				ctx_copy);
 
 	EVP_CIPHER_CTX_free(ctx_copy);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 3/5] crypto/openssl: per-qp cipher context clones
  2024-06-03 18:43 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
  2024-06-03 18:43   ` [PATCH v2 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
  2024-06-03 18:43   ` [PATCH v2 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
@ 2024-06-03 18:43   ` Jack Bond-Preston
  2024-06-03 18:43   ` [PATCH v2 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
  2024-06-03 18:43   ` [PATCH v2 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:43 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently EVP_CIPHER_CTXs are allocated, copied to (from
openssl_session), and then freed for every cipher operation (ie. per
packet). This is very inefficient, and avoidable.

Make each openssl_session hold an array of pointers to per-queue-pair
cipher context copies. These are populated on first use by allocating a
new context and copying from the main context. These copies can then be
used in a thread-safe manner by different worker lcores simultaneously.
Consequently the cipher context allocation and copy only has to happen
once - the first time a given qp uses an openssl_session. This brings
about a large performance boost.

Throughput performance uplift measurements for AES-CBC-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          1.51 |               2.94 |    94.4% |
|             256 |          4.90 |               8.05 |    64.3% |
|            1024 |         11.07 |              14.21 |    28.3% |
|            2048 |         14.03 |              16.28 |    16.0% |
|            4096 |         16.20 |              17.59 |     8.6% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          3.05 |              23.74 |   678.8% |
|             256 |         10.46 |              64.86 |   520.3% |
|            1024 |         40.97 |             113.80 |   177.7% |
|            2048 |         73.25 |             130.21 |    77.8% |
|            4096 |        103.89 |             140.62 |    35.4% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/openssl_pmd_private.h | 11 ++-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 78 ++++++++++++++------
 drivers/crypto/openssl/rte_openssl_pmd_ops.c | 34 ++++++++-
 3 files changed, 94 insertions(+), 29 deletions(-)

diff --git a/drivers/crypto/openssl/openssl_pmd_private.h b/drivers/crypto/openssl/openssl_pmd_private.h
index 0f038b218c..bad7dcf2f5 100644
--- a/drivers/crypto/openssl/openssl_pmd_private.h
+++ b/drivers/crypto/openssl/openssl_pmd_private.h
@@ -166,6 +166,14 @@ struct __rte_cache_aligned openssl_session {
 		/**< digest length */
 	} auth;
 
+	uint16_t ctx_copies_len;
+	/* < number of entries in ctx_copies */
+	EVP_CIPHER_CTX *qp_ctx[];
+	/**< Flexible array member of per-queue-pair pointers to copies of EVP
+	 * context structure. Cipher contexts are not safe to use from multiple
+	 * cores simultaneously, so maintaining these copies allows avoiding
+	 * per-buffer copying into a temporary context.
+	 */
 };
 
 /** OPENSSL crypto private asymmetric session structure */
@@ -217,7 +225,8 @@ struct __rte_cache_aligned openssl_asym_session {
 /** Set and validate OPENSSL crypto session parameters */
 extern int
 openssl_set_session_parameters(struct openssl_session *sess,
-		const struct rte_crypto_sym_xform *xform);
+		const struct rte_crypto_sym_xform *xform,
+		uint16_t nb_queue_pairs);
 
 /** Reset OPENSSL crypto session parameters */
 extern void
diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 175ffda2b9..ebd1cab667 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -788,7 +788,8 @@ openssl_set_session_aead_parameters(struct openssl_session *sess,
 /** Parse crypto xform chain and set private session parameters */
 int
 openssl_set_session_parameters(struct openssl_session *sess,
-		const struct rte_crypto_sym_xform *xform)
+		const struct rte_crypto_sym_xform *xform,
+		uint16_t nb_queue_pairs)
 {
 	const struct rte_crypto_sym_xform *cipher_xform = NULL;
 	const struct rte_crypto_sym_xform *auth_xform = NULL;
@@ -850,6 +851,12 @@ openssl_set_session_parameters(struct openssl_session *sess,
 		}
 	}
 
+	/*
+	 * With only one queue pair, the array of copies is not needed.
+	 * Otherwise, one entry per queue pair is required.
+	 */
+	sess->ctx_copies_len = nb_queue_pairs > 1 ? nb_queue_pairs : 0;
+
 	return 0;
 }
 
@@ -857,6 +864,13 @@ openssl_set_session_parameters(struct openssl_session *sess,
 void
 openssl_reset_session(struct openssl_session *sess)
 {
+	for (uint16_t i = 0; i < sess->ctx_copies_len; i++) {
+		if (sess->qp_ctx[i] != NULL) {
+			EVP_CIPHER_CTX_free(sess->qp_ctx[i]);
+			sess->qp_ctx[i] = NULL;
+		}
+	}
+
 	EVP_CIPHER_CTX_free(sess->cipher.ctx);
 
 	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
@@ -923,7 +937,7 @@ get_session(struct openssl_qp *qp, struct rte_crypto_op *op)
 		sess = (struct openssl_session *)_sess->driver_priv_data;
 
 		if (unlikely(openssl_set_session_parameters(sess,
-				op->sym->xform) != 0)) {
+				op->sym->xform, 1) != 0)) {
 			rte_mempool_put(qp->sess_mp, _sess);
 			sess = NULL;
 		}
@@ -1571,11 +1585,33 @@ process_openssl_auth_cmac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 # endif
 /*----------------------------------------------------------------------------*/
 
+static inline EVP_CIPHER_CTX *
+get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	/* If the array is not being used, just return the main context. */
+	if (sess->ctx_copies_len == 0)
+		return sess->cipher.ctx;
+
+	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id];
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30200000L
+		/* EVP_CIPHER_CTX_dup() added in OSSL 3.2 */
+		*lctx = EVP_CIPHER_CTX_dup(sess->cipher.ctx);
+#else
+		*lctx = EVP_CIPHER_CTX_new();
+		EVP_CIPHER_CTX_copy(*lctx, sess->cipher.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
 /** Process auth/cipher combined operation */
 static void
-process_openssl_combined_op
-		(struct rte_crypto_op *op, struct openssl_session *sess,
-		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+process_openssl_combined_op(struct openssl_qp *qp, struct rte_crypto_op *op,
+		struct openssl_session *sess, struct rte_mbuf *mbuf_src,
+		struct rte_mbuf *mbuf_dst)
 {
 	/* cipher */
 	uint8_t *dst = NULL, *iv, *tag, *aad;
@@ -1592,8 +1628,7 @@ process_openssl_combined_op
 		return;
 	}
 
-	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
-	EVP_CIPHER_CTX_copy(ctx, sess->cipher.ctx);
+	EVP_CIPHER_CTX *ctx = get_local_cipher_ctx(sess, qp);
 
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
@@ -1649,8 +1684,6 @@ process_openssl_combined_op
 					dst, tag, taglen, ctx);
 	}
 
-	EVP_CIPHER_CTX_free(ctx);
-
 	if (status != 0) {
 		if (status == (-EFAULT) &&
 				sess->auth.operation ==
@@ -1663,14 +1696,13 @@ process_openssl_combined_op
 
 /** Process cipher operation */
 static void
-process_openssl_cipher_op
-		(struct rte_crypto_op *op, struct openssl_session *sess,
-		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+process_openssl_cipher_op(struct openssl_qp *qp, struct rte_crypto_op *op,
+		struct openssl_session *sess, struct rte_mbuf *mbuf_src,
+		struct rte_mbuf *mbuf_dst)
 {
 	uint8_t *dst, *iv;
 	int srclen, status;
 	uint8_t inplace = (mbuf_src == mbuf_dst) ? 1 : 0;
-	EVP_CIPHER_CTX *ctx_copy;
 
 	/*
 	 * Segmented OOP destination buffer is not supported for encryption/
@@ -1689,24 +1721,22 @@ process_openssl_cipher_op
 
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
-	ctx_copy = EVP_CIPHER_CTX_new();
-	EVP_CIPHER_CTX_copy(ctx_copy, sess->cipher.ctx);
+
+	EVP_CIPHER_CTX *ctx = get_local_cipher_ctx(sess, qp);
 
 	if (sess->cipher.mode == OPENSSL_CIPHER_LIB)
 		if (sess->cipher.direction == RTE_CRYPTO_CIPHER_OP_ENCRYPT)
 			status = process_openssl_cipher_encrypt(mbuf_src, dst,
 					op->sym->cipher.data.offset, iv,
-					srclen, ctx_copy, inplace);
+					srclen, ctx, inplace);
 		else
 			status = process_openssl_cipher_decrypt(mbuf_src, dst,
 					op->sym->cipher.data.offset, iv,
-					srclen, ctx_copy, inplace);
+					srclen, ctx, inplace);
 	else
 		status = process_openssl_cipher_des3ctr(mbuf_src, dst,
-				op->sym->cipher.data.offset, iv, srclen,
-				ctx_copy);
+				op->sym->cipher.data.offset, iv, srclen, ctx);
 
-	EVP_CIPHER_CTX_free(ctx_copy);
 	if (status != 0)
 		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
 }
@@ -3111,13 +3141,13 @@ process_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 
 	switch (sess->chain_order) {
 	case OPENSSL_CHAIN_ONLY_CIPHER:
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_ONLY_AUTH:
 		process_openssl_auth_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_CIPHER_AUTH:
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		/* OOP */
 		if (msrc != mdst)
 			copy_plaintext(msrc, mdst, op);
@@ -3125,10 +3155,10 @@ process_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 		break;
 	case OPENSSL_CHAIN_AUTH_CIPHER:
 		process_openssl_auth_op(qp, op, sess, msrc, mdst);
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_COMBINED:
-		process_openssl_combined_op(op, sess, msrc, mdst);
+		process_openssl_combined_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_CIPHER_BPI:
 		process_openssl_docsis_bpi_op(op, sess, msrc, mdst);
diff --git a/drivers/crypto/openssl/rte_openssl_pmd_ops.c b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
index b16baaa08f..4209c6ab6f 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd_ops.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
@@ -794,9 +794,34 @@ openssl_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
 
 /** Returns the size of the symmetric session structure */
 static unsigned
-openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev __rte_unused)
+openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 {
-	return sizeof(struct openssl_session);
+	/*
+	 * For 0 qps, return the max size of the session - this is necessary if
+	 * the user calls into this function to create the session mempool,
+	 * without first configuring the number of qps for the cryptodev.
+	 */
+	if (dev->data->nb_queue_pairs == 0) {
+		unsigned int max_nb_qps = ((struct openssl_private *)
+				dev->data->dev_private)->max_nb_qpairs;
+		return sizeof(struct openssl_session) +
+				(sizeof(void *) * max_nb_qps);
+	}
+
+	/*
+	 * With only one queue pair, the thread safety of multiple context
+	 * copies is not necessary, so don't allocate extra memory for the
+	 * array.
+	 */
+	if (dev->data->nb_queue_pairs == 1)
+		return sizeof(struct openssl_session);
+
+	/*
+	 * Otherwise, the size of the flexible array member should be enough to
+	 * fit pointers to per-qp contexts.
+	 */
+	return sizeof(struct openssl_session) +
+		(sizeof(void *) * dev->data->nb_queue_pairs);
 }
 
 /** Returns the size of the asymmetric session structure */
@@ -808,7 +833,7 @@ openssl_pmd_asym_session_get_size(struct rte_cryptodev *dev __rte_unused)
 
 /** Configure the session from a crypto xform chain */
 static int
-openssl_pmd_sym_session_configure(struct rte_cryptodev *dev __rte_unused,
+openssl_pmd_sym_session_configure(struct rte_cryptodev *dev,
 		struct rte_crypto_sym_xform *xform,
 		struct rte_cryptodev_sym_session *sess)
 {
@@ -820,7 +845,8 @@ openssl_pmd_sym_session_configure(struct rte_cryptodev *dev __rte_unused,
 		return -EINVAL;
 	}
 
-	ret = openssl_set_session_parameters(sess_private_data, xform);
+	ret = openssl_set_session_parameters(sess_private_data, xform,
+			dev->data->nb_queue_pairs);
 	if (ret != 0) {
 		OPENSSL_LOG(ERR, "failed configure session parameters");
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 4/5] crypto/openssl: per-qp auth context clones
  2024-06-03 18:43 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                     ` (2 preceding siblings ...)
  2024-06-03 18:43   ` [PATCH v2 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
@ 2024-06-03 18:43   ` Jack Bond-Preston
  2024-06-03 18:43   ` [PATCH v2 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:43 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently EVP auth ctxs (e.g. EVP_MD_CTX, EVP_MAC_CTX) are allocated,
copied to (from openssl_session), and then freed for every auth
operation (ie. per packet). This is very inefficient, and avoidable.

Make each openssl_session hold an array of structures, containing
pointers to per-queue-pair cipher and auth context copies. These are
populated on first use by allocating a new context and copying from the
main context. These copies can then be used in a thread-safe manner by
different worker lcores simultaneously. Consequently the auth context
allocation and copy only has to happen once - the first time a given qp
uses an openssl_session. This brings about a large performance boost.

Throughput performance uplift measurements for HMAC-SHA1 generate on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.63 |               1.42 |   123.5% |
|             256 |          2.24 |               4.40 |    96.4% |
|            1024 |          6.15 |               9.26 |    50.6% |
|            2048 |          8.68 |              11.38 |    31.1% |
|            4096 |         10.92 |              12.84 |    17.6% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.93 |              11.35 |  1122.5% |
|             256 |          3.70 |              35.30 |   853.7% |
|            1024 |         15.22 |              74.27 |   387.8% |
|            2048 |         30.20 |              91.08 |   201.6% |
|            4096 |         56.92 |             102.76 |    80.5% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/compat.h              |  26 ++++
 drivers/crypto/openssl/openssl_pmd_private.h |  25 +++-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 144 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |   7 +-
 4 files changed, 161 insertions(+), 41 deletions(-)

diff --git a/drivers/crypto/openssl/compat.h b/drivers/crypto/openssl/compat.h
index 9f9167c4f1..4c5ddfbf3a 100644
--- a/drivers/crypto/openssl/compat.h
+++ b/drivers/crypto/openssl/compat.h
@@ -5,6 +5,32 @@
 #ifndef __RTA_COMPAT_H__
 #define __RTA_COMPAT_H__
 
+#if OPENSSL_VERSION_NUMBER > 0x30000000L
+static __rte_always_inline void
+free_hmac_ctx(EVP_MAC_CTX *ctx)
+{
+	EVP_MAC_CTX_free(ctx);
+}
+
+static __rte_always_inline void
+free_cmac_ctx(EVP_MAC_CTX *ctx)
+{
+	EVP_MAC_CTX_free(ctx);
+}
+#else
+static __rte_always_inline void
+free_hmac_ctx(HMAC_CTX *ctx)
+{
+	HMAC_CTX_free(ctx);
+}
+
+static __rte_always_inline void
+free_cmac_ctx(CMAC_CTX *ctx)
+{
+	CMAC_CTX_free(ctx);
+}
+#endif
+
 #if (OPENSSL_VERSION_NUMBER < 0x10100000L)
 
 static __rte_always_inline int
diff --git a/drivers/crypto/openssl/openssl_pmd_private.h b/drivers/crypto/openssl/openssl_pmd_private.h
index bad7dcf2f5..a50e4d4918 100644
--- a/drivers/crypto/openssl/openssl_pmd_private.h
+++ b/drivers/crypto/openssl/openssl_pmd_private.h
@@ -80,6 +80,20 @@ struct __rte_cache_aligned openssl_qp {
 	 */
 };
 
+struct evp_ctx_pair {
+	EVP_CIPHER_CTX *cipher;
+	union {
+		EVP_MD_CTX *auth;
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		EVP_MAC_CTX *hmac;
+		EVP_MAC_CTX *cmac;
+#else
+		HMAC_CTX *hmac;
+		CMAC_CTX *cmac;
+#endif
+	};
+};
+
 /** OPENSSL crypto private session structure */
 struct __rte_cache_aligned openssl_session {
 	enum openssl_chain_order chain_order;
@@ -168,11 +182,12 @@ struct __rte_cache_aligned openssl_session {
 
 	uint16_t ctx_copies_len;
 	/* < number of entries in ctx_copies */
-	EVP_CIPHER_CTX *qp_ctx[];
-	/**< Flexible array member of per-queue-pair pointers to copies of EVP
-	 * context structure. Cipher contexts are not safe to use from multiple
-	 * cores simultaneously, so maintaining these copies allows avoiding
-	 * per-buffer copying into a temporary context.
+	struct evp_ctx_pair qp_ctx[];
+	/**< Flexible array member of per-queue-pair structures, each containing
+	 * pointers to copies of the cipher and auth EVP contexts. Cipher
+	 * contexts are not safe to use from multiple cores simultaneously, so
+	 * maintaining these copies allows avoiding per-buffer copying into a
+	 * temporary context.
 	 */
 };
 
diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index ebd1cab667..743b20c5b0 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -864,40 +864,45 @@ openssl_set_session_parameters(struct openssl_session *sess,
 void
 openssl_reset_session(struct openssl_session *sess)
 {
+	/* Free all the qp_ctx entries. */
 	for (uint16_t i = 0; i < sess->ctx_copies_len; i++) {
-		if (sess->qp_ctx[i] != NULL) {
-			EVP_CIPHER_CTX_free(sess->qp_ctx[i]);
-			sess->qp_ctx[i] = NULL;
+		if (sess->qp_ctx[i].cipher != NULL) {
+			EVP_CIPHER_CTX_free(sess->qp_ctx[i].cipher);
+			sess->qp_ctx[i].cipher = NULL;
+		}
+
+		switch (sess->auth.mode) {
+		case OPENSSL_AUTH_AS_AUTH:
+			EVP_MD_CTX_destroy(sess->qp_ctx[i].auth);
+			sess->qp_ctx[i].auth = NULL;
+			break;
+		case OPENSSL_AUTH_AS_HMAC:
+			free_hmac_ctx(sess->qp_ctx[i].hmac);
+			sess->qp_ctx[i].hmac = NULL;
+			break;
+		case OPENSSL_AUTH_AS_CMAC:
+			free_cmac_ctx(sess->qp_ctx[i].cmac);
+			sess->qp_ctx[i].cmac = NULL;
+			break;
 		}
 	}
 
 	EVP_CIPHER_CTX_free(sess->cipher.ctx);
 
-	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
-		EVP_CIPHER_CTX_free(sess->cipher.bpi_ctx);
-
 	switch (sess->auth.mode) {
 	case OPENSSL_AUTH_AS_AUTH:
 		EVP_MD_CTX_destroy(sess->auth.auth.ctx);
 		break;
 	case OPENSSL_AUTH_AS_HMAC:
-		EVP_PKEY_free(sess->auth.hmac.pkey);
-# if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		EVP_MAC_CTX_free(sess->auth.hmac.ctx);
-# else
-		HMAC_CTX_free(sess->auth.hmac.ctx);
-# endif
+		free_hmac_ctx(sess->auth.hmac.ctx);
 		break;
 	case OPENSSL_AUTH_AS_CMAC:
-# if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		EVP_MAC_CTX_free(sess->auth.cmac.ctx);
-# else
-		CMAC_CTX_free(sess->auth.cmac.ctx);
-# endif
-		break;
-	default:
+		free_cmac_ctx(sess->auth.cmac.ctx);
 		break;
 	}
+
+	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
+		EVP_CIPHER_CTX_free(sess->cipher.bpi_ctx);
 }
 
 /** Provide session for operation */
@@ -1443,6 +1448,9 @@ process_openssl_auth_mac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 	if (m == 0)
 		goto process_auth_err;
 
+	if (EVP_MAC_init(ctx, NULL, 0, NULL) <= 0)
+		goto process_auth_err;
+
 	src = rte_pktmbuf_mtod_offset(m, uint8_t *, offset);
 
 	l = rte_pktmbuf_data_len(m) - offset;
@@ -1469,11 +1477,9 @@ process_openssl_auth_mac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 	if (EVP_MAC_final(ctx, dst, &dstlen, DIGEST_LENGTH_MAX) != 1)
 		goto process_auth_err;
 
-	EVP_MAC_CTX_free(ctx);
 	return 0;
 
 process_auth_err:
-	EVP_MAC_CTX_free(ctx);
 	OPENSSL_LOG(ERR, "Process openssl auth failed");
 	return -EINVAL;
 }
@@ -1592,7 +1598,7 @@ get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
 	if (sess->ctx_copies_len == 0)
 		return sess->cipher.ctx;
 
-	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id];
+	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id].cipher;
 
 	if (unlikely(*lctx == NULL)) {
 #if OPENSSL_VERSION_NUMBER >= 0x30200000L
@@ -1607,6 +1613,86 @@ get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
 	return *lctx;
 }
 
+static inline EVP_MD_CTX *
+get_local_auth_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	/* If the array is not being used, just return the main context. */
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.auth.ctx;
+
+	EVP_MD_CTX **lctx = &sess->qp_ctx[qp->id].auth;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30100000L
+		/* EVP_MD_CTX_dup() added in OSSL 3.1 */
+		*lctx = EVP_MD_CTX_dup(sess->auth.auth.ctx);
+#else
+		*lctx = EVP_MD_CTX_new();
+		EVP_MD_CTX_copy(*lctx, sess->auth.auth.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+static inline EVP_MAC_CTX *
+#else
+static inline HMAC_CTX *
+#endif
+get_local_hmac_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.hmac.ctx;
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+	EVP_MAC_CTX **lctx =
+#else
+	HMAC_CTX **lctx =
+#endif
+		&sess->qp_ctx[qp->id].hmac;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		*lctx = EVP_MAC_CTX_dup(sess->auth.hmac.ctx);
+#else
+		*lctx = HMAC_CTX_new();
+		HMAC_CTX_copy(*lctx, sess->auth.hmac.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+static inline EVP_MAC_CTX *
+#else
+static inline CMAC_CTX *
+#endif
+get_local_cmac_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.cmac.ctx;
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+	EVP_MAC_CTX **lctx =
+#else
+	CMAC_CTX **lctx =
+#endif
+		&sess->qp_ctx[qp->id].cmac;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		*lctx = EVP_MAC_CTX_dup(sess->auth.cmac.ctx);
+#else
+		*lctx = CMAC_CTX_new();
+		CMAC_CTX_copy(*lctx, sess->auth.cmac.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
 /** Process auth/cipher combined operation */
 static void
 process_openssl_combined_op(struct openssl_qp *qp, struct rte_crypto_op *op,
@@ -1855,41 +1941,33 @@ process_openssl_auth_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 
 	switch (sess->auth.mode) {
 	case OPENSSL_AUTH_AS_AUTH:
-		ctx_a = EVP_MD_CTX_create();
-		EVP_MD_CTX_copy_ex(ctx_a, sess->auth.auth.ctx);
+		ctx_a = get_local_auth_ctx(sess, qp);
 		status = process_openssl_auth(mbuf_src, dst,
 				op->sym->auth.data.offset, NULL, NULL, srclen,
 				ctx_a, sess->auth.auth.evp_algo);
-		EVP_MD_CTX_destroy(ctx_a);
 		break;
 	case OPENSSL_AUTH_AS_HMAC:
+		ctx_h = get_local_hmac_ctx(sess, qp);
 # if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		ctx_h = EVP_MAC_CTX_dup(sess->auth.hmac.ctx);
 		status = process_openssl_auth_mac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_h);
 # else
-		ctx_h = HMAC_CTX_new();
-		HMAC_CTX_copy(ctx_h, sess->auth.hmac.ctx);
 		status = process_openssl_auth_hmac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_h);
-		HMAC_CTX_free(ctx_h);
 # endif
 		break;
 	case OPENSSL_AUTH_AS_CMAC:
+		ctx_c = get_local_cmac_ctx(sess, qp);
 # if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		ctx_c = EVP_MAC_CTX_dup(sess->auth.cmac.ctx);
 		status = process_openssl_auth_mac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_c);
 # else
-		ctx_c = CMAC_CTX_new();
-		CMAC_CTX_copy(ctx_c, sess->auth.cmac.ctx);
 		status = process_openssl_auth_cmac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_c);
-		CMAC_CTX_free(ctx_c);
 # endif
 		break;
 	default:
diff --git a/drivers/crypto/openssl/rte_openssl_pmd_ops.c b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
index 4209c6ab6f..1bbb855a59 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd_ops.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
@@ -805,7 +805,7 @@ openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 		unsigned int max_nb_qps = ((struct openssl_private *)
 				dev->data->dev_private)->max_nb_qpairs;
 		return sizeof(struct openssl_session) +
-				(sizeof(void *) * max_nb_qps);
+				(sizeof(struct evp_ctx_pair) * max_nb_qps);
 	}
 
 	/*
@@ -818,10 +818,11 @@ openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 
 	/*
 	 * Otherwise, the size of the flexible array member should be enough to
-	 * fit pointers to per-qp contexts.
+	 * fit pointers to per-qp contexts. This is twice the number of queue
+	 * pairs, to allow for auth and cipher contexts.
 	 */
 	return sizeof(struct openssl_session) +
-		(sizeof(void *) * dev->data->nb_queue_pairs);
+		(sizeof(struct evp_ctx_pair) * dev->data->nb_queue_pairs);
 }
 
 /** Returns the size of the asymmetric session structure */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 5/5] crypto/openssl: only set cipher padding once
  2024-06-03 18:43 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                     ` (3 preceding siblings ...)
  2024-06-03 18:43   ` [PATCH v2 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
@ 2024-06-03 18:43   ` Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:43 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Setting the cipher padding has a noticeable performance footprint,
and it doesn't need to be done for every call to
process_openssl_cipher_{en,de}crypt(). Setting it causes OpenSSL to set
it on every future context re-init. Thus, for every buffer after the
first one, the padding is being set twice.

Instead, just set the cipher padding once - when configuring the session
parameters - avoiding the unnecessary double setting behaviour. This is
skipped for AEAD ciphers, where disabling padding is not necessary.

Throughput performance uplift measurements for AES-CBC-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.97 |               3.72 |    25.2% |
|             256 |          8.10 |               9.42 |    16.3% |
|            1024 |         14.22 |              15.18 |     6.8% |
|            2048 |         16.28 |              16.93 |     4.0% |
|            4096 |         17.58 |              17.97 |     2.2% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         21.27 |              29.85 |    40.3% |
|             256 |         60.05 |              75.53 |    25.8% |
|            1024 |        110.11 |             121.56 |    10.4% |
|            2048 |        128.05 |             135.40 |     5.7% |
|            4096 |        139.45 |             143.76 |     3.1% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 743b20c5b0..f0f5082769 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -595,6 +595,8 @@ openssl_set_session_cipher_parameters(struct openssl_session *sess,
 		return -ENOTSUP;
 	}
 
+	EVP_CIPHER_CTX_set_padding(sess->cipher.ctx, 0);
+
 	return 0;
 }
 
@@ -1096,8 +1098,6 @@ process_openssl_cipher_encrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	if (EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, iv) <= 0)
 		goto process_cipher_encrypt_err;
 
-	EVP_CIPHER_CTX_set_padding(ctx, 0);
-
 	if (process_openssl_encryption_update(mbuf_src, offset, &dst,
 			srclen, ctx, inplace))
 		goto process_cipher_encrypt_err;
@@ -1146,8 +1146,6 @@ process_openssl_cipher_decrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	if (EVP_DecryptInit_ex(ctx, NULL, NULL, NULL, iv) <= 0)
 		goto process_cipher_decrypt_err;
 
-	EVP_CIPHER_CTX_set_padding(ctx, 0);
-
 	if (process_openssl_decryption_update(mbuf_src, offset, &dst,
 			srclen, ctx, inplace))
 		goto process_cipher_decrypt_err;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 0/5] OpenSSL PMD Optimisations
  2024-06-03 16:01 [PATCH 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                   ` (5 preceding siblings ...)
  2024-06-03 18:43 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
@ 2024-06-03 18:59 ` Jack Bond-Preston
  2024-06-03 18:59   ` [PATCH v2 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
                     ` (4 more replies)
  2024-06-06 10:20 ` [PATCH v3 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                   ` (2 subsequent siblings)
  9 siblings, 5 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:59 UTC (permalink / raw)
  Cc: dev

v2:
* Fixed missing * in patch 4 causing compilation failures.

---

The current implementation of the OpenSSL PMD has numerous performance issues.
These revolve around certain operations being performed on a per buffer/packet
basis, when they in fact could be performed less often - usually just during
initialisation.


[1/5]: fix GCM and CCM thread unsafe ctxs
=========================================
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is
implemented in the same naive (and inefficient) way as existing fixes for other
ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
===========================================
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR.


[5/5]: only set cipher padding once
=====================================
Fixes an inefficient usage of the OpenSSL API when disabling padding for
ciphers. This behaviour was introduced in commit 6b283a03216e ("crypto/openssl:
fix extra bytes written at end of data"), which fixes a bug - however, the
EVP_CIPHER_CTX_set_padding() call was placed in a suboptimal location.

This patch fixes this, preventing the padding being disabled for the cipher
twice per buffer (with the second essentially being a wasteful no-op).


[3/5] and [4/5]: per-queue-pair context clones
==============================================
[3/5] and [4/5] aim to fix the key issue that was identified with the
performance of the OpenSSL PMD - cloning of OpenSSL CTX structures on a
per-buffer basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session contexts
>
>     Session contexts are used for temporary storage when processing a
>     packet.
>     If packets for the same session are to be processed simultaneously on
>     multiple cores, separate contexts must be used.
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no longer be defined as a
>     variable on the stack: it must be allocated. This in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be used
from multiple threads simultaneously, so this patch is required for correctness
(assuming the need to support using the same openssl_session across multiple
lcores). The downside here is that, as the commit message notes, this does
reduce performance quite significantly.

It is worth noting that while ciphers were already correctly cloned for cipher
ops and auth ops, this behaviour was actually absent for combined ops (AES-GCM
and AES-CCM), due to this part of the fix being reverted in 75adf1eae44f
("crypto/openssl: update HMAC routine with 3.0 EVP API"). [1/5] addressed this
issue of correctness, and [3/5] implements a more performant fix on top of this.

These two patches aim to remedy the performance loss caused by the introduction
of cipher context cloning. An approach of maintaining an array of pointers,
inside the OpenSSL session structure, to per-queue-pair clones of the OpenSSL
CTXs is used. Consequently, there is no need to perform cloning of the context
for every buffer - whilst keeping the guarantee that one context is not being
used on multiple lcores simultaneously. The cloning of the main context into the
array's per-qp context entries is performed lazily/as-needed. There are some
trade-offs/judgement calls that were made:
 - The first call for a queue pair for an op from a given openssl_session will
   be roughly equivalent to an op from the existing implementation. However, all
   subsequent calls for the same openssl_session on the same queue pair will not
   incur this extra work. Thus, whilst the first op on a session on a queue pair
   will be slower than subsequent ones, this slower first op is still equivalent
   to *every* op without these patches. The alternative would be pre-populating
   this array when the openssl_session is initialised, but this would waste
   memory and processing time if not all queue pairs end up doing work from this
   openssl_session.
 - Each pointer inside the array of per-queue-pair pointers has not been cache
   aligned, because updates only occur on the first buffer per-queue-pair
   per-session, making the impact of false sharing negligible compared to the
   extra memory usage of the alignment.

[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and [4/5]
for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).

Compared to before, this approach comes with a drawback of extra memory usage -
the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, with a length
  equal to the number of qps in use multiplied by 2 (to allow auth and cipher
  contexts), per openssl_session structure. openssl_pmd_sym_session_get_size()
  is modified to return a size large enough to support this. At the time this
  function is called (before the user creates the session mempool), the PMD may
  not yet be configured with the requested number of queue pairs. In this case,
  the maximum number of queue pairs allowed by the PMD (current default is 8) is
  used, to ensure the allocations will be large enough. Thus, the user may be
  able to slightly reduce the memory used by OpenSSL sessions by first
  configuring the PMD's queue pair count, then requesting the size of the
  sessions and creating the session mempool. There is also a special case where
  the number of queue pairs is 1, in which case the array is not allocated or
  used at all. Overall, this memory usage by the session structure itself is
  worst-case 128 bytes per session (the default maximum number of queue pairs
  allowed by the OpenSSL PMD is 8, so 8qps * 8bytes * 2ctxs), plus the extra
  space to store the length of the array and auth context offset, resulting in
  an increase in total size from 152 bytes to 280 bytes.
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously, the
  clones were allocated and freed per-operation, meaning the lifetime of the
  allocations was only the duration of the operation. Now, these allocations are
  lifted out to share the lifetime of the session. This results in situations
  with many long-lived sessions shared across many queue pairs causing an
  increase in total memory usage.


Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
 - The version of OpenSSL used was 3.3.0
 - The hardware used for the benchmarks was the following two machine configs:
     * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
     * x86    : Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
 - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024, 2048,
   4096, 8192.
 - The worker lcore counts tested were: 1, 2, 4, 8
 - The algorithms and associated operations tested were:
     * Cipher-only       AES-CBC-128           (Encrypt and Decrypt)
     * Cipher-only       3DES-CTR-128          (Encrypt only)
     * Auth-only         SHA1-HMAC             (Generate only)
     * Auth-only         AES-CMAC              (Generate only)
     * AESNI             AES-GCM-128           (Encrypt and Decrypt)
     * Cipher-then-Auth  AES-CBC-128-HMAC-SHA1 (Encrypt only)
 - EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.

The sets of patches applied for benchmarks were:
 - No patches applied (HEAD of upstream main)
 -   [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
 - [1-2/5] applied (adds 3DES-CTR fix)
 - [1-3/5] applied (adds per-qp cipher contexts)
 - [1-4/5] applied (adds per-qp auth contexts)
 - [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm platform,
with all patches applied. Very similar results were achieved on the Intel
platform, and the full set of results, including the Intel ones, is available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput speedup achieved between the base (main branch
HEAD) and optimised (all patches applied) versions of the PMD was carried out,
with the varying worker lcore counts.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.84 |               2.04 |   144.6% |
|              64 |          1.61 |               3.72 |   131.3% |
|             128 |          2.97 |               6.24 |   110.2% |
|             256 |          5.14 |               9.42 |    83.2% |
|             512 |          8.10 |              12.62 |    55.7% |
|            1024 |         11.37 |              15.18 |    33.5% |
|            2048 |         14.26 |              16.93 |    18.7% |
|            4096 |         16.35 |              17.97 |     9.9% |
|            8192 |         17.61 |              18.51 |     5.1% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.53 |              16.49 |   974.8% |
|              64 |          3.04 |              29.85 |   881.3% |
|             128 |          5.96 |              50.07 |   739.8% |
|             256 |         10.54 |              75.53 |   616.5% |
|             512 |         21.60 |             101.14 |   368.2% |
|            1024 |         41.27 |             121.56 |   194.6% |
|            2048 |         72.99 |             135.40 |    85.5% |
|            4096 |        103.39 |             143.76 |    39.0% |
|            8192 |        125.48 |             148.06 |    18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so profiling of the existing
PMD implementation with multiple lcores was performed. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing this context. OpenSSL holds only one instance of each
EVP_CIPHER, and uses a reference counter to track freeing them. This means that
the original implementation spends a large amount of time incrementing and
decrementing this reference counter in EVP_CIPHER_CTX_copy and
EVP_CIPHER_CTX_free, respectively. For small buffer sizes, and with more lcores,
this reference count modification happens extremely frequently - thrashing this
refcount on all lcores and causing a huge slowdown. The optimised version avoids
this by not performing the copy and free (and thus associated refcount
modifications) on every buffer.
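
As an illustration (not the PMD's actual code, which is shown in the patches
below), the pre-patch per-buffer pattern looked roughly like the following -
both the copy and the free touch the reference counter of the single shared
EVP_CIPHER:

#include <openssl/evp.h>

/* Illustrative only: the per-buffer pattern used before these patches. */
static int
per_buffer_pattern(EVP_CIPHER_CTX *session_ctx)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    if (ctx == NULL)
        return -1;

    /* Increments the refcount of the shared EVP_CIPHER behind the ctx. */
    if (EVP_CIPHER_CTX_copy(ctx, session_ctx) != 1) {
        EVP_CIPHER_CTX_free(ctx);
        return -1;
    }

    /* ... process exactly one buffer with ctx ... */

    /* Decrements the same shared refcount again. */
    EVP_CIPHER_CTX_free(ctx);
    return 0;
}

With many lcores and small buffers, these refcount updates on the one shared
EVP_CIPHER object become the bottleneck; the per-qp clones remove both calls
from the per-buffer path entirely.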

SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.32 |               0.76 |   135.9% |
|              64 |          0.63 |               1.43 |   126.9% |
|             128 |          1.21 |               2.60 |   115.4% |
|             256 |          2.23 |               4.42 |    98.1% |
|             512 |          3.88 |               6.80 |    75.5% |
|            1024 |          6.13 |               9.30 |    51.8% |
|            2048 |          8.65 |              11.39 |    31.7% |
|            4096 |         10.90 |              12.85 |    17.9% |
|            8192 |         12.54 |              13.74 |     9.5% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.49 |               5.99 |  1110.3% |
|              64 |          0.98 |              11.30 |  1051.8% |
|             128 |          1.95 |              20.67 |   960.3% |
|             256 |          3.90 |              35.18 |   802.4% |
|             512 |          7.83 |              54.13 |   590.9% |
|            1024 |         15.80 |              74.11 |   369.2% |
|            2048 |         31.30 |              90.97 |   190.6% |
|            4096 |         58.59 |             102.70 |    75.3% |
|            8192 |         85.93 |             109.88 |    27.9% |

We can see the results are similar to those for AES-CBC-128 cipher operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] causes a slowdown in AES-GCM, as the fix for
the concurrency bug introduces a large overhead.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

However, applying [3/5] rectifies most of this performance drop, as shown by the
following results with it applied.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.39 |               1.28 |    -7.8% |
|              64 |          2.60 |               2.44 |    -6.2% |
|             128 |          4.77 |               4.45 |    -6.8% |
|             256 |          7.69 |               7.22 |    -6.1% |
|             512 |         11.31 |              10.97 |    -3.0% |
|            1024 |         15.33 |              15.07 |    -1.7% |
|            2048 |         18.74 |              18.51 |    -1.2% |
|            4096 |         21.11 |              20.96 |    -0.7% |
|            8192 |         22.55 |              22.50 |    -0.2% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |         10.59 |              10.35 |    -2.3% |
|              64 |         19.94 |              19.46 |    -2.4% |
|             128 |         36.32 |              35.64 |    -1.9% |
|             256 |         58.84 |              57.80 |    -1.8% |
|             512 |         87.38 |              87.37 |    -0.0% |
|            1024 |        119.71 |             120.22 |     0.4% |
|            2048 |        147.69 |             147.93 |     0.2% |
|            4096 |        167.39 |             167.48 |     0.1% |
|            8192 |        179.80 |             179.87 |     0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small slowdown
at smaller buffer sizes. This represents the overhead required to make AES-GCM
thread safe. These patches have rectified this lack of safety without causing a
significant performance impact, especially compared to naive per-buffer cipher
context cloning.

3DES-CTR Encrypt
----------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.12 |               0.22 |    89.7% |
|              64 |          0.16 |               0.22 |    43.6% |
|             128 |          0.18 |               0.23 |    22.3% |
|             256 |          0.20 |               0.23 |    10.8% |
|             512 |          0.21 |               0.23 |     5.1% |
|            1024 |          0.22 |               0.23 |     2.7% |
|            2048 |          0.22 |               0.23 |     1.3% |
|            4096 |          0.23 |               0.23 |     0.4% |
|            8192 |          0.23 |               0.23 |     0.4% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.68 |               1.77 |   160.1% |
|              64 |          1.00 |               1.78 |    78.3% |
|             128 |          1.29 |               1.80 |    39.6% |
|             256 |          1.50 |               1.80 |    19.8% |
|             512 |          1.64 |               1.80 |    10.0% |
|            1024 |          1.72 |               1.81 |     5.1% |
|            2048 |          1.76 |               1.81 |     2.7% |
|            4096 |          1.78 |               1.81 |     1.5% |
|            8192 |          1.80 |               1.81 |     0.7% |

[2/5] yields good results - the performance increase is high for lower buffer
sizes, where the cost of re-initialising the cipher implementation and key is
more significant relative to the cost of the cipher operation itself.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated
speedup tables, plus additional bar charts showing the throughput comparison
across different sets of applied patches) - for both Intel and Arm platforms -
are available. However, I'm not sure of the etiquette regarding attachments of
such files, so I haven't attached them for now. If you are interested in
reviewing them, please reach out and I will find a way to get them to you.

Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 244 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 ++-
 4 files changed, 260 insertions(+), 71 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs
  2024-06-03 18:59 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
@ 2024-06-03 18:59   ` Jack Bond-Preston
  2024-06-03 18:59   ` [PATCH v2 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:59 UTC (permalink / raw)
  To: Kai Ji, Fan Zhang, Akhil Goyal; +Cc: dev, stable, Wathsala Vithanage

Commit 67ab783b5d70 ("crypto/openssl: use local copy for session
contexts") introduced a fix for concurrency bugs which could occur when
using one OpenSSL PMD session across multiple cores simultaneously. The
solution was to clone the EVP contexts per-buffer to avoid them being
used concurrently.

However, part of commit 75adf1eae44f ("crypto/openssl: update HMAC
routine with 3.0 EVP API") reverted this fix, only for combined ops
(AES-GCM and AES-CCM), with no explanation. This commit fixes the issue
again, essentially reverting this part of the commit.

Throughput performance uplift measurements for AES-GCM-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

Fixes: 75adf1eae44f ("crypto/openssl: update HMAC routine with 3.0 EVP API")
Cc: stable@dpdk.org
Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index e8cb09defc..ca7ed30ec4 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -1590,6 +1590,9 @@ process_openssl_combined_op
 		return;
 	}
 
+	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
+	EVP_CIPHER_CTX_copy(ctx, sess->cipher.ctx);
+
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
 	if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC) {
@@ -1623,12 +1626,12 @@ process_openssl_combined_op
 			status = process_openssl_auth_encryption_gcm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, sess->cipher.ctx);
+					dst, tag, ctx);
 		else
 			status = process_openssl_auth_encryption_ccm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, taglen, sess->cipher.ctx);
+					dst, tag, taglen, ctx);
 
 	} else {
 		if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC ||
@@ -1636,14 +1639,16 @@ process_openssl_combined_op
 			status = process_openssl_auth_decryption_gcm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, sess->cipher.ctx);
+					dst, tag, ctx);
 		else
 			status = process_openssl_auth_decryption_ccm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, taglen, sess->cipher.ctx);
+					dst, tag, taglen, ctx);
 	}
 
+	EVP_CIPHER_CTX_free(ctx);
+
 	if (status != 0) {
 		if (status == (-EFAULT) &&
 				sess->auth.operation ==
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 2/5] crypto/openssl: only init 3DES-CTR key + impl once
  2024-06-03 18:59 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
  2024-06-03 18:59   ` [PATCH v2 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
@ 2024-06-03 18:59   ` Jack Bond-Preston
  2024-06-03 18:59   ` [PATCH v2 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:59 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently the 3DES-CTR cipher context is initialised for every buffer,
setting the cipher implementation and key - even though for every
buffer in the session these values will be the same.

Change to initialising the cipher context once, before any buffers are
processed, instead.

Throughput performance uplift measurements for 3DES-CTR encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.16 |               0.21 |    35.3% |
|             256 |          0.20 |               0.22 |     9.4% |
|            1024 |          0.22 |               0.23 |     2.3% |
|            2048 |          0.22 |               0.23 |     0.9% |
|            4096 |          0.22 |               0.23 |     0.9% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          1.01 |               1.34 |    32.9% |
|             256 |          1.51 |               1.66 |     9.9% |
|            1024 |          1.72 |               1.77 |     2.6% |
|            2048 |          1.76 |               1.78 |     1.1% |
|            4096 |          1.79 |               1.80 |     0.6% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index ca7ed30ec4..175ffda2b9 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -521,6 +521,15 @@ openssl_set_session_cipher_parameters(struct openssl_session *sess,
 				sess->cipher.key.length,
 				sess->cipher.key.data) != 0)
 			return -EINVAL;
+
+
+		/* We use 3DES encryption also for decryption.
+		 * IV is not important for 3DES ECB.
+		 */
+		if (EVP_EncryptInit_ex(sess->cipher.ctx, EVP_des_ede3_ecb(),
+				NULL, sess->cipher.key.data,  NULL) != 1)
+			return -EINVAL;
+
 		break;
 
 	case RTE_CRYPTO_CIPHER_DES_CBC:
@@ -1136,8 +1145,7 @@ process_openssl_cipher_decrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 /** Process cipher des 3 ctr encryption, decryption algorithm */
 static int
 process_openssl_cipher_des3ctr(struct rte_mbuf *mbuf_src, uint8_t *dst,
-		int offset, uint8_t *iv, uint8_t *key, int srclen,
-		EVP_CIPHER_CTX *ctx)
+		int offset, uint8_t *iv, int srclen, EVP_CIPHER_CTX *ctx)
 {
 	uint8_t ebuf[8], ctr[8];
 	int unused, n;
@@ -1155,12 +1163,6 @@ process_openssl_cipher_des3ctr(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	src = rte_pktmbuf_mtod_offset(m, uint8_t *, offset);
 	l = rte_pktmbuf_data_len(m) - offset;
 
-	/* We use 3DES encryption also for decryption.
-	 * IV is not important for 3DES ecb
-	 */
-	if (EVP_EncryptInit_ex(ctx, EVP_des_ede3_ecb(), NULL, key, NULL) <= 0)
-		goto process_cipher_des3ctr_err;
-
 	memcpy(ctr, iv, 8);
 
 	for (n = 0; n < srclen; n++) {
@@ -1701,8 +1703,7 @@ process_openssl_cipher_op
 					srclen, ctx_copy, inplace);
 	else
 		status = process_openssl_cipher_des3ctr(mbuf_src, dst,
-				op->sym->cipher.data.offset, iv,
-				sess->cipher.key.data, srclen,
+				op->sym->cipher.data.offset, iv, srclen,
 				ctx_copy);
 
 	EVP_CIPHER_CTX_free(ctx_copy);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 3/5] crypto/openssl: per-qp cipher context clones
  2024-06-03 18:59 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
  2024-06-03 18:59   ` [PATCH v2 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
  2024-06-03 18:59   ` [PATCH v2 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
@ 2024-06-03 18:59   ` Jack Bond-Preston
  2024-06-03 18:59   ` [PATCH v2 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
  2024-06-03 18:59   ` [PATCH v2 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:59 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently EVP_CIPHER_CTXs are allocated, copied to (from
openssl_session), and then freed for every cipher operation (i.e. per
packet). This is very inefficient, and avoidable.

Make each openssl_session hold an array of pointers to per-queue-pair
cipher context copies. These are populated on first use by allocating a
new context and copying from the main context. These copies can then be
used in a thread-safe manner by different worker lcores simultaneously.
Consequently the cipher context allocation and copy only has to happen
once - the first time a given qp uses an openssl_session. This brings
about a large performance boost.

Throughput performance uplift measurements for AES-CBC-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          1.51 |               2.94 |    94.4% |
|             256 |          4.90 |               8.05 |    64.3% |
|            1024 |         11.07 |              14.21 |    28.3% |
|            2048 |         14.03 |              16.28 |    16.0% |
|            4096 |         16.20 |              17.59 |     8.6% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          3.05 |              23.74 |   678.8% |
|             256 |         10.46 |              64.86 |   520.3% |
|            1024 |         40.97 |             113.80 |   177.7% |
|            2048 |         73.25 |             130.21 |    77.8% |
|            4096 |        103.89 |             140.62 |    35.4% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/openssl_pmd_private.h | 11 ++-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 78 ++++++++++++++------
 drivers/crypto/openssl/rte_openssl_pmd_ops.c | 34 ++++++++-
 3 files changed, 94 insertions(+), 29 deletions(-)

diff --git a/drivers/crypto/openssl/openssl_pmd_private.h b/drivers/crypto/openssl/openssl_pmd_private.h
index 0f038b218c..bad7dcf2f5 100644
--- a/drivers/crypto/openssl/openssl_pmd_private.h
+++ b/drivers/crypto/openssl/openssl_pmd_private.h
@@ -166,6 +166,14 @@ struct __rte_cache_aligned openssl_session {
 		/**< digest length */
 	} auth;
 
+	uint16_t ctx_copies_len;
+	/* < number of entries in ctx_copies */
+	EVP_CIPHER_CTX *qp_ctx[];
+	/**< Flexible array member of per-queue-pair pointers to copies of EVP
+	 * context structure. Cipher contexts are not safe to use from multiple
+	 * cores simultaneously, so maintaining these copies allows avoiding
+	 * per-buffer copying into a temporary context.
+	 */
 };
 
 /** OPENSSL crypto private asymmetric session structure */
@@ -217,7 +225,8 @@ struct __rte_cache_aligned openssl_asym_session {
 /** Set and validate OPENSSL crypto session parameters */
 extern int
 openssl_set_session_parameters(struct openssl_session *sess,
-		const struct rte_crypto_sym_xform *xform);
+		const struct rte_crypto_sym_xform *xform,
+		uint16_t nb_queue_pairs);
 
 /** Reset OPENSSL crypto session parameters */
 extern void
diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 175ffda2b9..ebd1cab667 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -788,7 +788,8 @@ openssl_set_session_aead_parameters(struct openssl_session *sess,
 /** Parse crypto xform chain and set private session parameters */
 int
 openssl_set_session_parameters(struct openssl_session *sess,
-		const struct rte_crypto_sym_xform *xform)
+		const struct rte_crypto_sym_xform *xform,
+		uint16_t nb_queue_pairs)
 {
 	const struct rte_crypto_sym_xform *cipher_xform = NULL;
 	const struct rte_crypto_sym_xform *auth_xform = NULL;
@@ -850,6 +851,12 @@ openssl_set_session_parameters(struct openssl_session *sess,
 		}
 	}
 
+	/*
+	 * With only one queue pair, the array of copies is not needed.
+	 * Otherwise, one entry per queue pair is required.
+	 */
+	sess->ctx_copies_len = nb_queue_pairs > 1 ? nb_queue_pairs : 0;
+
 	return 0;
 }
 
@@ -857,6 +864,13 @@ openssl_set_session_parameters(struct openssl_session *sess,
 void
 openssl_reset_session(struct openssl_session *sess)
 {
+	for (uint16_t i = 0; i < sess->ctx_copies_len; i++) {
+		if (sess->qp_ctx[i] != NULL) {
+			EVP_CIPHER_CTX_free(sess->qp_ctx[i]);
+			sess->qp_ctx[i] = NULL;
+		}
+	}
+
 	EVP_CIPHER_CTX_free(sess->cipher.ctx);
 
 	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
@@ -923,7 +937,7 @@ get_session(struct openssl_qp *qp, struct rte_crypto_op *op)
 		sess = (struct openssl_session *)_sess->driver_priv_data;
 
 		if (unlikely(openssl_set_session_parameters(sess,
-				op->sym->xform) != 0)) {
+				op->sym->xform, 1) != 0)) {
 			rte_mempool_put(qp->sess_mp, _sess);
 			sess = NULL;
 		}
@@ -1571,11 +1585,33 @@ process_openssl_auth_cmac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 # endif
 /*----------------------------------------------------------------------------*/
 
+static inline EVP_CIPHER_CTX *
+get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	/* If the array is not being used, just return the main context. */
+	if (sess->ctx_copies_len == 0)
+		return sess->cipher.ctx;
+
+	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id];
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30200000L
+		/* EVP_CIPHER_CTX_dup() added in OSSL 3.2 */
+		*lctx = EVP_CIPHER_CTX_dup(sess->cipher.ctx);
+#else
+		*lctx = EVP_CIPHER_CTX_new();
+		EVP_CIPHER_CTX_copy(*lctx, sess->cipher.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
 /** Process auth/cipher combined operation */
 static void
-process_openssl_combined_op
-		(struct rte_crypto_op *op, struct openssl_session *sess,
-		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+process_openssl_combined_op(struct openssl_qp *qp, struct rte_crypto_op *op,
+		struct openssl_session *sess, struct rte_mbuf *mbuf_src,
+		struct rte_mbuf *mbuf_dst)
 {
 	/* cipher */
 	uint8_t *dst = NULL, *iv, *tag, *aad;
@@ -1592,8 +1628,7 @@ process_openssl_combined_op
 		return;
 	}
 
-	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
-	EVP_CIPHER_CTX_copy(ctx, sess->cipher.ctx);
+	EVP_CIPHER_CTX *ctx = get_local_cipher_ctx(sess, qp);
 
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
@@ -1649,8 +1684,6 @@ process_openssl_combined_op
 					dst, tag, taglen, ctx);
 	}
 
-	EVP_CIPHER_CTX_free(ctx);
-
 	if (status != 0) {
 		if (status == (-EFAULT) &&
 				sess->auth.operation ==
@@ -1663,14 +1696,13 @@ process_openssl_combined_op
 
 /** Process cipher operation */
 static void
-process_openssl_cipher_op
-		(struct rte_crypto_op *op, struct openssl_session *sess,
-		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+process_openssl_cipher_op(struct openssl_qp *qp, struct rte_crypto_op *op,
+		struct openssl_session *sess, struct rte_mbuf *mbuf_src,
+		struct rte_mbuf *mbuf_dst)
 {
 	uint8_t *dst, *iv;
 	int srclen, status;
 	uint8_t inplace = (mbuf_src == mbuf_dst) ? 1 : 0;
-	EVP_CIPHER_CTX *ctx_copy;
 
 	/*
 	 * Segmented OOP destination buffer is not supported for encryption/
@@ -1689,24 +1721,22 @@ process_openssl_cipher_op
 
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
-	ctx_copy = EVP_CIPHER_CTX_new();
-	EVP_CIPHER_CTX_copy(ctx_copy, sess->cipher.ctx);
+
+	EVP_CIPHER_CTX *ctx = get_local_cipher_ctx(sess, qp);
 
 	if (sess->cipher.mode == OPENSSL_CIPHER_LIB)
 		if (sess->cipher.direction == RTE_CRYPTO_CIPHER_OP_ENCRYPT)
 			status = process_openssl_cipher_encrypt(mbuf_src, dst,
 					op->sym->cipher.data.offset, iv,
-					srclen, ctx_copy, inplace);
+					srclen, ctx, inplace);
 		else
 			status = process_openssl_cipher_decrypt(mbuf_src, dst,
 					op->sym->cipher.data.offset, iv,
-					srclen, ctx_copy, inplace);
+					srclen, ctx, inplace);
 	else
 		status = process_openssl_cipher_des3ctr(mbuf_src, dst,
-				op->sym->cipher.data.offset, iv, srclen,
-				ctx_copy);
+				op->sym->cipher.data.offset, iv, srclen, ctx);
 
-	EVP_CIPHER_CTX_free(ctx_copy);
 	if (status != 0)
 		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
 }
@@ -3111,13 +3141,13 @@ process_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 
 	switch (sess->chain_order) {
 	case OPENSSL_CHAIN_ONLY_CIPHER:
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_ONLY_AUTH:
 		process_openssl_auth_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_CIPHER_AUTH:
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		/* OOP */
 		if (msrc != mdst)
 			copy_plaintext(msrc, mdst, op);
@@ -3125,10 +3155,10 @@ process_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 		break;
 	case OPENSSL_CHAIN_AUTH_CIPHER:
 		process_openssl_auth_op(qp, op, sess, msrc, mdst);
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_COMBINED:
-		process_openssl_combined_op(op, sess, msrc, mdst);
+		process_openssl_combined_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_CIPHER_BPI:
 		process_openssl_docsis_bpi_op(op, sess, msrc, mdst);
diff --git a/drivers/crypto/openssl/rte_openssl_pmd_ops.c b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
index b16baaa08f..4209c6ab6f 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd_ops.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
@@ -794,9 +794,34 @@ openssl_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
 
 /** Returns the size of the symmetric session structure */
 static unsigned
-openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev __rte_unused)
+openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 {
-	return sizeof(struct openssl_session);
+	/*
+	 * For 0 qps, return the max size of the session - this is necessary if
+	 * the user calls into this function to create the session mempool,
+	 * without first configuring the number of qps for the cryptodev.
+	 */
+	if (dev->data->nb_queue_pairs == 0) {
+		unsigned int max_nb_qps = ((struct openssl_private *)
+				dev->data->dev_private)->max_nb_qpairs;
+		return sizeof(struct openssl_session) +
+				(sizeof(void *) * max_nb_qps);
+	}
+
+	/*
+	 * With only one queue pair, the thread safety of multiple context
+	 * copies is not necessary, so don't allocate extra memory for the
+	 * array.
+	 */
+	if (dev->data->nb_queue_pairs == 1)
+		return sizeof(struct openssl_session);
+
+	/*
+	 * Otherwise, the size of the flexible array member should be enough to
+	 * fit pointers to per-qp contexts.
+	 */
+	return sizeof(struct openssl_session) +
+		(sizeof(void *) * dev->data->nb_queue_pairs);
 }
 
 /** Returns the size of the asymmetric session structure */
@@ -808,7 +833,7 @@ openssl_pmd_asym_session_get_size(struct rte_cryptodev *dev __rte_unused)
 
 /** Configure the session from a crypto xform chain */
 static int
-openssl_pmd_sym_session_configure(struct rte_cryptodev *dev __rte_unused,
+openssl_pmd_sym_session_configure(struct rte_cryptodev *dev,
 		struct rte_crypto_sym_xform *xform,
 		struct rte_cryptodev_sym_session *sess)
 {
@@ -820,7 +845,8 @@ openssl_pmd_sym_session_configure(struct rte_cryptodev *dev __rte_unused,
 		return -EINVAL;
 	}
 
-	ret = openssl_set_session_parameters(sess_private_data, xform);
+	ret = openssl_set_session_parameters(sess_private_data, xform,
+			dev->data->nb_queue_pairs);
 	if (ret != 0) {
 		OPENSSL_LOG(ERR, "failed configure session parameters");
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 4/5] crypto/openssl: per-qp auth context clones
  2024-06-03 18:59 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                     ` (2 preceding siblings ...)
  2024-06-03 18:59   ` [PATCH v2 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
@ 2024-06-03 18:59   ` Jack Bond-Preston
  2024-06-03 18:59   ` [PATCH v2 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:59 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently EVP auth ctxs (e.g. EVP_MD_CTX, EVP_MAC_CTX) are allocated,
copied to (from openssl_session), and then freed for every auth
operation (i.e. per packet). This is very inefficient, and avoidable.

Make each openssl_session hold an array of structures, containing
pointers to per-queue-pair cipher and auth context copies. These are
populated on first use by allocating a new context and copying from the
main context. These copies can then be used in a thread-safe manner by
different worker lcores simultaneously. Consequently the auth context
allocation and copy only has to happen once - the first time a given qp
uses an openssl_session. This brings about a large performance boost.

Throughput performance uplift measurements for HMAC-SHA1 generate on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.63 |               1.42 |   123.5% |
|             256 |          2.24 |               4.40 |    96.4% |
|            1024 |          6.15 |               9.26 |    50.6% |
|            2048 |          8.68 |              11.38 |    31.1% |
|            4096 |         10.92 |              12.84 |    17.6% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.93 |              11.35 |  1122.5% |
|             256 |          3.70 |              35.30 |   853.7% |
|            1024 |         15.22 |              74.27 |   387.8% |
|            2048 |         30.20 |              91.08 |   201.6% |
|            4096 |         56.92 |             102.76 |    80.5% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/compat.h              |  26 ++++
 drivers/crypto/openssl/openssl_pmd_private.h |  25 +++-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 144 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |   7 +-
 4 files changed, 161 insertions(+), 41 deletions(-)

diff --git a/drivers/crypto/openssl/compat.h b/drivers/crypto/openssl/compat.h
index 9f9167c4f1..4c5ddfbf3a 100644
--- a/drivers/crypto/openssl/compat.h
+++ b/drivers/crypto/openssl/compat.h
@@ -5,6 +5,32 @@
 #ifndef __RTA_COMPAT_H__
 #define __RTA_COMPAT_H__
 
+#if OPENSSL_VERSION_NUMBER > 0x30000000L
+static __rte_always_inline void
+free_hmac_ctx(EVP_MAC_CTX *ctx)
+{
+	EVP_MAC_CTX_free(ctx);
+}
+
+static __rte_always_inline void
+free_cmac_ctx(EVP_MAC_CTX *ctx)
+{
+	EVP_MAC_CTX_free(ctx);
+}
+#else
+static __rte_always_inline void
+free_hmac_ctx(HMAC_CTX *ctx)
+{
+	HMAC_CTX_free(ctx);
+}
+
+static __rte_always_inline void
+free_cmac_ctx(CMAC_CTX *ctx)
+{
+	CMAC_CTX_free(ctx);
+}
+#endif
+
 #if (OPENSSL_VERSION_NUMBER < 0x10100000L)
 
 static __rte_always_inline int
diff --git a/drivers/crypto/openssl/openssl_pmd_private.h b/drivers/crypto/openssl/openssl_pmd_private.h
index bad7dcf2f5..a50e4d4918 100644
--- a/drivers/crypto/openssl/openssl_pmd_private.h
+++ b/drivers/crypto/openssl/openssl_pmd_private.h
@@ -80,6 +80,20 @@ struct __rte_cache_aligned openssl_qp {
 	 */
 };
 
+struct evp_ctx_pair {
+	EVP_CIPHER_CTX *cipher;
+	union {
+		EVP_MD_CTX *auth;
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		EVP_MAC_CTX *hmac;
+		EVP_MAC_CTX *cmac;
+#else
+		HMAC_CTX *hmac;
+		CMAC_CTX *cmac;
+#endif
+	};
+};
+
 /** OPENSSL crypto private session structure */
 struct __rte_cache_aligned openssl_session {
 	enum openssl_chain_order chain_order;
@@ -168,11 +182,12 @@ struct __rte_cache_aligned openssl_session {
 
 	uint16_t ctx_copies_len;
 	/* < number of entries in ctx_copies */
-	EVP_CIPHER_CTX *qp_ctx[];
-	/**< Flexible array member of per-queue-pair pointers to copies of EVP
-	 * context structure. Cipher contexts are not safe to use from multiple
-	 * cores simultaneously, so maintaining these copies allows avoiding
-	 * per-buffer copying into a temporary context.
+	struct evp_ctx_pair qp_ctx[];
+	/**< Flexible array member of per-queue-pair structures, each containing
+	 * pointers to copies of the cipher and auth EVP contexts. Cipher
+	 * contexts are not safe to use from multiple cores simultaneously, so
+	 * maintaining these copies allows avoiding per-buffer copying into a
+	 * temporary context.
 	 */
 };
 
diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index ebd1cab667..743b20c5b0 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -864,40 +864,45 @@ openssl_set_session_parameters(struct openssl_session *sess,
 void
 openssl_reset_session(struct openssl_session *sess)
 {
+	/* Free all the qp_ctx entries. */
 	for (uint16_t i = 0; i < sess->ctx_copies_len; i++) {
-		if (sess->qp_ctx[i] != NULL) {
-			EVP_CIPHER_CTX_free(sess->qp_ctx[i]);
-			sess->qp_ctx[i] = NULL;
+		if (sess->qp_ctx[i].cipher != NULL) {
+			EVP_CIPHER_CTX_free(sess->qp_ctx[i].cipher);
+			sess->qp_ctx[i].cipher = NULL;
+		}
+
+		switch (sess->auth.mode) {
+		case OPENSSL_AUTH_AS_AUTH:
+			EVP_MD_CTX_destroy(sess->qp_ctx[i].auth);
+			sess->qp_ctx[i].auth = NULL;
+			break;
+		case OPENSSL_AUTH_AS_HMAC:
+			free_hmac_ctx(sess->qp_ctx[i].hmac);
+			sess->qp_ctx[i].hmac = NULL;
+			break;
+		case OPENSSL_AUTH_AS_CMAC:
+			free_cmac_ctx(sess->qp_ctx[i].cmac);
+			sess->qp_ctx[i].cmac = NULL;
+			break;
 		}
 	}
 
 	EVP_CIPHER_CTX_free(sess->cipher.ctx);
 
-	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
-		EVP_CIPHER_CTX_free(sess->cipher.bpi_ctx);
-
 	switch (sess->auth.mode) {
 	case OPENSSL_AUTH_AS_AUTH:
 		EVP_MD_CTX_destroy(sess->auth.auth.ctx);
 		break;
 	case OPENSSL_AUTH_AS_HMAC:
-		EVP_PKEY_free(sess->auth.hmac.pkey);
-# if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		EVP_MAC_CTX_free(sess->auth.hmac.ctx);
-# else
-		HMAC_CTX_free(sess->auth.hmac.ctx);
-# endif
+		free_hmac_ctx(sess->auth.hmac.ctx);
 		break;
 	case OPENSSL_AUTH_AS_CMAC:
-# if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		EVP_MAC_CTX_free(sess->auth.cmac.ctx);
-# else
-		CMAC_CTX_free(sess->auth.cmac.ctx);
-# endif
-		break;
-	default:
+		free_cmac_ctx(sess->auth.cmac.ctx);
 		break;
 	}
+
+	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
+		EVP_CIPHER_CTX_free(sess->cipher.bpi_ctx);
 }
 
 /** Provide session for operation */
@@ -1443,6 +1448,9 @@ process_openssl_auth_mac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 	if (m == 0)
 		goto process_auth_err;
 
+	if (EVP_MAC_init(ctx, NULL, 0, NULL) <= 0)
+		goto process_auth_err;
+
 	src = rte_pktmbuf_mtod_offset(m, uint8_t *, offset);
 
 	l = rte_pktmbuf_data_len(m) - offset;
@@ -1469,11 +1477,9 @@ process_openssl_auth_mac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 	if (EVP_MAC_final(ctx, dst, &dstlen, DIGEST_LENGTH_MAX) != 1)
 		goto process_auth_err;
 
-	EVP_MAC_CTX_free(ctx);
 	return 0;
 
 process_auth_err:
-	EVP_MAC_CTX_free(ctx);
 	OPENSSL_LOG(ERR, "Process openssl auth failed");
 	return -EINVAL;
 }
@@ -1592,7 +1598,7 @@ get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
 	if (sess->ctx_copies_len == 0)
 		return sess->cipher.ctx;
 
-	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id];
+	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id].cipher;
 
 	if (unlikely(*lctx == NULL)) {
 #if OPENSSL_VERSION_NUMBER >= 0x30200000L
@@ -1607,6 +1613,86 @@ get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
 	return *lctx;
 }
 
+static inline EVP_MD_CTX *
+get_local_auth_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	/* If the array is not being used, just return the main context. */
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.auth.ctx;
+
+	EVP_MD_CTX **lctx = &sess->qp_ctx[qp->id].auth;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30100000L
+		/* EVP_MD_CTX_dup() added in OSSL 3.1 */
+		*lctx = EVP_MD_CTX_dup(sess->auth.auth.ctx);
+#else
+		*lctx = EVP_MD_CTX_new();
+		EVP_MD_CTX_copy(*lctx, sess->auth.auth.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+static inline EVP_MAC_CTX *
+#else
+static inline HMAC_CTX *
+#endif
+get_local_hmac_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.hmac.ctx;
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+	EVP_MAC_CTX **lctx =
+#else
+	HMAC_CTX **lctx =
+#endif
+		&sess->qp_ctx[qp->id].hmac;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		*lctx = EVP_MAC_CTX_dup(sess->auth.hmac.ctx);
+#else
+		*lctx = HMAC_CTX_new();
+		HMAC_CTX_copy(*lctx, sess->auth.hmac.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+static inline EVP_MAC_CTX *
+#else
+static inline CMAC_CTX *
+#endif
+get_local_cmac_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.cmac.ctx;
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+	EVP_MAC_CTX **lctx =
+#else
+	CMAC_CTX **lctx =
+#endif
+		&sess->qp_ctx[qp->id].cmac;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		*lctx = EVP_MAC_CTX_dup(sess->auth.cmac.ctx);
+#else
+		*lctx = CMAC_CTX_new();
+		CMAC_CTX_copy(*lctx, sess->auth.cmac.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
 /** Process auth/cipher combined operation */
 static void
 process_openssl_combined_op(struct openssl_qp *qp, struct rte_crypto_op *op,
@@ -1855,41 +1941,33 @@ process_openssl_auth_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 
 	switch (sess->auth.mode) {
 	case OPENSSL_AUTH_AS_AUTH:
-		ctx_a = EVP_MD_CTX_create();
-		EVP_MD_CTX_copy_ex(ctx_a, sess->auth.auth.ctx);
+		ctx_a = get_local_auth_ctx(sess, qp);
 		status = process_openssl_auth(mbuf_src, dst,
 				op->sym->auth.data.offset, NULL, NULL, srclen,
 				ctx_a, sess->auth.auth.evp_algo);
-		EVP_MD_CTX_destroy(ctx_a);
 		break;
 	case OPENSSL_AUTH_AS_HMAC:
+		ctx_h = get_local_hmac_ctx(sess, qp);
 # if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		ctx_h = EVP_MAC_CTX_dup(sess->auth.hmac.ctx);
 		status = process_openssl_auth_mac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_h);
 # else
-		ctx_h = HMAC_CTX_new();
-		HMAC_CTX_copy(ctx_h, sess->auth.hmac.ctx);
 		status = process_openssl_auth_hmac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_h);
-		HMAC_CTX_free(ctx_h);
 # endif
 		break;
 	case OPENSSL_AUTH_AS_CMAC:
+		ctx_c = get_local_cmac_ctx(sess, qp);
 # if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		ctx_c = EVP_MAC_CTX_dup(sess->auth.cmac.ctx);
 		status = process_openssl_auth_mac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_c);
 # else
-		ctx_c = CMAC_CTX_new();
-		CMAC_CTX_copy(ctx_c, sess->auth.cmac.ctx);
 		status = process_openssl_auth_cmac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_c);
-		CMAC_CTX_free(ctx_c);
 # endif
 		break;
 	default:
diff --git a/drivers/crypto/openssl/rte_openssl_pmd_ops.c b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
index 4209c6ab6f..1bbb855a59 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd_ops.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
@@ -805,7 +805,7 @@ openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 		unsigned int max_nb_qps = ((struct openssl_private *)
 				dev->data->dev_private)->max_nb_qpairs;
 		return sizeof(struct openssl_session) +
-				(sizeof(void *) * max_nb_qps);
+				(sizeof(struct evp_ctx_pair) * max_nb_qps);
 	}
 
 	/*
@@ -818,10 +818,11 @@ openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 
 	/*
 	 * Otherwise, the size of the flexible array member should be enough to
-	 * fit pointers to per-qp contexts.
+	 * fit pointers to per-qp contexts. This is twice the number of queue
+	 * pairs, to allow for auth and cipher contexts.
 	 */
 	return sizeof(struct openssl_session) +
-		(sizeof(void *) * dev->data->nb_queue_pairs);
+		(sizeof(struct evp_ctx_pair) * dev->data->nb_queue_pairs);
 }
 
 /** Returns the size of the asymmetric session structure */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 5/5] crypto/openssl: only set cipher padding once
  2024-06-03 18:59 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                     ` (3 preceding siblings ...)
  2024-06-03 18:59   ` [PATCH v2 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
@ 2024-06-03 18:59   ` Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-03 18:59 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Setting the cipher padding has a noticeable performance footprint,
and it doesn't need to be done for every call to
process_openssl_cipher_{en,de}crypt(). Setting it causes OpenSSL to set
it on every future context re-init. Thus, for every buffer after the
first one, the padding is being set twice.

Instead, just set the cipher padding once - when configuring the session
parameters - avoiding the unnecessary double setting behaviour. This is
skipped for AEAD ciphers, where disabling padding is not necessary.

Throughput performance uplift measurements for AES-CBC-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.97 |               3.72 |    25.2% |
|             256 |          8.10 |               9.42 |    16.3% |
|            1024 |         14.22 |              15.18 |     6.8% |
|            2048 |         16.28 |              16.93 |     4.0% |
|            4096 |         17.58 |              17.97 |     2.2% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         21.27 |              29.85 |    40.3% |
|             256 |         60.05 |              75.53 |    25.8% |
|            1024 |        110.11 |             121.56 |    10.4% |
|            2048 |        128.05 |             135.40 |     5.7% |
|            4096 |        139.45 |             143.76 |     3.1% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 743b20c5b0..f0f5082769 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -595,6 +595,8 @@ openssl_set_session_cipher_parameters(struct openssl_session *sess,
 		return -ENOTSUP;
 	}
 
+	EVP_CIPHER_CTX_set_padding(sess->cipher.ctx, 0);
+
 	return 0;
 }
 
@@ -1096,8 +1098,6 @@ process_openssl_cipher_encrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	if (EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, iv) <= 0)
 		goto process_cipher_encrypt_err;
 
-	EVP_CIPHER_CTX_set_padding(ctx, 0);
-
 	if (process_openssl_encryption_update(mbuf_src, offset, &dst,
 			srclen, ctx, inplace))
 		goto process_cipher_encrypt_err;
@@ -1146,8 +1146,6 @@ process_openssl_cipher_decrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	if (EVP_DecryptInit_ex(ctx, NULL, NULL, NULL, iv) <= 0)
 		goto process_cipher_decrypt_err;
 
-	EVP_CIPHER_CTX_set_padding(ctx, 0);
-
 	if (process_openssl_decryption_update(mbuf_src, offset, &dst,
 			srclen, ctx, inplace))
 		goto process_cipher_decrypt_err;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 0/5] OpenSSL PMD Optimisations
  2024-06-03 16:01 [PATCH 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                   ` (6 preceding siblings ...)
  2024-06-03 18:59 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
@ 2024-06-06 10:20 ` Jack Bond-Preston
  2024-06-06 10:20   ` [PATCH v3 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
                     ` (4 more replies)
  2024-06-07 12:47 ` [PATCH v4 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
  2024-06-24 16:14 ` [PATCH 0/5] OpenSSL PMD Optimisations Ji, Kai
  9 siblings, 5 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-06 10:20 UTC (permalink / raw)
  Cc: dev

v2:
* Fixed missing * in patch 4 causing compilation failures.

v3:
* Work around a lack of support for duplicating EVP_CIPHER_CTXs for
  AES-GCM and AES-CCM in OpenSSL versions 3.0.0 <= v < 3.2.0.
---

The current implementation of the OpenSSL PMD has numerous performance issues.
These revolve around certain operations being performed on a per buffer/packet
basis, when they in fact could be performed less often - usually just during
initialisation.


[1/5]: fix GCM and CCM thread unsafe ctxs
=========================================
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is
implemented in the same naive (and inefficient) way as existing fixes for other
ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
===========================================
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR.


[5/5]: only set cipher padding once
=====================================
Fixes an inefficient usage of the OpenSSL API when disabling padding for
ciphers. This behaviour was introduced in commit 6b283a03216e ("crypto/openssl:
fix extra bytes written at end of data"), which fixes a bug - however, the
EVP_CIPHER_CTX_set_padding() call was placed in a suboptimal location.

This patch fixes this, preventing the padding being disabled for the cipher
twice per buffer (with the second essentially being a wasteful no-op).


[3/5] and [4/5]: per-queue-pair context clones
==============================================
[3/5] and [4/5] aim to fix the key issue that was identified with the
performance of the OpenSSL PMD - cloning of OpenSSL CTX structures on a
per-buffer basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session contexts
>
>     Session contexts are used for temporary storage when processing a
>     packet.
>     If packets for the same session are to be processed simultaneously on
>     multiple cores, separate contexts must be used.
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no longer be defined as a
>     variable on the stack: it must be allocated. This in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be used
from multiple threads simultaneously, so this patch is required for correctness
(assuming the need to support using the same openssl_session across multiple
lcores). The downside here is that, as the commit message notes, this does
reduce performance quite significantly.

It is worth noting that while ciphers were already correctly cloned for cipher
ops and auth ops, this behaviour was actually absent for combined ops (AES-GCM
and AES-CCM), due to this part of the fix being reverted in 75adf1eae44f
("crypto/openssl: update HMAC routine with 3.0 EVP API"). [1/5] addressed this
issue of correctness, and [3/5] implements a more performant fix on top of this.

These two patches aim to remedy the performance loss caused by the introduction
of cipher context cloning. An approach of maintaining an array of pointers,
inside the OpenSSL session structure, to per-queue-pair clones of the OpenSSL
CTXs is used. Consequently, there is no need to perform cloning of the context
for every buffer - whilst keeping the guarantee that one context is not being
used on multiple lcores simultaneously. The cloning of the main context into the
array's per-qp context entries is performed lazily/as-needed. There are some
trade-offs/judgement calls that were made:
 - The first call for a queue pair for an op from a given openssl_session will
   be roughly equivalent to an op from the existing implementation. However, all
   subsequent calls for the same openssl_session on the same queue pair will not
   incur this extra work. Thus, whilst the first op on a session on a queue pair
   will be slower than subsequent ones, this slower first op is still equivalent
   to *every* op without these patches. The alternative would be pre-populating
   this array when the openssl_session is initialised, but this would waste
   memory and processing time if not all queue pairs end up doing work from this
   openssl_session.
 - Each pointer inside the array of per-queue-pair pointers has not been cache
   aligned, because updates only occur on the first buffer per-queue-pair
   per-session, making the impact of false sharing negligible compared to the
   extra memory usage of the alignment.

[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and [4/5]
for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).
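
As an illustration, the lazy per-queue-pair lookup introduced by [3/5] boils
down to something like the sketch below. The field names follow the patch, but
the helper name is illustrative, and the real code additionally handles the
AEAD/EVP_CIPHER_CTX_dup() special cases and error checking:

static EVP_CIPHER_CTX *
get_qp_cipher_ctx(struct openssl_session *sess, uint16_t qp_id)
{
        /* Single queue pair: use the main session context directly. */
        if (sess->ctx_copies_len == 0)
                return sess->cipher.ctx;

        /* First op from this qp on this session: clone the main context. */
        if (sess->qp_ctx[qp_id] == NULL) {
                sess->qp_ctx[qp_id] = EVP_CIPHER_CTX_new();
                EVP_CIPHER_CTX_copy(sess->qp_ctx[qp_id], sess->cipher.ctx);
        }

        /* Every subsequent op from this qp reuses the clone, copy-free. */
        return sess->qp_ctx[qp_id];
}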

Compared to before, this approach comes with a drawback of extra memory usage -
the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, with a length
  equal to the number of qps in use multiplied by 2 (to allow auth and cipher
  contexts), per openssl_session structure. openssl_pmd_sym_session_get_size()
  is modified to return a size large enough to support this. At the time this
  function is called (before the user creates the session mempool), the PMD may
  not yet be configured with the requested number of queue pairs. In this case,
  the maximum number of queue pairs allowed by the PMD (current default is 8) is
  used, to ensure the allocations will be large enough. Thus, the user may be
  able to slightly reduce the memory used by OpenSSL sessions by first
  configuring the PMD's queue pair count, then requesting the size of the
  sessions and creating the session mempool. There is also a special case where
  the number of queue pairs is 1, in which case the array is not allocated or
  used at all. Overall, this memory usage by the session structure itself is
  worst-case 128 bytes per session (the default maximum number of queue pairs
  allowed by the OpenSSL PMD is 8, so 8qps * 8bytes * 2ctxs), plus the extra
  space to store the length of the array and auth context offset, resulting in
  an increase in total size from 152 bytes to 280 bytes.
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously, the
  clones were allocated and freed per-operation, meaning the lifetime of the
  allocations was only the duration of the operation. Now, these allocations are
  lifted out to share the lifetime of the session. As a result, workloads with
  many long-lived sessions shared across many queue pairs will see an increase
  in total memory usage.
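
To make the first point above concrete, the sizing logic of
openssl_pmd_sym_session_get_size() described there amounts to roughly the
following sketch (the function name here is illustrative and the real code
reads the qp counts from the cryptodev data, but the arithmetic is the same):

static unsigned int
session_size_sketch(uint16_t nb_queue_pairs, uint16_t max_nb_qps)
{
        /* PMD not yet configured: assume the maximum number of qps
         * (default 8) so mempool elements are always large enough. */
        if (nb_queue_pairs == 0)
                nb_queue_pairs = max_nb_qps;

        /* Single queue pair: the per-qp array is never used. */
        if (nb_queue_pairs == 1)
                return sizeof(struct openssl_session);

        /* One {cipher, auth} context-pointer pair per queue pair:
         * worst case 8 qps * 16 B = 128 B on top of the base struct. */
        return sizeof(struct openssl_session) +
                (sizeof(struct evp_ctx_pair) * nb_queue_pairs);
}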


Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
 - The version of OpenSSL used was 3.3.0
 - The hardware used for the benchmarks was the following two machine configs:
     * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
     * x86    : Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
 - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024, 2048,
   4096, 8192.
 - The worker lcore counts tested were: 1, 2, 4, 8
 - The algorithms and associated operations tested were:
     * Cipher-only       AES-CBC-128           (Encrypt and Decrypt)
     * Cipher-only       3DES-CTR-128          (Encrypt only)
     * Auth-only         SHA1-HMAC             (Generate only)
     * Auth-only         AES-CMAC              (Generate only)
     * AESNI             AES-GCM-128           (Encrypt and Decrypt)
     * Cipher-then-Auth  AES-CBC-128-HMAC-SHA1 (Encrypt only)
 - EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.

The sets of patches applied for benchmarks were:
 - No patches applied (HEAD of upstream main)
 -   [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
 - [1-2/5] applied (adds 3DES-CTR fix)
 - [1-3/5] applied (adds per-qp cipher contexts)
 - [1-4/5] applied (adds per-qp auth contexts)
 - [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm platform,
with all patches applied. Very similar results were achieved on the Intel
platform, and the full set of results, including the Intel ones, is available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput speedup achieved between the base (main branch
HEAD) and optimised (all patches applied) versions of the PMD was carried out,
with the varying worker lcore counts.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.84 |               2.04 |   144.6% |
|              64 |          1.61 |               3.72 |   131.3% |
|             128 |          2.97 |               6.24 |   110.2% |
|             256 |          5.14 |               9.42 |    83.2% |
|             512 |          8.10 |              12.62 |    55.7% |
|            1024 |         11.37 |              15.18 |    33.5% |
|            2048 |         14.26 |              16.93 |    18.7% |
|            4096 |         16.35 |              17.97 |     9.9% |
|            8192 |         17.61 |              18.51 |     5.1% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.53 |              16.49 |   974.8% |
|              64 |          3.04 |              29.85 |   881.3% |
|             128 |          5.96 |              50.07 |   739.8% |
|             256 |         10.54 |              75.53 |   616.5% |
|             512 |         21.60 |             101.14 |   368.2% |
|            1024 |         41.27 |             121.56 |   194.6% |
|            2048 |         72.99 |             135.40 |    85.5% |
|            4096 |        103.39 |             143.76 |    39.0% |
|            8192 |        125.48 |             148.06 |    18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so profiling of the existing
PMD implementation with multiple lcores was performed. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing this context. OpenSSL holds only one instance of each
EVP_CIPHER, and uses a reference counter to track freeing them. This means that
the original implementation spends a large amount of time incrementing and
decrementing this reference counter in EVP_CIPHER_CTX_copy and
EVP_CIPHER_CTX_free, respectively. For small buffer sizes, and with more lcores,
this reference count modification happens extremely frequently - thrashing this
refcount on all lcores and causing a huge slowdown. The optimised version avoids
this by not performing the copy and free (and thus associated refcount
modifications) on every buffer.
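
In other words, the per-buffer hot path changes roughly as follows (a
simplified sketch with error handling omitted; get_qp_cipher_ctx() is the
illustrative helper sketched earlier, and sess/qp_id are as in the driver):

/* Before: each buffer copies and then frees a context; every copy and
 * free atomically bumps the shared EVP_CIPHER reference count, which is
 * heavily contended across lcores at small buffer sizes. */
EVP_CIPHER_CTX *tmp = EVP_CIPHER_CTX_new();
EVP_CIPHER_CTX_copy(tmp, sess->cipher.ctx);     /* refcount++ */
/* ... process one buffer with tmp ... */
EVP_CIPHER_CTX_free(tmp);                       /* refcount-- */

/* After: the per-qp clone is created once and then reused, so the
 * refcount is only touched on the first buffer per qp per session. */
EVP_CIPHER_CTX *ctx = get_qp_cipher_ctx(sess, qp_id);
/* ... process one buffer with ctx ... */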

SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.32 |               0.76 |   135.9% |
|              64 |          0.63 |               1.43 |   126.9% |
|             128 |          1.21 |               2.60 |   115.4% |
|             256 |          2.23 |               4.42 |    98.1% |
|             512 |          3.88 |               6.80 |    75.5% |
|            1024 |          6.13 |               9.30 |    51.8% |
|            2048 |          8.65 |              11.39 |    31.7% |
|            4096 |         10.90 |              12.85 |    17.9% |
|            8192 |         12.54 |              13.74 |     9.5% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.49 |               5.99 |  1110.3% |
|              64 |          0.98 |              11.30 |  1051.8% |
|             128 |          1.95 |              20.67 |   960.3% |
|             256 |          3.90 |              35.18 |   802.4% |
|             512 |          7.83 |              54.13 |   590.9% |
|            1024 |         15.80 |              74.11 |   369.2% |
|            2048 |         31.30 |              90.97 |   190.6% |
|            4096 |         58.59 |             102.70 |    75.3% |
|            8192 |         85.93 |             109.88 |    27.9% |

We can see that the results are similar to those for the AES-CBC-128 cipher operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] causes a slowdown in AES-GCM, as the fix for
the concurrency bug introduces a large overhead.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

However, applying [3/5] recovers most of this performance drop, as the following
results (measured with it applied) show.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.39 |               1.28 |    -7.8% |
|              64 |          2.60 |               2.44 |    -6.2% |
|             128 |          4.77 |               4.45 |    -6.8% |
|             256 |          7.69 |               7.22 |    -6.1% |
|             512 |         11.31 |              10.97 |    -3.0% |
|            1024 |         15.33 |              15.07 |    -1.7% |
|            2048 |         18.74 |              18.51 |    -1.2% |
|            4096 |         21.11 |              20.96 |    -0.7% |
|            8192 |         22.55 |              22.50 |    -0.2% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |         10.59 |              10.35 |    -2.3% |
|              64 |         19.94 |              19.46 |    -2.4% |
|             128 |         36.32 |              35.64 |    -1.9% |
|             256 |         58.84 |              57.80 |    -1.8% |
|             512 |         87.38 |              87.37 |    -0.0% |
|            1024 |        119.71 |             120.22 |     0.4% |
|            2048 |        147.69 |             147.93 |     0.2% |
|            4096 |        167.39 |             167.48 |     0.1% |
|            8192 |        179.80 |             179.87 |     0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small slowdown
at smaller buffer sizes. This represents the overhead required to make AES-GCM
thread safe. These patches have rectified this lack of safety without causing a
significant performance impact, especially compared to naive per-buffer cipher
context cloning.

3DES-CTR Encrypt
----------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.12 |               0.22 |    89.7% |
|              64 |          0.16 |               0.22 |    43.6% |
|             128 |          0.18 |               0.23 |    22.3% |
|             256 |          0.20 |               0.23 |    10.8% |
|             512 |          0.21 |               0.23 |     5.1% |
|            1024 |          0.22 |               0.23 |     2.7% |
|            2048 |          0.22 |               0.23 |     1.3% |
|            4096 |          0.23 |               0.23 |     0.4% |
|            8192 |          0.23 |               0.23 |     0.4% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.68 |               1.77 |   160.1% |
|              64 |          1.00 |               1.78 |    78.3% |
|             128 |          1.29 |               1.80 |    39.6% |
|             256 |          1.50 |               1.80 |    19.8% |
|             512 |          1.64 |               1.80 |    10.0% |
|            1024 |          1.72 |               1.81 |     5.1% |
|            2048 |          1.76 |               1.81 |     2.7% |
|            4096 |          1.78 |               1.81 |     1.5% |
|            8192 |          1.80 |               1.81 |     0.7% |

[2/5] yields good results - the performance increase is largest for smaller
buffer sizes, where the cost of re-initialising the extra parameters is more
significant relative to the cost of the cipher operation itself.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated
speedup tables, plus additional bar charts showing the throughput comparison
across different sets of applied patches) - for both Intel and Arm platforms -
are available. However, I'm not sure of the etiquette regarding attachments of
such files, so I haven't attached them for now. If you are interested in
reviewing them, please reach out and I will find a way to get them to you.

Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 316 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 +-
 4 files changed, 316 insertions(+), 87 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs
  2024-06-06 10:20 ` [PATCH v3 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
@ 2024-06-06 10:20   ` Jack Bond-Preston
  2024-06-06 10:44     ` [EXTERNAL] " Akhil Goyal
  2024-06-06 10:20   ` [PATCH v3 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-06 10:20 UTC (permalink / raw)
  To: Kai Ji, Akhil Goyal, Fan Zhang; +Cc: dev, stable, Wathsala Vithanage

Commit 67ab783b5d70 ("crypto/openssl: use local copy for session
contexts") introduced a fix for concurrency bugs which could occur when
using one OpenSSL PMD session across multiple cores simultaneously. The
solution was to clone the EVP contexts per-buffer to avoid them being
used concurrently.

However, part of commit 75adf1eae44f ("crypto/openssl: update HMAC
routine with 3.0 EVP API") reverted this fix, only for combined ops
(AES-GCM and AES-CCM).

Fix the concurrency issue by cloning EVP contexts per-buffer. An extra
workaround is required for OpenSSL versions which are >= 3.0.0 and
< 3.2.0. This is because, prior to OpenSSL 3.2.0, EVP_CIPHER_CTX_copy()
is not implemented for AES-GCM or AES-CCM. When using these OpenSSL
versions, create and initialise the context from scratch, per-buffer.

Throughput performance uplift measurements for AES-GCM-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

Fixes: 75adf1eae44f ("crypto/openssl: update HMAC routine with 3.0 EVP API")
Cc: stable@dpdk.org
Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 84 ++++++++++++++++++------
 1 file changed, 64 insertions(+), 20 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index e8cb09defc..3f7f4d8c37 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -350,7 +350,8 @@ get_aead_algo(enum rte_crypto_aead_algorithm sess_algo, size_t keylen,
 static int
 openssl_set_sess_aead_enc_param(struct openssl_session *sess,
 		enum rte_crypto_aead_algorithm algo,
-		uint8_t tag_len, const uint8_t *key)
+		uint8_t tag_len, const uint8_t *key,
+		EVP_CIPHER_CTX **ctx)
 {
 	int iv_type = 0;
 	unsigned int do_ccm;
@@ -378,7 +379,7 @@ openssl_set_sess_aead_enc_param(struct openssl_session *sess,
 	}
 
 	sess->cipher.mode = OPENSSL_CIPHER_LIB;
-	sess->cipher.ctx = EVP_CIPHER_CTX_new();
+	*ctx = EVP_CIPHER_CTX_new();
 
 	if (get_aead_algo(algo, sess->cipher.key.length,
 			&sess->cipher.evp_algo) != 0)
@@ -388,19 +389,19 @@ openssl_set_sess_aead_enc_param(struct openssl_session *sess,
 
 	sess->chain_order = OPENSSL_CHAIN_COMBINED;
 
-	if (EVP_EncryptInit_ex(sess->cipher.ctx, sess->cipher.evp_algo,
+	if (EVP_EncryptInit_ex(*ctx, sess->cipher.evp_algo,
 			NULL, NULL, NULL) <= 0)
 		return -EINVAL;
 
-	if (EVP_CIPHER_CTX_ctrl(sess->cipher.ctx, iv_type, sess->iv.length,
+	if (EVP_CIPHER_CTX_ctrl(*ctx, iv_type, sess->iv.length,
 			NULL) <= 0)
 		return -EINVAL;
 
 	if (do_ccm)
-		EVP_CIPHER_CTX_ctrl(sess->cipher.ctx, EVP_CTRL_CCM_SET_TAG,
+		EVP_CIPHER_CTX_ctrl(*ctx, EVP_CTRL_CCM_SET_TAG,
 				tag_len, NULL);
 
-	if (EVP_EncryptInit_ex(sess->cipher.ctx, NULL, NULL, key, NULL) <= 0)
+	if (EVP_EncryptInit_ex(*ctx, NULL, NULL, key, NULL) <= 0)
 		return -EINVAL;
 
 	return 0;
@@ -410,7 +411,8 @@ openssl_set_sess_aead_enc_param(struct openssl_session *sess,
 static int
 openssl_set_sess_aead_dec_param(struct openssl_session *sess,
 		enum rte_crypto_aead_algorithm algo,
-		uint8_t tag_len, const uint8_t *key)
+		uint8_t tag_len, const uint8_t *key,
+		EVP_CIPHER_CTX **ctx)
 {
 	int iv_type = 0;
 	unsigned int do_ccm = 0;
@@ -437,7 +439,7 @@ openssl_set_sess_aead_dec_param(struct openssl_session *sess,
 	}
 
 	sess->cipher.mode = OPENSSL_CIPHER_LIB;
-	sess->cipher.ctx = EVP_CIPHER_CTX_new();
+	*ctx = EVP_CIPHER_CTX_new();
 
 	if (get_aead_algo(algo, sess->cipher.key.length,
 			&sess->cipher.evp_algo) != 0)
@@ -447,24 +449,54 @@ openssl_set_sess_aead_dec_param(struct openssl_session *sess,
 
 	sess->chain_order = OPENSSL_CHAIN_COMBINED;
 
-	if (EVP_DecryptInit_ex(sess->cipher.ctx, sess->cipher.evp_algo,
+	if (EVP_DecryptInit_ex(*ctx, sess->cipher.evp_algo,
 			NULL, NULL, NULL) <= 0)
 		return -EINVAL;
 
-	if (EVP_CIPHER_CTX_ctrl(sess->cipher.ctx, iv_type,
+	if (EVP_CIPHER_CTX_ctrl(*ctx, iv_type,
 			sess->iv.length, NULL) <= 0)
 		return -EINVAL;
 
 	if (do_ccm)
-		EVP_CIPHER_CTX_ctrl(sess->cipher.ctx, EVP_CTRL_CCM_SET_TAG,
+		EVP_CIPHER_CTX_ctrl(*ctx, EVP_CTRL_CCM_SET_TAG,
 				tag_len, NULL);
 
-	if (EVP_DecryptInit_ex(sess->cipher.ctx, NULL, NULL, key, NULL) <= 0)
+	if (EVP_DecryptInit_ex(*ctx, NULL, NULL, key, NULL) <= 0)
 		return -EINVAL;
 
 	return 0;
 }
 
+static int openssl_aesni_ctx_clone(EVP_CIPHER_CTX **dest,
+		struct openssl_session *sess)
+{
+#if (OPENSSL_VERSION_NUMBER > 0x30200000L)
+	*dest = EVP_CIPHER_CTX_dup(sess->ctx);
+	return 0;
+#elif (OPENSSL_VERSION_NUMBER > 0x30000000L)
+	/* OpenSSL versions 3.0.0 <= V < 3.2.0 have no dupctx() implementation
+	 * for AES-GCM and AES-CCM. In this case, we have to create new empty
+	 * contexts and initialise, as we did the original context.
+	 */
+	if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC)
+		sess->aead_algo = RTE_CRYPTO_AEAD_AES_GCM;
+
+	if (sess->cipher.direction == RTE_CRYPTO_CIPHER_OP_ENCRYPT)
+		return openssl_set_sess_aead_enc_param(sess, sess->aead_algo,
+				sess->auth.digest_length, sess->cipher.key.data,
+				dest);
+	else
+		return openssl_set_sess_aead_dec_param(sess, sess->aead_algo,
+				sess->auth.digest_length, sess->cipher.key.data,
+				dest);
+#else
+	*dest = EVP_CIPHER_CTX_new();
+	if (EVP_CIPHER_CTX_copy(*dest, sess->cipher.ctx) != 1)
+		return -EINVAL;
+	return 0;
+#endif
+}
+
 /** Set session cipher parameters */
 static int
 openssl_set_session_cipher_parameters(struct openssl_session *sess,
@@ -623,12 +655,14 @@ openssl_set_session_auth_parameters(struct openssl_session *sess,
 			return openssl_set_sess_aead_enc_param(sess,
 						RTE_CRYPTO_AEAD_AES_GCM,
 						xform->auth.digest_length,
-						xform->auth.key.data);
+						xform->auth.key.data,
+						&sess->cipher.ctx);
 		else
 			return openssl_set_sess_aead_dec_param(sess,
 						RTE_CRYPTO_AEAD_AES_GCM,
 						xform->auth.digest_length,
-						xform->auth.key.data);
+						xform->auth.key.data,
+						&sess->cipher.ctx);
 		break;
 
 	case RTE_CRYPTO_AUTH_MD5:
@@ -770,10 +804,12 @@ openssl_set_session_aead_parameters(struct openssl_session *sess,
 	/* Select cipher direction */
 	if (xform->aead.op == RTE_CRYPTO_AEAD_OP_ENCRYPT)
 		return openssl_set_sess_aead_enc_param(sess, xform->aead.algo,
-				xform->aead.digest_length, xform->aead.key.data);
+				xform->aead.digest_length, xform->aead.key.data,
+				&sess->cipher.ctx);
 	else
 		return openssl_set_sess_aead_dec_param(sess, xform->aead.algo,
-				xform->aead.digest_length, xform->aead.key.data);
+				xform->aead.digest_length, xform->aead.key.data,
+				&sess->cipher.ctx);
 }
 
 /** Parse crypto xform chain and set private session parameters */
@@ -1590,6 +1626,12 @@ process_openssl_combined_op
 		return;
 	}
 
+	EVP_CIPHER_CTX *ctx;
+	if (openssl_aesni_ctx_clone(&ctx, sess) != 0) {
+		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
+		return;
+	}
+
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
 	if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC) {
@@ -1623,12 +1665,12 @@ process_openssl_combined_op
 			status = process_openssl_auth_encryption_gcm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, sess->cipher.ctx);
+					dst, tag, ctx);
 		else
 			status = process_openssl_auth_encryption_ccm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, taglen, sess->cipher.ctx);
+					dst, tag, taglen, ctx);
 
 	} else {
 		if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC ||
@@ -1636,14 +1678,16 @@ process_openssl_combined_op
 			status = process_openssl_auth_decryption_gcm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, sess->cipher.ctx);
+					dst, tag, ctx);
 		else
 			status = process_openssl_auth_decryption_ccm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, taglen, sess->cipher.ctx);
+					dst, tag, taglen, ctx);
 	}
 
+	EVP_CIPHER_CTX_free(ctx);
+
 	if (status != 0) {
 		if (status == (-EFAULT) &&
 				sess->auth.operation ==
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 2/5] crypto/openssl: only init 3DES-CTR key + impl once
  2024-06-06 10:20 ` [PATCH v3 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
  2024-06-06 10:20   ` [PATCH v3 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
@ 2024-06-06 10:20   ` Jack Bond-Preston
  2024-06-06 10:20   ` [PATCH v3 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-06 10:20 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently the 3DES-CTR cipher context is initialised for every buffer,
setting the cipher implementation and key - even though for every
buffer in the session these values will be the same.

Change to initialising the cipher context once, before any buffers are
processed, instead.

Throughput performance uplift measurements for 3DES-CTR encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.16 |               0.21 |    35.3% |
|             256 |          0.20 |               0.22 |     9.4% |
|            1024 |          0.22 |               0.23 |     2.3% |
|            2048 |          0.22 |               0.23 |     0.9% |
|            4096 |          0.22 |               0.23 |     0.9% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          1.01 |               1.34 |    32.9% |
|             256 |          1.51 |               1.66 |     9.9% |
|            1024 |          1.72 |               1.77 |     2.6% |
|            2048 |          1.76 |               1.78 |     1.1% |
|            4096 |          1.79 |               1.80 |     0.6% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 3f7f4d8c37..70f2069985 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -553,6 +553,15 @@ openssl_set_session_cipher_parameters(struct openssl_session *sess,
 				sess->cipher.key.length,
 				sess->cipher.key.data) != 0)
 			return -EINVAL;
+
+
+		/* We use 3DES encryption also for decryption.
+		 * IV is not important for 3DES ECB.
+		 */
+		if (EVP_EncryptInit_ex(sess->cipher.ctx, EVP_des_ede3_ecb(),
+				NULL, sess->cipher.key.data,  NULL) != 1)
+			return -EINVAL;
+
 		break;
 
 	case RTE_CRYPTO_CIPHER_DES_CBC:
@@ -1172,8 +1181,7 @@ process_openssl_cipher_decrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 /** Process cipher des 3 ctr encryption, decryption algorithm */
 static int
 process_openssl_cipher_des3ctr(struct rte_mbuf *mbuf_src, uint8_t *dst,
-		int offset, uint8_t *iv, uint8_t *key, int srclen,
-		EVP_CIPHER_CTX *ctx)
+		int offset, uint8_t *iv, int srclen, EVP_CIPHER_CTX *ctx)
 {
 	uint8_t ebuf[8], ctr[8];
 	int unused, n;
@@ -1191,12 +1199,6 @@ process_openssl_cipher_des3ctr(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	src = rte_pktmbuf_mtod_offset(m, uint8_t *, offset);
 	l = rte_pktmbuf_data_len(m) - offset;
 
-	/* We use 3DES encryption also for decryption.
-	 * IV is not important for 3DES ecb
-	 */
-	if (EVP_EncryptInit_ex(ctx, EVP_des_ede3_ecb(), NULL, key, NULL) <= 0)
-		goto process_cipher_des3ctr_err;
-
 	memcpy(ctr, iv, 8);
 
 	for (n = 0; n < srclen; n++) {
@@ -1740,8 +1742,7 @@ process_openssl_cipher_op
 					srclen, ctx_copy, inplace);
 	else
 		status = process_openssl_cipher_des3ctr(mbuf_src, dst,
-				op->sym->cipher.data.offset, iv,
-				sess->cipher.key.data, srclen,
+				op->sym->cipher.data.offset, iv, srclen,
 				ctx_copy);
 
 	EVP_CIPHER_CTX_free(ctx_copy);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 3/5] crypto/openssl: per-qp cipher context clones
  2024-06-06 10:20 ` [PATCH v3 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
  2024-06-06 10:20   ` [PATCH v3 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
  2024-06-06 10:20   ` [PATCH v3 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
@ 2024-06-06 10:20   ` Jack Bond-Preston
  2024-06-06 10:20   ` [PATCH v3 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
  2024-06-06 10:20   ` [PATCH v3 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-06 10:20 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently EVP_CIPHER_CTXs are allocated, copied to (from
openssl_session), and then freed for every cipher operation (ie. per
packet). This is very inefficient, and avoidable.

Make each openssl_session hold an array of pointers to per-queue-pair
cipher context copies. These are populated on first use by allocating a
new context and copying from the main context. These copies can then be
used in a thread-safe manner by different worker lcores simultaneously.
Consequently the cipher context allocation and copy only has to happen
once - the first time a given qp uses an openssl_session. This brings
about a large performance boost.

Throughput performance uplift measurements for AES-CBC-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          1.51 |               2.94 |    94.4% |
|             256 |          4.90 |               8.05 |    64.3% |
|            1024 |         11.07 |              14.21 |    28.3% |
|            2048 |         14.03 |              16.28 |    16.0% |
|            4096 |         16.20 |              17.59 |     8.6% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          3.05 |              23.74 |   678.8% |
|             256 |         10.46 |              64.86 |   520.3% |
|            1024 |         40.97 |             113.80 |   177.7% |
|            2048 |         73.25 |             130.21 |    77.8% |
|            4096 |        103.89 |             140.62 |    35.4% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/openssl_pmd_private.h |  11 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 105 ++++++++++++-------
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  34 +++++-
 3 files changed, 108 insertions(+), 42 deletions(-)

diff --git a/drivers/crypto/openssl/openssl_pmd_private.h b/drivers/crypto/openssl/openssl_pmd_private.h
index 0f038b218c..bad7dcf2f5 100644
--- a/drivers/crypto/openssl/openssl_pmd_private.h
+++ b/drivers/crypto/openssl/openssl_pmd_private.h
@@ -166,6 +166,14 @@ struct __rte_cache_aligned openssl_session {
 		/**< digest length */
 	} auth;
 
+	uint16_t ctx_copies_len;
+	/* < number of entries in ctx_copies */
+	EVP_CIPHER_CTX *qp_ctx[];
+	/**< Flexible array member of per-queue-pair pointers to copies of EVP
+	 * context structure. Cipher contexts are not safe to use from multiple
+	 * cores simultaneously, so maintaining these copies allows avoiding
+	 * per-buffer copying into a temporary context.
+	 */
 };
 
 /** OPENSSL crypto private asymmetric session structure */
@@ -217,7 +225,8 @@ struct __rte_cache_aligned openssl_asym_session {
 /** Set and validate OPENSSL crypto session parameters */
 extern int
 openssl_set_session_parameters(struct openssl_session *sess,
-		const struct rte_crypto_sym_xform *xform);
+		const struct rte_crypto_sym_xform *xform,
+		uint16_t nb_queue_pairs);
 
 /** Reset OPENSSL crypto session parameters */
 extern void
diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 70f2069985..df44cc097e 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -467,13 +467,10 @@ openssl_set_sess_aead_dec_param(struct openssl_session *sess,
 	return 0;
 }
 
+#if (OPENSSL_VERSION_NUMBER >= 0x30000000L && OPENSSL_VERSION_NUMBER < 0x30200000L)
 static int openssl_aesni_ctx_clone(EVP_CIPHER_CTX **dest,
 		struct openssl_session *sess)
 {
-#if (OPENSSL_VERSION_NUMBER > 0x30200000L)
-	*dest = EVP_CIPHER_CTX_dup(sess->ctx);
-	return 0;
-#elif (OPENSSL_VERSION_NUMBER > 0x30000000L)
 	/* OpenSSL versions 3.0.0 <= V < 3.2.0 have no dupctx() implementation
 	 * for AES-GCM and AES-CCM. In this case, we have to create new empty
 	 * contexts and initialise, as we did the original context.
@@ -489,13 +486,8 @@ static int openssl_aesni_ctx_clone(EVP_CIPHER_CTX **dest,
 		return openssl_set_sess_aead_dec_param(sess, sess->aead_algo,
 				sess->auth.digest_length, sess->cipher.key.data,
 				dest);
-#else
-	*dest = EVP_CIPHER_CTX_new();
-	if (EVP_CIPHER_CTX_copy(*dest, sess->cipher.ctx) != 1)
-		return -EINVAL;
-	return 0;
-#endif
 }
+#endif
 
 /** Set session cipher parameters */
 static int
@@ -824,7 +816,8 @@ openssl_set_session_aead_parameters(struct openssl_session *sess,
 /** Parse crypto xform chain and set private session parameters */
 int
 openssl_set_session_parameters(struct openssl_session *sess,
-		const struct rte_crypto_sym_xform *xform)
+		const struct rte_crypto_sym_xform *xform,
+		uint16_t nb_queue_pairs)
 {
 	const struct rte_crypto_sym_xform *cipher_xform = NULL;
 	const struct rte_crypto_sym_xform *auth_xform = NULL;
@@ -886,6 +879,12 @@ openssl_set_session_parameters(struct openssl_session *sess,
 		}
 	}
 
+	/*
+	 * With only one queue pair, the array of copies is not needed.
+	 * Otherwise, one entry per queue pair is required.
+	 */
+	sess->ctx_copies_len = nb_queue_pairs > 1 ? nb_queue_pairs : 0;
+
 	return 0;
 }
 
@@ -893,6 +892,13 @@ openssl_set_session_parameters(struct openssl_session *sess,
 void
 openssl_reset_session(struct openssl_session *sess)
 {
+	for (uint16_t i = 0; i < sess->ctx_copies_len; i++) {
+		if (sess->qp_ctx[i] != NULL) {
+			EVP_CIPHER_CTX_free(sess->qp_ctx[i]);
+			sess->qp_ctx[i] = NULL;
+		}
+	}
+
 	EVP_CIPHER_CTX_free(sess->cipher.ctx);
 
 	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
@@ -959,7 +965,7 @@ get_session(struct openssl_qp *qp, struct rte_crypto_op *op)
 		sess = (struct openssl_session *)_sess->driver_priv_data;
 
 		if (unlikely(openssl_set_session_parameters(sess,
-				op->sym->xform) != 0)) {
+				op->sym->xform, 1) != 0)) {
 			rte_mempool_put(qp->sess_mp, _sess);
 			sess = NULL;
 		}
@@ -1607,11 +1613,45 @@ process_openssl_auth_cmac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 # endif
 /*----------------------------------------------------------------------------*/
 
+static inline EVP_CIPHER_CTX *
+get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	/* If the array is not being used, just return the main context. */
+	if (sess->ctx_copies_len == 0)
+		return sess->cipher.ctx;
+
+	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id];
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30200000L
+		/* EVP_CIPHER_CTX_dup() added in OSSL 3.2 */
+		*lctx = EVP_CIPHER_CTX_dup(sess->cipher.ctx);
+		return *lctx;
+#elif OPENSSL_VERSION_NUMBER >= 0x30000000L
+		if (sess->chain_order == OPENSSL_CHAIN_COMBINED) {
+			/* AESNI special-cased to use openssl_aesni_ctx_clone()
+			 * to allow for working around lack of
+			 * EVP_CIPHER_CTX_copy support for 3.0.0 <= OSSL Version
+			 * < 3.2.0.
+			 */
+			if (openssl_aesni_ctx_clone(lctx, sess) != 0)
+				*lctx = NULL;
+			return *lctx;
+		}
+#endif
+
+		*lctx = EVP_CIPHER_CTX_new();
+		EVP_CIPHER_CTX_copy(*lctx, sess->cipher.ctx);
+	}
+
+	return *lctx;
+}
+
 /** Process auth/cipher combined operation */
 static void
-process_openssl_combined_op
-		(struct rte_crypto_op *op, struct openssl_session *sess,
-		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+process_openssl_combined_op(struct openssl_qp *qp, struct rte_crypto_op *op,
+		struct openssl_session *sess, struct rte_mbuf *mbuf_src,
+		struct rte_mbuf *mbuf_dst)
 {
 	/* cipher */
 	uint8_t *dst = NULL, *iv, *tag, *aad;
@@ -1628,11 +1668,7 @@ process_openssl_combined_op
 		return;
 	}
 
-	EVP_CIPHER_CTX *ctx;
-	if (openssl_aesni_ctx_clone(&ctx, sess) != 0) {
-		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
-		return;
-	}
+	EVP_CIPHER_CTX *ctx = get_local_cipher_ctx(sess, qp);
 
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
@@ -1688,8 +1724,6 @@ process_openssl_combined_op
 					dst, tag, taglen, ctx);
 	}
 
-	EVP_CIPHER_CTX_free(ctx);
-
 	if (status != 0) {
 		if (status == (-EFAULT) &&
 				sess->auth.operation ==
@@ -1702,14 +1736,13 @@ process_openssl_combined_op
 
 /** Process cipher operation */
 static void
-process_openssl_cipher_op
-		(struct rte_crypto_op *op, struct openssl_session *sess,
-		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+process_openssl_cipher_op(struct openssl_qp *qp, struct rte_crypto_op *op,
+		struct openssl_session *sess, struct rte_mbuf *mbuf_src,
+		struct rte_mbuf *mbuf_dst)
 {
 	uint8_t *dst, *iv;
 	int srclen, status;
 	uint8_t inplace = (mbuf_src == mbuf_dst) ? 1 : 0;
-	EVP_CIPHER_CTX *ctx_copy;
 
 	/*
 	 * Segmented OOP destination buffer is not supported for encryption/
@@ -1728,24 +1761,22 @@ process_openssl_cipher_op
 
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
-	ctx_copy = EVP_CIPHER_CTX_new();
-	EVP_CIPHER_CTX_copy(ctx_copy, sess->cipher.ctx);
+
+	EVP_CIPHER_CTX *ctx = get_local_cipher_ctx(sess, qp);
 
 	if (sess->cipher.mode == OPENSSL_CIPHER_LIB)
 		if (sess->cipher.direction == RTE_CRYPTO_CIPHER_OP_ENCRYPT)
 			status = process_openssl_cipher_encrypt(mbuf_src, dst,
 					op->sym->cipher.data.offset, iv,
-					srclen, ctx_copy, inplace);
+					srclen, ctx, inplace);
 		else
 			status = process_openssl_cipher_decrypt(mbuf_src, dst,
 					op->sym->cipher.data.offset, iv,
-					srclen, ctx_copy, inplace);
+					srclen, ctx, inplace);
 	else
 		status = process_openssl_cipher_des3ctr(mbuf_src, dst,
-				op->sym->cipher.data.offset, iv, srclen,
-				ctx_copy);
+				op->sym->cipher.data.offset, iv, srclen, ctx);
 
-	EVP_CIPHER_CTX_free(ctx_copy);
 	if (status != 0)
 		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
 }
@@ -3150,13 +3181,13 @@ process_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 
 	switch (sess->chain_order) {
 	case OPENSSL_CHAIN_ONLY_CIPHER:
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_ONLY_AUTH:
 		process_openssl_auth_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_CIPHER_AUTH:
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		/* OOP */
 		if (msrc != mdst)
 			copy_plaintext(msrc, mdst, op);
@@ -3164,10 +3195,10 @@ process_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 		break;
 	case OPENSSL_CHAIN_AUTH_CIPHER:
 		process_openssl_auth_op(qp, op, sess, msrc, mdst);
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_COMBINED:
-		process_openssl_combined_op(op, sess, msrc, mdst);
+		process_openssl_combined_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_CIPHER_BPI:
 		process_openssl_docsis_bpi_op(op, sess, msrc, mdst);
diff --git a/drivers/crypto/openssl/rte_openssl_pmd_ops.c b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
index b16baaa08f..4209c6ab6f 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd_ops.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
@@ -794,9 +794,34 @@ openssl_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
 
 /** Returns the size of the symmetric session structure */
 static unsigned
-openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev __rte_unused)
+openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 {
-	return sizeof(struct openssl_session);
+	/*
+	 * For 0 qps, return the max size of the session - this is necessary if
+	 * the user calls into this function to create the session mempool,
+	 * without first configuring the number of qps for the cryptodev.
+	 */
+	if (dev->data->nb_queue_pairs == 0) {
+		unsigned int max_nb_qps = ((struct openssl_private *)
+				dev->data->dev_private)->max_nb_qpairs;
+		return sizeof(struct openssl_session) +
+				(sizeof(void *) * max_nb_qps);
+	}
+
+	/*
+	 * With only one queue pair, the thread safety of multiple context
+	 * copies is not necessary, so don't allocate extra memory for the
+	 * array.
+	 */
+	if (dev->data->nb_queue_pairs == 1)
+		return sizeof(struct openssl_session);
+
+	/*
+	 * Otherwise, the size of the flexible array member should be enough to
+	 * fit pointers to per-qp contexts.
+	 */
+	return sizeof(struct openssl_session) +
+		(sizeof(void *) * dev->data->nb_queue_pairs);
 }
 
 /** Returns the size of the asymmetric session structure */
@@ -808,7 +833,7 @@ openssl_pmd_asym_session_get_size(struct rte_cryptodev *dev __rte_unused)
 
 /** Configure the session from a crypto xform chain */
 static int
-openssl_pmd_sym_session_configure(struct rte_cryptodev *dev __rte_unused,
+openssl_pmd_sym_session_configure(struct rte_cryptodev *dev,
 		struct rte_crypto_sym_xform *xform,
 		struct rte_cryptodev_sym_session *sess)
 {
@@ -820,7 +845,8 @@ openssl_pmd_sym_session_configure(struct rte_cryptodev *dev __rte_unused,
 		return -EINVAL;
 	}
 
-	ret = openssl_set_session_parameters(sess_private_data, xform);
+	ret = openssl_set_session_parameters(sess_private_data, xform,
+			dev->data->nb_queue_pairs);
 	if (ret != 0) {
 		OPENSSL_LOG(ERR, "failed configure session parameters");
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 4/5] crypto/openssl: per-qp auth context clones
  2024-06-06 10:20 ` [PATCH v3 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                     ` (2 preceding siblings ...)
  2024-06-06 10:20   ` [PATCH v3 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
@ 2024-06-06 10:20   ` Jack Bond-Preston
  2024-06-06 10:20   ` [PATCH v3 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-06 10:20 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently EVP auth ctxs (e.g. EVP_MD_CTX, EVP_MAC_CTX) are allocated,
copied to (from openssl_session), and then freed for every auth
operation (i.e. per packet). This is very inefficient, and avoidable.

Make each openssl_session hold an array of structures, containing
pointers to per-queue-pair cipher and auth context copies. These are
populated on first use by allocating a new context and copying from the
main context. These copies can then be used in a thread-safe manner by
different worker lcores simultaneously. Consequently the auth context
allocation and copy only has to happen once - the first time a given qp
uses an openssl_session. This brings about a large performance boost.

Throughput performance uplift measurements for HMAC-SHA1 generate on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.63 |               1.42 |   123.5% |
|             256 |          2.24 |               4.40 |    96.4% |
|            1024 |          6.15 |               9.26 |    50.6% |
|            2048 |          8.68 |              11.38 |    31.1% |
|            4096 |         10.92 |              12.84 |    17.6% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.93 |              11.35 |  1122.5% |
|             256 |          3.70 |              35.30 |   853.7% |
|            1024 |         15.22 |              74.27 |   387.8% |
|            2048 |         30.20 |              91.08 |   201.6% |
|            4096 |         56.92 |             102.76 |    80.5% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/compat.h              |  26 ++++
 drivers/crypto/openssl/openssl_pmd_private.h |  25 +++-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 144 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |   7 +-
 4 files changed, 161 insertions(+), 41 deletions(-)

diff --git a/drivers/crypto/openssl/compat.h b/drivers/crypto/openssl/compat.h
index 9f9167c4f1..4c5ddfbf3a 100644
--- a/drivers/crypto/openssl/compat.h
+++ b/drivers/crypto/openssl/compat.h
@@ -5,6 +5,32 @@
 #ifndef __RTA_COMPAT_H__
 #define __RTA_COMPAT_H__
 
+#if OPENSSL_VERSION_NUMBER > 0x30000000L
+static __rte_always_inline void
+free_hmac_ctx(EVP_MAC_CTX *ctx)
+{
+	EVP_MAC_CTX_free(ctx);
+}
+
+static __rte_always_inline void
+free_cmac_ctx(EVP_MAC_CTX *ctx)
+{
+	EVP_MAC_CTX_free(ctx);
+}
+#else
+static __rte_always_inline void
+free_hmac_ctx(HMAC_CTX *ctx)
+{
+	HMAC_CTX_free(ctx);
+}
+
+static __rte_always_inline void
+free_cmac_ctx(CMAC_CTX *ctx)
+{
+	CMAC_CTX_free(ctx);
+}
+#endif
+
 #if (OPENSSL_VERSION_NUMBER < 0x10100000L)
 
 static __rte_always_inline int
diff --git a/drivers/crypto/openssl/openssl_pmd_private.h b/drivers/crypto/openssl/openssl_pmd_private.h
index bad7dcf2f5..a50e4d4918 100644
--- a/drivers/crypto/openssl/openssl_pmd_private.h
+++ b/drivers/crypto/openssl/openssl_pmd_private.h
@@ -80,6 +80,20 @@ struct __rte_cache_aligned openssl_qp {
 	 */
 };
 
+struct evp_ctx_pair {
+	EVP_CIPHER_CTX *cipher;
+	union {
+		EVP_MD_CTX *auth;
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		EVP_MAC_CTX *hmac;
+		EVP_MAC_CTX *cmac;
+#else
+		HMAC_CTX *hmac;
+		CMAC_CTX *cmac;
+#endif
+	};
+};
+
 /** OPENSSL crypto private session structure */
 struct __rte_cache_aligned openssl_session {
 	enum openssl_chain_order chain_order;
@@ -168,11 +182,12 @@ struct __rte_cache_aligned openssl_session {
 
 	uint16_t ctx_copies_len;
 	/* < number of entries in ctx_copies */
-	EVP_CIPHER_CTX *qp_ctx[];
-	/**< Flexible array member of per-queue-pair pointers to copies of EVP
-	 * context structure. Cipher contexts are not safe to use from multiple
-	 * cores simultaneously, so maintaining these copies allows avoiding
-	 * per-buffer copying into a temporary context.
+	struct evp_ctx_pair qp_ctx[];
+	/**< Flexible array member of per-queue-pair structures, each containing
+	 * pointers to copies of the cipher and auth EVP contexts. Cipher
+	 * contexts are not safe to use from multiple cores simultaneously, so
+	 * maintaining these copies allows avoiding per-buffer copying into a
+	 * temporary context.
 	 */
 };
 
diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index df44cc097e..1ee917da5c 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -892,40 +892,45 @@ openssl_set_session_parameters(struct openssl_session *sess,
 void
 openssl_reset_session(struct openssl_session *sess)
 {
+	/* Free all the qp_ctx entries. */
 	for (uint16_t i = 0; i < sess->ctx_copies_len; i++) {
-		if (sess->qp_ctx[i] != NULL) {
-			EVP_CIPHER_CTX_free(sess->qp_ctx[i]);
-			sess->qp_ctx[i] = NULL;
+		if (sess->qp_ctx[i].cipher != NULL) {
+			EVP_CIPHER_CTX_free(sess->qp_ctx[i].cipher);
+			sess->qp_ctx[i].cipher = NULL;
+		}
+
+		switch (sess->auth.mode) {
+		case OPENSSL_AUTH_AS_AUTH:
+			EVP_MD_CTX_destroy(sess->qp_ctx[i].auth);
+			sess->qp_ctx[i].auth = NULL;
+			break;
+		case OPENSSL_AUTH_AS_HMAC:
+			free_hmac_ctx(sess->qp_ctx[i].hmac);
+			sess->qp_ctx[i].hmac = NULL;
+			break;
+		case OPENSSL_AUTH_AS_CMAC:
+			free_cmac_ctx(sess->qp_ctx[i].cmac);
+			sess->qp_ctx[i].cmac = NULL;
+			break;
 		}
 	}
 
 	EVP_CIPHER_CTX_free(sess->cipher.ctx);
 
-	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
-		EVP_CIPHER_CTX_free(sess->cipher.bpi_ctx);
-
 	switch (sess->auth.mode) {
 	case OPENSSL_AUTH_AS_AUTH:
 		EVP_MD_CTX_destroy(sess->auth.auth.ctx);
 		break;
 	case OPENSSL_AUTH_AS_HMAC:
-		EVP_PKEY_free(sess->auth.hmac.pkey);
-# if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		EVP_MAC_CTX_free(sess->auth.hmac.ctx);
-# else
-		HMAC_CTX_free(sess->auth.hmac.ctx);
-# endif
+		free_hmac_ctx(sess->auth.hmac.ctx);
 		break;
 	case OPENSSL_AUTH_AS_CMAC:
-# if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		EVP_MAC_CTX_free(sess->auth.cmac.ctx);
-# else
-		CMAC_CTX_free(sess->auth.cmac.ctx);
-# endif
-		break;
-	default:
+		free_cmac_ctx(sess->auth.cmac.ctx);
 		break;
 	}
+
+	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
+		EVP_CIPHER_CTX_free(sess->cipher.bpi_ctx);
 }
 
 /** Provide session for operation */
@@ -1471,6 +1476,9 @@ process_openssl_auth_mac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 	if (m == 0)
 		goto process_auth_err;
 
+	if (EVP_MAC_init(ctx, NULL, 0, NULL) <= 0)
+		goto process_auth_err;
+
 	src = rte_pktmbuf_mtod_offset(m, uint8_t *, offset);
 
 	l = rte_pktmbuf_data_len(m) - offset;
@@ -1497,11 +1505,9 @@ process_openssl_auth_mac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 	if (EVP_MAC_final(ctx, dst, &dstlen, DIGEST_LENGTH_MAX) != 1)
 		goto process_auth_err;
 
-	EVP_MAC_CTX_free(ctx);
 	return 0;
 
 process_auth_err:
-	EVP_MAC_CTX_free(ctx);
 	OPENSSL_LOG(ERR, "Process openssl auth failed");
 	return -EINVAL;
 }
@@ -1620,7 +1626,7 @@ get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
 	if (sess->ctx_copies_len == 0)
 		return sess->cipher.ctx;
 
-	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id];
+	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id].cipher;
 
 	if (unlikely(*lctx == NULL)) {
 #if OPENSSL_VERSION_NUMBER >= 0x30200000L
@@ -1647,6 +1653,86 @@ get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
 	return *lctx;
 }
 
+static inline EVP_MD_CTX *
+get_local_auth_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	/* If the array is not being used, just return the main context. */
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.auth.ctx;
+
+	EVP_MD_CTX **lctx = &sess->qp_ctx[qp->id].auth;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30100000L
+		/* EVP_MD_CTX_dup() added in OSSL 3.1 */
+		*lctx = EVP_MD_CTX_dup(sess->auth.auth.ctx);
+#else
+		*lctx = EVP_MD_CTX_new();
+		EVP_MD_CTX_copy(*lctx, sess->auth.auth.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+static inline EVP_MAC_CTX *
+#else
+static inline HMAC_CTX *
+#endif
+get_local_hmac_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.hmac.ctx;
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+	EVP_MAC_CTX **lctx =
+#else
+	HMAC_CTX **lctx =
+#endif
+		&sess->qp_ctx[qp->id].hmac;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		*lctx = EVP_MAC_CTX_dup(sess->auth.hmac.ctx);
+#else
+		*lctx = HMAC_CTX_new();
+		HMAC_CTX_copy(*lctx, sess->auth.hmac.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+static inline EVP_MAC_CTX *
+#else
+static inline CMAC_CTX *
+#endif
+get_local_cmac_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.cmac.ctx;
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+	EVP_MAC_CTX **lctx =
+#else
+	CMAC_CTX **lctx =
+#endif
+		&sess->qp_ctx[qp->id].cmac;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		*lctx = EVP_MAC_CTX_dup(sess->auth.cmac.ctx);
+#else
+		*lctx = CMAC_CTX_new();
+		CMAC_CTX_copy(*lctx, sess->auth.cmac.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
 /** Process auth/cipher combined operation */
 static void
 process_openssl_combined_op(struct openssl_qp *qp, struct rte_crypto_op *op,
@@ -1895,41 +1981,33 @@ process_openssl_auth_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 
 	switch (sess->auth.mode) {
 	case OPENSSL_AUTH_AS_AUTH:
-		ctx_a = EVP_MD_CTX_create();
-		EVP_MD_CTX_copy_ex(ctx_a, sess->auth.auth.ctx);
+		ctx_a = get_local_auth_ctx(sess, qp);
 		status = process_openssl_auth(mbuf_src, dst,
 				op->sym->auth.data.offset, NULL, NULL, srclen,
 				ctx_a, sess->auth.auth.evp_algo);
-		EVP_MD_CTX_destroy(ctx_a);
 		break;
 	case OPENSSL_AUTH_AS_HMAC:
+		ctx_h = get_local_hmac_ctx(sess, qp);
 # if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		ctx_h = EVP_MAC_CTX_dup(sess->auth.hmac.ctx);
 		status = process_openssl_auth_mac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_h);
 # else
-		ctx_h = HMAC_CTX_new();
-		HMAC_CTX_copy(ctx_h, sess->auth.hmac.ctx);
 		status = process_openssl_auth_hmac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_h);
-		HMAC_CTX_free(ctx_h);
 # endif
 		break;
 	case OPENSSL_AUTH_AS_CMAC:
+		ctx_c = get_local_cmac_ctx(sess, qp);
 # if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		ctx_c = EVP_MAC_CTX_dup(sess->auth.cmac.ctx);
 		status = process_openssl_auth_mac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_c);
 # else
-		ctx_c = CMAC_CTX_new();
-		CMAC_CTX_copy(ctx_c, sess->auth.cmac.ctx);
 		status = process_openssl_auth_cmac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_c);
-		CMAC_CTX_free(ctx_c);
 # endif
 		break;
 	default:
diff --git a/drivers/crypto/openssl/rte_openssl_pmd_ops.c b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
index 4209c6ab6f..1bbb855a59 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd_ops.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
@@ -805,7 +805,7 @@ openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 		unsigned int max_nb_qps = ((struct openssl_private *)
 				dev->data->dev_private)->max_nb_qpairs;
 		return sizeof(struct openssl_session) +
-				(sizeof(void *) * max_nb_qps);
+				(sizeof(struct evp_ctx_pair) * max_nb_qps);
 	}
 
 	/*
@@ -818,10 +818,11 @@ openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 
 	/*
 	 * Otherwise, the size of the flexible array member should be enough to
-	 * fit pointers to per-qp contexts.
+	 * fit pointers to per-qp contexts. This is twice the number of queue
+	 * pairs, to allow for auth and cipher contexts.
 	 */
 	return sizeof(struct openssl_session) +
-		(sizeof(void *) * dev->data->nb_queue_pairs);
+		(sizeof(struct evp_ctx_pair) * dev->data->nb_queue_pairs);
 }
 
 /** Returns the size of the asymmetric session structure */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 5/5] crypto/openssl: only set cipher padding once
  2024-06-06 10:20 ` [PATCH v3 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                     ` (3 preceding siblings ...)
  2024-06-06 10:20   ` [PATCH v3 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
@ 2024-06-06 10:20   ` Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-06 10:20 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Setting the cipher padding has a noticeable performance footprint,
and it does not need to be done for every call to
process_openssl_cipher_{en,de}crypt(). Once set, OpenSSL re-applies the
padding setting on every subsequent context re-init. Thus, for every
buffer after the first one, the padding is effectively being set twice.

Instead, just set the cipher padding once - when configuring the session
parameters - avoiding the unnecessary double setting behaviour. This is
skipped for AEAD ciphers, where disabling padding is not necessary.

Throughput performance uplift measurements for AES-CBC-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.97 |               3.72 |    25.2% |
|             256 |          8.10 |               9.42 |    16.3% |
|            1024 |         14.22 |              15.18 |     6.8% |
|            2048 |         16.28 |              16.93 |     4.0% |
|            4096 |         17.58 |              17.97 |     2.2% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         21.27 |              29.85 |    40.3% |
|             256 |         60.05 |              75.53 |    25.8% |
|            1024 |        110.11 |             121.56 |    10.4% |
|            2048 |        128.05 |             135.40 |     5.7% |
|            4096 |        139.45 |             143.76 |     3.1% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 1ee917da5c..264f00126d 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -619,6 +619,8 @@ openssl_set_session_cipher_parameters(struct openssl_session *sess,
 		return -ENOTSUP;
 	}
 
+	EVP_CIPHER_CTX_set_padding(sess->cipher.ctx, 0);
+
 	return 0;
 }
 
@@ -1124,8 +1126,6 @@ process_openssl_cipher_encrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	if (EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, iv) <= 0)
 		goto process_cipher_encrypt_err;
 
-	EVP_CIPHER_CTX_set_padding(ctx, 0);
-
 	if (process_openssl_encryption_update(mbuf_src, offset, &dst,
 			srclen, ctx, inplace))
 		goto process_cipher_encrypt_err;
@@ -1174,8 +1174,6 @@ process_openssl_cipher_decrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	if (EVP_DecryptInit_ex(ctx, NULL, NULL, NULL, iv) <= 0)
 		goto process_cipher_decrypt_err;
 
-	EVP_CIPHER_CTX_set_padding(ctx, 0);
-
 	if (process_openssl_decryption_update(mbuf_src, offset, &dst,
 			srclen, ctx, inplace))
 		goto process_cipher_decrypt_err;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [EXTERNAL] [PATCH v3 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs
  2024-06-06 10:20   ` [PATCH v3 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
@ 2024-06-06 10:44     ` Akhil Goyal
  0 siblings, 0 replies; 34+ messages in thread
From: Akhil Goyal @ 2024-06-06 10:44 UTC (permalink / raw)
  To: Jack Bond-Preston, Kai Ji, Fan Zhang; +Cc: dev, stable, Wathsala Vithanage

> Subject: [EXTERNAL] [PATCH v3 1/5] crypto/openssl: fix GCM and CCM thread
> unsafe ctxs
> 
> Commit 67ab783b5d70 ("crypto/openssl: use local copy for session
> contexts") introduced a fix for concurrency bugs which could occur when
> using one OpenSSL PMD session across multiple cores simultaneously. The
> solution was to clone the EVP contexts per-buffer to avoid them being
> used concurrently.
> 
> However, part of commit 75adf1eae44f ("crypto/openssl: update HMAC
> routine with 3.0 EVP API") reverted this fix, only for combined ops
> (AES-GCM and AES-CCM).
> 
> Fix the concurrency issue by cloning EVP contexts per-buffer. An extra
> workaround is required for OpenSSL versions which are >= 3.0.0, and
> <= 3.2.0. This is because, prior to OpenSSL 3.2.0, EVP_CIPHER_CTX_copy()
> is not implemented for AES-GCM or AES-CCM. When using these OpenSSL
> versions, create and initialise the context from scratch, per-buffer.
> 
> Throughput performance uplift measurements for AES-GCM-128 encrypt on
> Ampere Altra Max platform:
> 1 worker lcore
> |   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
> |-----------------+---------------+--------------------+----------|
> |              64 |          2.60 |               1.31 |   -49.5% |
> |             256 |          7.69 |               4.45 |   -42.1% |
> |            1024 |         15.33 |              11.30 |   -26.3% |
> |            2048 |         18.74 |              15.37 |   -18.0% |
> |            4096 |         21.11 |              18.80 |   -10.9% |
> 
> 8 worker lcores
> |   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
> |-----------------+---------------+--------------------+----------|
> |              64 |         19.94 |               2.83 |   -85.8% |
> |             256 |         58.84 |              11.00 |   -81.3% |
> |            1024 |        119.71 |              42.46 |   -64.5% |
> |            2048 |        147.69 |              80.91 |   -45.2% |
> |            4096 |        167.39 |             121.25 |   -27.6% |
> 
> Fixes: 75adf1eae44f ("crypto/openssl: update HMAC routine with 3.0 EVP API")
> Cc: stable@dpdk.org
> Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
> Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
Hi Kai,

Please review these optimizations.

Regards,
Akhil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 0/5] OpenSSL PMD Optimisations
  2024-06-03 16:01 [PATCH 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                   ` (7 preceding siblings ...)
  2024-06-06 10:20 ` [PATCH v3 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
@ 2024-06-07 12:47 ` Jack Bond-Preston
  2024-06-07 12:47   ` [PATCH v4 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
                     ` (4 more replies)
  2024-06-24 16:14 ` [PATCH 0/5] OpenSSL PMD Optimisations Ji, Kai
  9 siblings, 5 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-07 12:47 UTC (permalink / raw)
  Cc: dev

v2:
* Fixed missing * in patch 4 causing compilation failures.

v3:
* Work around a lack of support for duplicating EVP_CIPHER_CTXs for
  AES-GCM and AES-CCM in OpenSSL versions 3.0.0 <= v < 3.2.0.

v4:
* Work around a bug with re-initing EVP_MAC_CTXs in OpenSSL versions
  3.0.0 <= v < 3.0.3.
---

The current implementation of the OpenSSL PMD has numerous performance issues.
These revolve around certain operations being performed on a per buffer/packet
basis, when they in fact could be performed less often - usually just during
initialisation.


[1/5]: fix GCM and CCM thread unsafe ctxs
=========================================
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is
implemented in the same naive (and inefficient) way as existing fixes for other
ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
===========================================
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR.


[5/5]: only set cipher padding once
=====================================
Fixes an inefficient usage of the OpenSSL API when disabling padding for
ciphers. This behaviour was introduced in commit 6b283a03216e ("crypto/openssl:
fix extra bytes written at end of data"), which fixes a bug - however, the
EVP_CIPHER_CTX_set_padding() call was placed in a suboptimal location.

This patch fixes this, preventing the padding being disabled for the cipher
twice per buffer (with the second essentially being a wasteful no-op).


[3/5] and [4/5]: per-queue-pair context clones
==============================================
[3/5] and [4/5] aim to fix the key issue that was identified with the
performance of the OpenSSL PMD - cloning of OpenSSL CTX structures on a
per-buffer basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session contexts
>
>     Session contexts are used for temporary storage when processing a
>     packet.
>     If packets for the same session are to be processed simultaneously on
>     multiple cores, separate contexts must be used.
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no longer be defined as a
>     variable on the stack: it must be allocated. This in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be used
from multiple threads simultaneously, so this patch is required for correctness
(assuming the need to support using the same openssl_session across multiple
lcores). The downside here is that, as the commit message notes, this does
reduce performance quite significantly.

It is worth noting that while ciphers were already correctly cloned for cipher
ops and auth ops, this behaviour was actually absent for combined ops (AES-GCM
and AES-CCM), due to this part of the fix being reverted in 75adf1eae44f
("crypto/openssl: update HMAC routine with 3.0 EVP API"). [1/5] addressed this
issue of correctness, and [3/5] implements a more performant fix on top of this.

These two patches aim to remedy the performance loss caused by the introduction
of cipher context cloning. An approach of maintaining an array of pointers,
inside the OpenSSL session structure, to per-queue-pair clones of the OpenSSL
CTXs is used. Consequently, there is no need to perform cloning of the context
for every buffer - whilst keeping the guarantee that one context is not being
used on multiple lcores simultaneously. The cloning of the main context into the
array's per-qp context entries is performed lazily/as-needed. There are some
trade-offs/judgement calls that were made:
 - The first op a given queue pair processes for a given openssl_session costs
   roughly the same as an op under the existing implementation. However, all
   subsequent ops for the same openssl_session on the same queue pair do not
   incur this extra work. Thus, whilst the first op on a session on a queue
   pair is slower than subsequent ones, that slower first op is still only as
   costly as *every* op without these patches. The alternative would be
   pre-populating this array when the openssl_session is initialised, but that
   would waste memory and processing time if not all queue pairs end up doing
   work for this openssl_session.
 - Each pointer inside the array of per-queue-pair pointers has not been cache
   aligned, because updates only occur on the first buffer per-queue-pair
   per-session, making the impact of false sharing negligible compared to the
   extra memory usage of the alignment.

[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and [4/5]
for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).
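
For illustration, a minimal sketch of the lazy per-qp lookup described above
(hedged: it follows the naming used in [3/5] and [4/5], but omits the
OpenSSL-version-specific paths such as EVP_CIPHER_CTX_dup() and the
AES-GCM/CCM re-initialisation workaround, so it is not the exact code in the
patches):

static EVP_CIPHER_CTX *
get_qp_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
{
	/* Only one queue pair configured: the main context is never shared,
	 * so no per-qp copy is needed.
	 */
	if (sess->ctx_copies_len == 0)
		return sess->cipher.ctx;

	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id].cipher;

	if (*lctx == NULL) {
		/* First op for this session on this queue pair: clone the
		 * main context lazily, then reuse the clone for every
		 * subsequent op on this qp.
		 */
		*lctx = EVP_CIPHER_CTX_new();
		if (EVP_CIPHER_CTX_copy(*lctx, sess->cipher.ctx) != 1) {
			EVP_CIPHER_CTX_free(*lctx);
			*lctx = NULL;
		}
	}

	return *lctx;
}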

Compared to before, this approach comes with a drawback of extra memory usage -
the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, with a length
  equal to the number of qps in use multiplied by 2 (to allow auth and cipher
  contexts), per openssl_session structure. openssl_pmd_sym_session_get_size()
  is modified to return a size large enough to support this. At the time this
  function is called (before the user creates the session mempool), the PMD may
  not yet be configured with the requested number of queue pairs. In this case,
  the maximum number of queue pairs allowed by the PMD (current default is 8) is
  used, to ensure the allocations will be large enough. Thus, the user may be
  able to slightly reduce the memory used by OpenSSL sessions by first
  configuring the PMD's queue pair count, then requesting the size of the
  sessions and creating the session mempool. There is also a special case where
  the number of queue pairs is 1, in which case the array is not allocated or
  used at all. Overall, this memory usage by the session structure itself is
  worst-case 128 bytes per session (the default maximum number of queue pairs
  allowed by the OpenSSL PMD is 8, so 8qps * 8bytes * 2ctxs), plus the extra
  space to store the length of the array and auth context offset, resulting in
  an increase in total size from 152 bytes to 280 bytes.
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously, the
  clones were allocated and freed per-operation, meaning the lifetime of the
  allocations was only the duration of the operation. Now, these allocations are
  lifted out to share the lifetime of the session. This results in situations
  with many long-lived sessions shared across many queue pairs causing an
  increase in total memory usage.


Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
 - The version of OpenSSL used was 3.3.0
 - The hardware used for the benchmarks was the following two machine configs:
     * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
     * x86    : Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
 - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024, 2048,
   4096, 8192.
 - The worker lcore counts tested were: 1, 2, 4, 8
 - The algorithms and associated operations tested were:
     * Cipher-only       AES-CBC-128           (Encrypt and Decrypt)
     * Cipher-only       3DES-CTR-128          (Encrypt only)
     * Auth-only         SHA1-HMAC             (Generate only)
     * Auth-only         AES-CMAC              (Generate only)
     * AESNI             AES-GCM-128           (Encrypt and Decrypt)
     * Cipher-then-Auth  AES-CBC-128-HMAC-SHA1 (Encrypt only)
  - EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.

The sets of patches applied for benchmarks were:
 - No patches applied (HEAD of upstream main)
 -   [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
 - [1-2/5] applied (adds 3DES-CTR fix)
 - [1-3/5] applied (adds per-qp cipher contexts)
 - [1-4/5] applied (adds per-qp auth contexts)
 - [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm platform,
with all patches applied. Very similar results were achieved on the Intel
platform, and the full set of results, including the Intel ones, is available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput speedup achieved between the base (main branch
HEAD) and optimised (all patches applied) versions of the PMD was carried out,
with the varying worker lcore counts.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.84 |               2.04 |   144.6% |
|              64 |          1.61 |               3.72 |   131.3% |
|             128 |          2.97 |               6.24 |   110.2% |
|             256 |          5.14 |               9.42 |    83.2% |
|             512 |          8.10 |              12.62 |    55.7% |
|            1024 |         11.37 |              15.18 |    33.5% |
|            2048 |         14.26 |              16.93 |    18.7% |
|            4096 |         16.35 |              17.97 |     9.9% |
|            8192 |         17.61 |              18.51 |     5.1% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.53 |              16.49 |   974.8% |
|              64 |          3.04 |              29.85 |   881.3% |
|             128 |          5.96 |              50.07 |   739.8% |
|             256 |         10.54 |              75.53 |   616.5% |
|             512 |         21.60 |             101.14 |   368.2% |
|            1024 |         41.27 |             121.56 |   194.6% |
|            2048 |         72.99 |             135.40 |    85.5% |
|            4096 |        103.39 |             143.76 |    39.0% |
|            8192 |        125.48 |             148.06 |    18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so profiling of the existing
PMD implementation with multiple lcores was performed. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing this context. OpenSSL holds only one instance of each
EVP_CIPHER, and uses a reference counter to track freeing them. This means that
the original implementation spends a large proportion of its time incrementing
and decrementing this shared reference counter, in EVP_CIPHER_CTX_copy() and
EVP_CIPHER_CTX_free() respectively. For small buffer sizes, and with more
lcores, these refcount updates happen extremely frequently, thrashing the
shared counter across all lcores and causing a large slowdown. The optimised
version avoids this by not performing the copy and free (and thus the
associated refcount modifications) on every buffer.

SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.32 |               0.76 |   135.9% |
|              64 |          0.63 |               1.43 |   126.9% |
|             128 |          1.21 |               2.60 |   115.4% |
|             256 |          2.23 |               4.42 |    98.1% |
|             512 |          3.88 |               6.80 |    75.5% |
|            1024 |          6.13 |               9.30 |    51.8% |
|            2048 |          8.65 |              11.39 |    31.7% |
|            4096 |         10.90 |              12.85 |    17.9% |
|            8192 |         12.54 |              13.74 |     9.5% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.49 |               5.99 |  1110.3% |
|              64 |          0.98 |              11.30 |  1051.8% |
|             128 |          1.95 |              20.67 |   960.3% |
|             256 |          3.90 |              35.18 |   802.4% |
|             512 |          7.83 |              54.13 |   590.9% |
|            1024 |         15.80 |              74.11 |   369.2% |
|            2048 |         31.30 |              90.97 |   190.6% |
|            4096 |         58.59 |             102.70 |    75.3% |
|            8192 |         85.93 |             109.88 |    27.9% |

We can see that the results are similar to those for the AES-CBC-128 cipher operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] causes a slowdown in AES-GCM, as the fix for
the concurrency bug introduces a large overhead.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

However, applying [3/5] recovers most of this performance drop, as the
following results (taken with it applied) show.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.39 |               1.28 |    -7.8% |
|              64 |          2.60 |               2.44 |    -6.2% |
|             128 |          4.77 |               4.45 |    -6.8% |
|             256 |          7.69 |               7.22 |    -6.1% |
|             512 |         11.31 |              10.97 |    -3.0% |
|            1024 |         15.33 |              15.07 |    -1.7% |
|            2048 |         18.74 |              18.51 |    -1.2% |
|            4096 |         21.11 |              20.96 |    -0.7% |
|            8192 |         22.55 |              22.50 |    -0.2% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |         10.59 |              10.35 |    -2.3% |
|              64 |         19.94 |              19.46 |    -2.4% |
|             128 |         36.32 |              35.64 |    -1.9% |
|             256 |         58.84 |              57.80 |    -1.8% |
|             512 |         87.38 |              87.37 |    -0.0% |
|            1024 |        119.71 |             120.22 |     0.4% |
|            2048 |        147.69 |             147.93 |     0.2% |
|            4096 |        167.39 |             167.48 |     0.1% |
|            8192 |        179.80 |             179.87 |     0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small slowdown
at smaller buffer sizes. This represents the overhead required to make AES-GCM
thread safe. These patches have rectified this lack of safety without causing a
significant performance impact, especially compared to naive per-buffer cipher
context cloning.

3DES-CTR Encrypt
----------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.12 |               0.22 |    89.7% |
|              64 |          0.16 |               0.22 |    43.6% |
|             128 |          0.18 |               0.23 |    22.3% |
|             256 |          0.20 |               0.23 |    10.8% |
|             512 |          0.21 |               0.23 |     5.1% |
|            1024 |          0.22 |               0.23 |     2.7% |
|            2048 |          0.22 |               0.23 |     1.3% |
|            4096 |          0.23 |               0.23 |     0.4% |
|            8192 |          0.23 |               0.23 |     0.4% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.68 |               1.77 |   160.1% |
|              64 |          1.00 |               1.78 |    78.3% |
|             128 |          1.29 |               1.80 |    39.6% |
|             256 |          1.50 |               1.80 |    19.8% |
|             512 |          1.64 |               1.80 |    10.0% |
|            1024 |          1.72 |               1.81 |     5.1% |
|            2048 |          1.76 |               1.81 |     2.7% |
|            4096 |          1.78 |               1.81 |     1.5% |
|            8192 |          1.80 |               1.81 |     0.7% |

[2/5] yields good results - the performance increase is largest for smaller
buffer sizes, where the cost of re-initialising the extra parameters is more
significant relative to the cost of the cipher operation itself.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated
speedup tables, plus additional bar charts showing the throughput comparison
across different sets of applied patches) - for both Intel and Arm platforms -
are available. However, I'm not sure of the etiquette regarding attachments of
such files, so I haven't attached them for now. If you are interested in
reviewing them, please reach out and I will find a way to get them to you.

Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 348 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 +-
 4 files changed, 348 insertions(+), 87 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs
  2024-06-07 12:47 ` [PATCH v4 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
@ 2024-06-07 12:47   ` Jack Bond-Preston
  2024-06-07 12:47   ` [PATCH v4 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-07 12:47 UTC (permalink / raw)
  To: Kai Ji, Fan Zhang, Akhil Goyal; +Cc: dev, stable, Wathsala Vithanage

Commit 67ab783b5d70 ("crypto/openssl: use local copy for session
contexts") introduced a fix for concurrency bugs which could occur when
using one OpenSSL PMD session across multiple cores simultaneously. The
solution was to clone the EVP contexts per-buffer to avoid them being
used concurrently.

However, part of commit 75adf1eae44f ("crypto/openssl: update HMAC
routine with 3.0 EVP API") reverted this fix, only for combined ops
(AES-GCM and AES-CCM).

Fix the concurrency issue by cloning EVP contexts per-buffer. An extra
workaround is required for OpenSSL versions which are >= 3.0.0, and
<= 3.2.0. This is because, prior to OpenSSL 3.2.0, EVP_CIPHER_CTX_copy()
is not implemented for AES-GCM or AES-CCM. When using these OpenSSL
versions, create and initialise the context from scratch, per-buffer.

Throughput performance uplift measurements for AES-GCM-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

Fixes: 75adf1eae44f ("crypto/openssl: update HMAC routine with 3.0 EVP API")
Cc: stable@dpdk.org
Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 84 ++++++++++++++++++------
 1 file changed, 64 insertions(+), 20 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index e8cb09defc..3f7f4d8c37 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -350,7 +350,8 @@ get_aead_algo(enum rte_crypto_aead_algorithm sess_algo, size_t keylen,
 static int
 openssl_set_sess_aead_enc_param(struct openssl_session *sess,
 		enum rte_crypto_aead_algorithm algo,
-		uint8_t tag_len, const uint8_t *key)
+		uint8_t tag_len, const uint8_t *key,
+		EVP_CIPHER_CTX **ctx)
 {
 	int iv_type = 0;
 	unsigned int do_ccm;
@@ -378,7 +379,7 @@ openssl_set_sess_aead_enc_param(struct openssl_session *sess,
 	}
 
 	sess->cipher.mode = OPENSSL_CIPHER_LIB;
-	sess->cipher.ctx = EVP_CIPHER_CTX_new();
+	*ctx = EVP_CIPHER_CTX_new();
 
 	if (get_aead_algo(algo, sess->cipher.key.length,
 			&sess->cipher.evp_algo) != 0)
@@ -388,19 +389,19 @@ openssl_set_sess_aead_enc_param(struct openssl_session *sess,
 
 	sess->chain_order = OPENSSL_CHAIN_COMBINED;
 
-	if (EVP_EncryptInit_ex(sess->cipher.ctx, sess->cipher.evp_algo,
+	if (EVP_EncryptInit_ex(*ctx, sess->cipher.evp_algo,
 			NULL, NULL, NULL) <= 0)
 		return -EINVAL;
 
-	if (EVP_CIPHER_CTX_ctrl(sess->cipher.ctx, iv_type, sess->iv.length,
+	if (EVP_CIPHER_CTX_ctrl(*ctx, iv_type, sess->iv.length,
 			NULL) <= 0)
 		return -EINVAL;
 
 	if (do_ccm)
-		EVP_CIPHER_CTX_ctrl(sess->cipher.ctx, EVP_CTRL_CCM_SET_TAG,
+		EVP_CIPHER_CTX_ctrl(*ctx, EVP_CTRL_CCM_SET_TAG,
 				tag_len, NULL);
 
-	if (EVP_EncryptInit_ex(sess->cipher.ctx, NULL, NULL, key, NULL) <= 0)
+	if (EVP_EncryptInit_ex(*ctx, NULL, NULL, key, NULL) <= 0)
 		return -EINVAL;
 
 	return 0;
@@ -410,7 +411,8 @@ openssl_set_sess_aead_enc_param(struct openssl_session *sess,
 static int
 openssl_set_sess_aead_dec_param(struct openssl_session *sess,
 		enum rte_crypto_aead_algorithm algo,
-		uint8_t tag_len, const uint8_t *key)
+		uint8_t tag_len, const uint8_t *key,
+		EVP_CIPHER_CTX **ctx)
 {
 	int iv_type = 0;
 	unsigned int do_ccm = 0;
@@ -437,7 +439,7 @@ openssl_set_sess_aead_dec_param(struct openssl_session *sess,
 	}
 
 	sess->cipher.mode = OPENSSL_CIPHER_LIB;
-	sess->cipher.ctx = EVP_CIPHER_CTX_new();
+	*ctx = EVP_CIPHER_CTX_new();
 
 	if (get_aead_algo(algo, sess->cipher.key.length,
 			&sess->cipher.evp_algo) != 0)
@@ -447,24 +449,54 @@ openssl_set_sess_aead_dec_param(struct openssl_session *sess,
 
 	sess->chain_order = OPENSSL_CHAIN_COMBINED;
 
-	if (EVP_DecryptInit_ex(sess->cipher.ctx, sess->cipher.evp_algo,
+	if (EVP_DecryptInit_ex(*ctx, sess->cipher.evp_algo,
 			NULL, NULL, NULL) <= 0)
 		return -EINVAL;
 
-	if (EVP_CIPHER_CTX_ctrl(sess->cipher.ctx, iv_type,
+	if (EVP_CIPHER_CTX_ctrl(*ctx, iv_type,
 			sess->iv.length, NULL) <= 0)
 		return -EINVAL;
 
 	if (do_ccm)
-		EVP_CIPHER_CTX_ctrl(sess->cipher.ctx, EVP_CTRL_CCM_SET_TAG,
+		EVP_CIPHER_CTX_ctrl(*ctx, EVP_CTRL_CCM_SET_TAG,
 				tag_len, NULL);
 
-	if (EVP_DecryptInit_ex(sess->cipher.ctx, NULL, NULL, key, NULL) <= 0)
+	if (EVP_DecryptInit_ex(*ctx, NULL, NULL, key, NULL) <= 0)
 		return -EINVAL;
 
 	return 0;
 }
 
+static int openssl_aesni_ctx_clone(EVP_CIPHER_CTX **dest,
+		struct openssl_session *sess)
+{
+#if (OPENSSL_VERSION_NUMBER > 0x30200000L)
+	*dest = EVP_CIPHER_CTX_dup(sess->ctx);
+	return 0;
+#elif (OPENSSL_VERSION_NUMBER > 0x30000000L)
+	/* OpenSSL versions 3.0.0 <= V < 3.2.0 have no dupctx() implementation
+	 * for AES-GCM and AES-CCM. In this case, we have to create new empty
+	 * contexts and initialise, as we did the original context.
+	 */
+	if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC)
+		sess->aead_algo = RTE_CRYPTO_AEAD_AES_GCM;
+
+	if (sess->cipher.direction == RTE_CRYPTO_CIPHER_OP_ENCRYPT)
+		return openssl_set_sess_aead_enc_param(sess, sess->aead_algo,
+				sess->auth.digest_length, sess->cipher.key.data,
+				dest);
+	else
+		return openssl_set_sess_aead_dec_param(sess, sess->aead_algo,
+				sess->auth.digest_length, sess->cipher.key.data,
+				dest);
+#else
+	*dest = EVP_CIPHER_CTX_new();
+	if (EVP_CIPHER_CTX_copy(*dest, sess->cipher.ctx) != 1)
+		return -EINVAL;
+	return 0;
+#endif
+}
+
 /** Set session cipher parameters */
 static int
 openssl_set_session_cipher_parameters(struct openssl_session *sess,
@@ -623,12 +655,14 @@ openssl_set_session_auth_parameters(struct openssl_session *sess,
 			return openssl_set_sess_aead_enc_param(sess,
 						RTE_CRYPTO_AEAD_AES_GCM,
 						xform->auth.digest_length,
-						xform->auth.key.data);
+						xform->auth.key.data,
+						&sess->cipher.ctx);
 		else
 			return openssl_set_sess_aead_dec_param(sess,
 						RTE_CRYPTO_AEAD_AES_GCM,
 						xform->auth.digest_length,
-						xform->auth.key.data);
+						xform->auth.key.data,
+						&sess->cipher.ctx);
 		break;
 
 	case RTE_CRYPTO_AUTH_MD5:
@@ -770,10 +804,12 @@ openssl_set_session_aead_parameters(struct openssl_session *sess,
 	/* Select cipher direction */
 	if (xform->aead.op == RTE_CRYPTO_AEAD_OP_ENCRYPT)
 		return openssl_set_sess_aead_enc_param(sess, xform->aead.algo,
-				xform->aead.digest_length, xform->aead.key.data);
+				xform->aead.digest_length, xform->aead.key.data,
+				&sess->cipher.ctx);
 	else
 		return openssl_set_sess_aead_dec_param(sess, xform->aead.algo,
-				xform->aead.digest_length, xform->aead.key.data);
+				xform->aead.digest_length, xform->aead.key.data,
+				&sess->cipher.ctx);
 }
 
 /** Parse crypto xform chain and set private session parameters */
@@ -1590,6 +1626,12 @@ process_openssl_combined_op
 		return;
 	}
 
+	EVP_CIPHER_CTX *ctx;
+	if (openssl_aesni_ctx_clone(&ctx, sess) != 0) {
+		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
+		return;
+	}
+
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
 	if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC) {
@@ -1623,12 +1665,12 @@ process_openssl_combined_op
 			status = process_openssl_auth_encryption_gcm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, sess->cipher.ctx);
+					dst, tag, ctx);
 		else
 			status = process_openssl_auth_encryption_ccm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, taglen, sess->cipher.ctx);
+					dst, tag, taglen, ctx);
 
 	} else {
 		if (sess->auth.algo == RTE_CRYPTO_AUTH_AES_GMAC ||
@@ -1636,14 +1678,16 @@ process_openssl_combined_op
 			status = process_openssl_auth_decryption_gcm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, sess->cipher.ctx);
+					dst, tag, ctx);
 		else
 			status = process_openssl_auth_decryption_ccm(
 					mbuf_src, offset, srclen,
 					aad, aadlen, iv,
-					dst, tag, taglen, sess->cipher.ctx);
+					dst, tag, taglen, ctx);
 	}
 
+	EVP_CIPHER_CTX_free(ctx);
+
 	if (status != 0) {
 		if (status == (-EFAULT) &&
 				sess->auth.operation ==
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 2/5] crypto/openssl: only init 3DES-CTR key + impl once
  2024-06-07 12:47 ` [PATCH v4 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
  2024-06-07 12:47   ` [PATCH v4 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
@ 2024-06-07 12:47   ` Jack Bond-Preston
  2024-06-07 12:47   ` [PATCH v4 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-07 12:47 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently the 3DES-CTR cipher context is initialised for every buffer,
setting the cipher implementation and key - even though for every
buffer in the session these values will be the same.

Change to initialising the cipher context once, before any buffers are
processed, instead.

Throughput performance uplift measurements for 3DES-CTR encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.16 |               0.21 |    35.3% |
|             256 |          0.20 |               0.22 |     9.4% |
|            1024 |          0.22 |               0.23 |     2.3% |
|            2048 |          0.22 |               0.23 |     0.9% |
|            4096 |          0.22 |               0.23 |     0.9% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          1.01 |               1.34 |    32.9% |
|             256 |          1.51 |               1.66 |     9.9% |
|            1024 |          1.72 |               1.77 |     2.6% |
|            2048 |          1.76 |               1.78 |     1.1% |
|            4096 |          1.79 |               1.80 |     0.6% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 3f7f4d8c37..70f2069985 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -553,6 +553,15 @@ openssl_set_session_cipher_parameters(struct openssl_session *sess,
 				sess->cipher.key.length,
 				sess->cipher.key.data) != 0)
 			return -EINVAL;
+
+
+		/* We use 3DES encryption also for decryption.
+		 * IV is not important for 3DES ECB.
+		 */
+		if (EVP_EncryptInit_ex(sess->cipher.ctx, EVP_des_ede3_ecb(),
+				NULL, sess->cipher.key.data,  NULL) != 1)
+			return -EINVAL;
+
 		break;
 
 	case RTE_CRYPTO_CIPHER_DES_CBC:
@@ -1172,8 +1181,7 @@ process_openssl_cipher_decrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 /** Process cipher des 3 ctr encryption, decryption algorithm */
 static int
 process_openssl_cipher_des3ctr(struct rte_mbuf *mbuf_src, uint8_t *dst,
-		int offset, uint8_t *iv, uint8_t *key, int srclen,
-		EVP_CIPHER_CTX *ctx)
+		int offset, uint8_t *iv, int srclen, EVP_CIPHER_CTX *ctx)
 {
 	uint8_t ebuf[8], ctr[8];
 	int unused, n;
@@ -1191,12 +1199,6 @@ process_openssl_cipher_des3ctr(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	src = rte_pktmbuf_mtod_offset(m, uint8_t *, offset);
 	l = rte_pktmbuf_data_len(m) - offset;
 
-	/* We use 3DES encryption also for decryption.
-	 * IV is not important for 3DES ecb
-	 */
-	if (EVP_EncryptInit_ex(ctx, EVP_des_ede3_ecb(), NULL, key, NULL) <= 0)
-		goto process_cipher_des3ctr_err;
-
 	memcpy(ctr, iv, 8);
 
 	for (n = 0; n < srclen; n++) {
@@ -1740,8 +1742,7 @@ process_openssl_cipher_op
 					srclen, ctx_copy, inplace);
 	else
 		status = process_openssl_cipher_des3ctr(mbuf_src, dst,
-				op->sym->cipher.data.offset, iv,
-				sess->cipher.key.data, srclen,
+				op->sym->cipher.data.offset, iv, srclen,
 				ctx_copy);
 
 	EVP_CIPHER_CTX_free(ctx_copy);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 3/5] crypto/openssl: per-qp cipher context clones
  2024-06-07 12:47 ` [PATCH v4 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
  2024-06-07 12:47   ` [PATCH v4 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
  2024-06-07 12:47   ` [PATCH v4 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
@ 2024-06-07 12:47   ` Jack Bond-Preston
  2024-06-07 12:47   ` [PATCH v4 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
  2024-06-07 12:47   ` [PATCH v4 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-07 12:47 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently EVP_CIPHER_CTXs are allocated, copied to (from
openssl_session), and then freed for every cipher operation (i.e. per
packet). This is very inefficient, and avoidable.

Make each openssl_session hold an array of pointers to per-queue-pair
cipher context copies. These are populated on first use by allocating a
new context and copying from the main context. These copies can then be
used in a thread-safe manner by different worker lcores simultaneously.
Consequently the cipher context allocation and copy only has to happen
once - the first time a given qp uses an openssl_session. This brings
about a large performance boost.

Throughput performance uplift measurements for AES-CBC-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          1.51 |               2.94 |    94.4% |
|             256 |          4.90 |               8.05 |    64.3% |
|            1024 |         11.07 |              14.21 |    28.3% |
|            2048 |         14.03 |              16.28 |    16.0% |
|            4096 |         16.20 |              17.59 |     8.6% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          3.05 |              23.74 |   678.8% |
|             256 |         10.46 |              64.86 |   520.3% |
|            1024 |         40.97 |             113.80 |   177.7% |
|            2048 |         73.25 |             130.21 |    77.8% |
|            4096 |        103.89 |             140.62 |    35.4% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/openssl_pmd_private.h |  11 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 105 ++++++++++++-------
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  34 +++++-
 3 files changed, 108 insertions(+), 42 deletions(-)

diff --git a/drivers/crypto/openssl/openssl_pmd_private.h b/drivers/crypto/openssl/openssl_pmd_private.h
index 0f038b218c..bad7dcf2f5 100644
--- a/drivers/crypto/openssl/openssl_pmd_private.h
+++ b/drivers/crypto/openssl/openssl_pmd_private.h
@@ -166,6 +166,14 @@ struct __rte_cache_aligned openssl_session {
 		/**< digest length */
 	} auth;
 
+	uint16_t ctx_copies_len;
+	/* < number of entries in ctx_copies */
+	EVP_CIPHER_CTX *qp_ctx[];
+	/**< Flexible array member of per-queue-pair pointers to copies of EVP
+	 * context structure. Cipher contexts are not safe to use from multiple
+	 * cores simultaneously, so maintaining these copies allows avoiding
+	 * per-buffer copying into a temporary context.
+	 */
 };
 
 /** OPENSSL crypto private asymmetric session structure */
@@ -217,7 +225,8 @@ struct __rte_cache_aligned openssl_asym_session {
 /** Set and validate OPENSSL crypto session parameters */
 extern int
 openssl_set_session_parameters(struct openssl_session *sess,
-		const struct rte_crypto_sym_xform *xform);
+		const struct rte_crypto_sym_xform *xform,
+		uint16_t nb_queue_pairs);
 
 /** Reset OPENSSL crypto session parameters */
 extern void
diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 70f2069985..df44cc097e 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -467,13 +467,10 @@ openssl_set_sess_aead_dec_param(struct openssl_session *sess,
 	return 0;
 }
 
+#if (OPENSSL_VERSION_NUMBER >= 0x30000000L && OPENSSL_VERSION_NUMBER < 0x30200000L)
 static int openssl_aesni_ctx_clone(EVP_CIPHER_CTX **dest,
 		struct openssl_session *sess)
 {
-#if (OPENSSL_VERSION_NUMBER > 0x30200000L)
-	*dest = EVP_CIPHER_CTX_dup(sess->ctx);
-	return 0;
-#elif (OPENSSL_VERSION_NUMBER > 0x30000000L)
 	/* OpenSSL versions 3.0.0 <= V < 3.2.0 have no dupctx() implementation
 	 * for AES-GCM and AES-CCM. In this case, we have to create new empty
 	 * contexts and initialise, as we did the original context.
@@ -489,13 +486,8 @@ static int openssl_aesni_ctx_clone(EVP_CIPHER_CTX **dest,
 		return openssl_set_sess_aead_dec_param(sess, sess->aead_algo,
 				sess->auth.digest_length, sess->cipher.key.data,
 				dest);
-#else
-	*dest = EVP_CIPHER_CTX_new();
-	if (EVP_CIPHER_CTX_copy(*dest, sess->cipher.ctx) != 1)
-		return -EINVAL;
-	return 0;
-#endif
 }
+#endif
 
 /** Set session cipher parameters */
 static int
@@ -824,7 +816,8 @@ openssl_set_session_aead_parameters(struct openssl_session *sess,
 /** Parse crypto xform chain and set private session parameters */
 int
 openssl_set_session_parameters(struct openssl_session *sess,
-		const struct rte_crypto_sym_xform *xform)
+		const struct rte_crypto_sym_xform *xform,
+		uint16_t nb_queue_pairs)
 {
 	const struct rte_crypto_sym_xform *cipher_xform = NULL;
 	const struct rte_crypto_sym_xform *auth_xform = NULL;
@@ -886,6 +879,12 @@ openssl_set_session_parameters(struct openssl_session *sess,
 		}
 	}
 
+	/*
+	 * With only one queue pair, the array of copies is not needed.
+	 * Otherwise, one entry per queue pair is required.
+	 */
+	sess->ctx_copies_len = nb_queue_pairs > 1 ? nb_queue_pairs : 0;
+
 	return 0;
 }
 
@@ -893,6 +892,13 @@ openssl_set_session_parameters(struct openssl_session *sess,
 void
 openssl_reset_session(struct openssl_session *sess)
 {
+	for (uint16_t i = 0; i < sess->ctx_copies_len; i++) {
+		if (sess->qp_ctx[i] != NULL) {
+			EVP_CIPHER_CTX_free(sess->qp_ctx[i]);
+			sess->qp_ctx[i] = NULL;
+		}
+	}
+
 	EVP_CIPHER_CTX_free(sess->cipher.ctx);
 
 	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
@@ -959,7 +965,7 @@ get_session(struct openssl_qp *qp, struct rte_crypto_op *op)
 		sess = (struct openssl_session *)_sess->driver_priv_data;
 
 		if (unlikely(openssl_set_session_parameters(sess,
-				op->sym->xform) != 0)) {
+				op->sym->xform, 1) != 0)) {
 			rte_mempool_put(qp->sess_mp, _sess);
 			sess = NULL;
 		}
@@ -1607,11 +1613,45 @@ process_openssl_auth_cmac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 # endif
 /*----------------------------------------------------------------------------*/
 
+static inline EVP_CIPHER_CTX *
+get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	/* If the array is not being used, just return the main context. */
+	if (sess->ctx_copies_len == 0)
+		return sess->cipher.ctx;
+
+	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id];
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30200000L
+		/* EVP_CIPHER_CTX_dup() added in OSSL 3.2 */
+		*lctx = EVP_CIPHER_CTX_dup(sess->cipher.ctx);
+		return *lctx;
+#elif OPENSSL_VERSION_NUMBER >= 0x30000000L
+		if (sess->chain_order == OPENSSL_CHAIN_COMBINED) {
+			/* AESNI special-cased to use openssl_aesni_ctx_clone()
+			 * to allow for working around lack of
+			 * EVP_CIPHER_CTX_copy support for 3.0.0 <= OSSL Version
+			 * < 3.2.0.
+			 */
+			if (openssl_aesni_ctx_clone(lctx, sess) != 0)
+				*lctx = NULL;
+			return *lctx;
+		}
+#endif
+
+		*lctx = EVP_CIPHER_CTX_new();
+		EVP_CIPHER_CTX_copy(*lctx, sess->cipher.ctx);
+	}
+
+	return *lctx;
+}
+
 /** Process auth/cipher combined operation */
 static void
-process_openssl_combined_op
-		(struct rte_crypto_op *op, struct openssl_session *sess,
-		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+process_openssl_combined_op(struct openssl_qp *qp, struct rte_crypto_op *op,
+		struct openssl_session *sess, struct rte_mbuf *mbuf_src,
+		struct rte_mbuf *mbuf_dst)
 {
 	/* cipher */
 	uint8_t *dst = NULL, *iv, *tag, *aad;
@@ -1628,11 +1668,7 @@ process_openssl_combined_op
 		return;
 	}
 
-	EVP_CIPHER_CTX *ctx;
-	if (openssl_aesni_ctx_clone(&ctx, sess) != 0) {
-		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
-		return;
-	}
+	EVP_CIPHER_CTX *ctx = get_local_cipher_ctx(sess, qp);
 
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
@@ -1688,8 +1724,6 @@ process_openssl_combined_op
 					dst, tag, taglen, ctx);
 	}
 
-	EVP_CIPHER_CTX_free(ctx);
-
 	if (status != 0) {
 		if (status == (-EFAULT) &&
 				sess->auth.operation ==
@@ -1702,14 +1736,13 @@ process_openssl_combined_op
 
 /** Process cipher operation */
 static void
-process_openssl_cipher_op
-		(struct rte_crypto_op *op, struct openssl_session *sess,
-		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+process_openssl_cipher_op(struct openssl_qp *qp, struct rte_crypto_op *op,
+		struct openssl_session *sess, struct rte_mbuf *mbuf_src,
+		struct rte_mbuf *mbuf_dst)
 {
 	uint8_t *dst, *iv;
 	int srclen, status;
 	uint8_t inplace = (mbuf_src == mbuf_dst) ? 1 : 0;
-	EVP_CIPHER_CTX *ctx_copy;
 
 	/*
 	 * Segmented OOP destination buffer is not supported for encryption/
@@ -1728,24 +1761,22 @@ process_openssl_cipher_op
 
 	iv = rte_crypto_op_ctod_offset(op, uint8_t *,
 			sess->iv.offset);
-	ctx_copy = EVP_CIPHER_CTX_new();
-	EVP_CIPHER_CTX_copy(ctx_copy, sess->cipher.ctx);
+
+	EVP_CIPHER_CTX *ctx = get_local_cipher_ctx(sess, qp);
 
 	if (sess->cipher.mode == OPENSSL_CIPHER_LIB)
 		if (sess->cipher.direction == RTE_CRYPTO_CIPHER_OP_ENCRYPT)
 			status = process_openssl_cipher_encrypt(mbuf_src, dst,
 					op->sym->cipher.data.offset, iv,
-					srclen, ctx_copy, inplace);
+					srclen, ctx, inplace);
 		else
 			status = process_openssl_cipher_decrypt(mbuf_src, dst,
 					op->sym->cipher.data.offset, iv,
-					srclen, ctx_copy, inplace);
+					srclen, ctx, inplace);
 	else
 		status = process_openssl_cipher_des3ctr(mbuf_src, dst,
-				op->sym->cipher.data.offset, iv, srclen,
-				ctx_copy);
+				op->sym->cipher.data.offset, iv, srclen, ctx);
 
-	EVP_CIPHER_CTX_free(ctx_copy);
 	if (status != 0)
 		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
 }
@@ -3150,13 +3181,13 @@ process_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 
 	switch (sess->chain_order) {
 	case OPENSSL_CHAIN_ONLY_CIPHER:
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_ONLY_AUTH:
 		process_openssl_auth_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_CIPHER_AUTH:
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		/* OOP */
 		if (msrc != mdst)
 			copy_plaintext(msrc, mdst, op);
@@ -3164,10 +3195,10 @@ process_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 		break;
 	case OPENSSL_CHAIN_AUTH_CIPHER:
 		process_openssl_auth_op(qp, op, sess, msrc, mdst);
-		process_openssl_cipher_op(op, sess, msrc, mdst);
+		process_openssl_cipher_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_COMBINED:
-		process_openssl_combined_op(op, sess, msrc, mdst);
+		process_openssl_combined_op(qp, op, sess, msrc, mdst);
 		break;
 	case OPENSSL_CHAIN_CIPHER_BPI:
 		process_openssl_docsis_bpi_op(op, sess, msrc, mdst);
diff --git a/drivers/crypto/openssl/rte_openssl_pmd_ops.c b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
index b16baaa08f..4209c6ab6f 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd_ops.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
@@ -794,9 +794,34 @@ openssl_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
 
 /** Returns the size of the symmetric session structure */
 static unsigned
-openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev __rte_unused)
+openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 {
-	return sizeof(struct openssl_session);
+	/*
+	 * For 0 qps, return the max size of the session - this is necessary if
+	 * the user calls into this function to create the session mempool,
+	 * without first configuring the number of qps for the cryptodev.
+	 */
+	if (dev->data->nb_queue_pairs == 0) {
+		unsigned int max_nb_qps = ((struct openssl_private *)
+				dev->data->dev_private)->max_nb_qpairs;
+		return sizeof(struct openssl_session) +
+				(sizeof(void *) * max_nb_qps);
+	}
+
+	/*
+	 * With only one queue pair, the thread safety of multiple context
+	 * copies is not necessary, so don't allocate extra memory for the
+	 * array.
+	 */
+	if (dev->data->nb_queue_pairs == 1)
+		return sizeof(struct openssl_session);
+
+	/*
+	 * Otherwise, the size of the flexible array member should be enough to
+	 * fit pointers to per-qp contexts.
+	 */
+	return sizeof(struct openssl_session) +
+		(sizeof(void *) * dev->data->nb_queue_pairs);
 }
 
 /** Returns the size of the asymmetric session structure */
@@ -808,7 +833,7 @@ openssl_pmd_asym_session_get_size(struct rte_cryptodev *dev __rte_unused)
 
 /** Configure the session from a crypto xform chain */
 static int
-openssl_pmd_sym_session_configure(struct rte_cryptodev *dev __rte_unused,
+openssl_pmd_sym_session_configure(struct rte_cryptodev *dev,
 		struct rte_crypto_sym_xform *xform,
 		struct rte_cryptodev_sym_session *sess)
 {
@@ -820,7 +845,8 @@ openssl_pmd_sym_session_configure(struct rte_cryptodev *dev __rte_unused,
 		return -EINVAL;
 	}
 
-	ret = openssl_set_session_parameters(sess_private_data, xform);
+	ret = openssl_set_session_parameters(sess_private_data, xform,
+			dev->data->nb_queue_pairs);
 	if (ret != 0) {
 		OPENSSL_LOG(ERR, "failed configure session parameters");
 
-- 
2.34.1



* [PATCH v4 4/5] crypto/openssl: per-qp auth context clones
  2024-06-07 12:47 ` [PATCH v4 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                     ` (2 preceding siblings ...)
  2024-06-07 12:47   ` [PATCH v4 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
@ 2024-06-07 12:47   ` Jack Bond-Preston
  2024-06-07 12:47   ` [PATCH v4 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-07 12:47 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Currently EVP auth ctxs (e.g. EVP_MD_CTX, EVP_MAC_CTX) are allocated,
copied to (from openssl_session), and then freed for every auth
operation (i.e. per packet). This is very inefficient, and avoidable.

Make each openssl_session hold an array of structures, containing
pointers to per-queue-pair cipher and auth context copies. These are
populated on first use by allocating a new context and copying from the
main context. These copies can then be used in a thread-safe manner by
different worker lcores simultaneously. Consequently the auth context
allocation and copy only has to happen once - the first time a given qp
uses an openssl_session. This brings about a large performance boost.
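
For reference, the lazy lookup this patch adds boils down to the
following - a condensed sketch of get_local_auth_ctx() from the diff
below, with error handling and the EVP_MD_CTX_dup() path used on
OpenSSL >= 3.1 omitted; the HMAC and CMAC variants follow the same
pattern with extra OpenSSL version handling:

static inline EVP_MD_CTX *
get_local_auth_ctx(struct openssl_session *sess, struct openssl_qp *qp)
{
	/* Single-qp devices don't allocate the per-qp array at all. */
	if (sess->ctx_copies_len == 0)
		return sess->auth.auth.ctx;

	EVP_MD_CTX **lctx = &sess->qp_ctx[qp->id].auth;

	/* First op from this session on this qp: clone the main ctx once. */
	if (unlikely(*lctx == NULL)) {
		*lctx = EVP_MD_CTX_new();
		EVP_MD_CTX_copy(*lctx, sess->auth.auth.ctx);
	}

	return *lctx;
}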

Throughput performance uplift measurements for HMAC-SHA1 generate on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.63 |               1.42 |   123.5% |
|             256 |          2.24 |               4.40 |    96.4% |
|            1024 |          6.15 |               9.26 |    50.6% |
|            2048 |          8.68 |              11.38 |    31.1% |
|            4096 |         10.92 |              12.84 |    17.6% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          0.93 |              11.35 |  1122.5% |
|             256 |          3.70 |              35.30 |   853.7% |
|            1024 |         15.22 |              74.27 |   387.8% |
|            2048 |         30.20 |              91.08 |   201.6% |
|            4096 |         56.92 |             102.76 |    80.5% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/compat.h              |  26 +++
 drivers/crypto/openssl/openssl_pmd_private.h |  25 ++-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 176 +++++++++++++++----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |   7 +-
 4 files changed, 193 insertions(+), 41 deletions(-)

diff --git a/drivers/crypto/openssl/compat.h b/drivers/crypto/openssl/compat.h
index 9f9167c4f1..4c5ddfbf3a 100644
--- a/drivers/crypto/openssl/compat.h
+++ b/drivers/crypto/openssl/compat.h
@@ -5,6 +5,32 @@
 #ifndef __RTA_COMPAT_H__
 #define __RTA_COMPAT_H__
 
+#if OPENSSL_VERSION_NUMBER > 0x30000000L
+static __rte_always_inline void
+free_hmac_ctx(EVP_MAC_CTX *ctx)
+{
+	EVP_MAC_CTX_free(ctx);
+}
+
+static __rte_always_inline void
+free_cmac_ctx(EVP_MAC_CTX *ctx)
+{
+	EVP_MAC_CTX_free(ctx);
+}
+#else
+static __rte_always_inline void
+free_hmac_ctx(HMAC_CTX *ctx)
+{
+	HMAC_CTX_free(ctx);
+}
+
+static __rte_always_inline void
+free_cmac_ctx(CMAC_CTX *ctx)
+{
+	CMAC_CTX_free(ctx);
+}
+#endif
+
 #if (OPENSSL_VERSION_NUMBER < 0x10100000L)
 
 static __rte_always_inline int
diff --git a/drivers/crypto/openssl/openssl_pmd_private.h b/drivers/crypto/openssl/openssl_pmd_private.h
index bad7dcf2f5..a50e4d4918 100644
--- a/drivers/crypto/openssl/openssl_pmd_private.h
+++ b/drivers/crypto/openssl/openssl_pmd_private.h
@@ -80,6 +80,20 @@ struct __rte_cache_aligned openssl_qp {
 	 */
 };
 
+struct evp_ctx_pair {
+	EVP_CIPHER_CTX *cipher;
+	union {
+		EVP_MD_CTX *auth;
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		EVP_MAC_CTX *hmac;
+		EVP_MAC_CTX *cmac;
+#else
+		HMAC_CTX *hmac;
+		CMAC_CTX *cmac;
+#endif
+	};
+};
+
 /** OPENSSL crypto private session structure */
 struct __rte_cache_aligned openssl_session {
 	enum openssl_chain_order chain_order;
@@ -168,11 +182,12 @@ struct __rte_cache_aligned openssl_session {
 
 	uint16_t ctx_copies_len;
 	/* < number of entries in ctx_copies */
-	EVP_CIPHER_CTX *qp_ctx[];
-	/**< Flexible array member of per-queue-pair pointers to copies of EVP
-	 * context structure. Cipher contexts are not safe to use from multiple
-	 * cores simultaneously, so maintaining these copies allows avoiding
-	 * per-buffer copying into a temporary context.
+	struct evp_ctx_pair qp_ctx[];
+	/**< Flexible array member of per-queue-pair structures, each containing
+	 * pointers to copies of the cipher and auth EVP contexts. Cipher
+	 * contexts are not safe to use from multiple cores simultaneously, so
+	 * maintaining these copies allows avoiding per-buffer copying into a
+	 * temporary context.
 	 */
 };
 
diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index df44cc097e..7e2e505222 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -892,40 +892,45 @@ openssl_set_session_parameters(struct openssl_session *sess,
 void
 openssl_reset_session(struct openssl_session *sess)
 {
+	/* Free all the qp_ctx entries. */
 	for (uint16_t i = 0; i < sess->ctx_copies_len; i++) {
-		if (sess->qp_ctx[i] != NULL) {
-			EVP_CIPHER_CTX_free(sess->qp_ctx[i]);
-			sess->qp_ctx[i] = NULL;
+		if (sess->qp_ctx[i].cipher != NULL) {
+			EVP_CIPHER_CTX_free(sess->qp_ctx[i].cipher);
+			sess->qp_ctx[i].cipher = NULL;
+		}
+
+		switch (sess->auth.mode) {
+		case OPENSSL_AUTH_AS_AUTH:
+			EVP_MD_CTX_destroy(sess->qp_ctx[i].auth);
+			sess->qp_ctx[i].auth = NULL;
+			break;
+		case OPENSSL_AUTH_AS_HMAC:
+			free_hmac_ctx(sess->qp_ctx[i].hmac);
+			sess->qp_ctx[i].hmac = NULL;
+			break;
+		case OPENSSL_AUTH_AS_CMAC:
+			free_cmac_ctx(sess->qp_ctx[i].cmac);
+			sess->qp_ctx[i].cmac = NULL;
+			break;
 		}
 	}
 
 	EVP_CIPHER_CTX_free(sess->cipher.ctx);
 
-	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
-		EVP_CIPHER_CTX_free(sess->cipher.bpi_ctx);
-
 	switch (sess->auth.mode) {
 	case OPENSSL_AUTH_AS_AUTH:
 		EVP_MD_CTX_destroy(sess->auth.auth.ctx);
 		break;
 	case OPENSSL_AUTH_AS_HMAC:
-		EVP_PKEY_free(sess->auth.hmac.pkey);
-# if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		EVP_MAC_CTX_free(sess->auth.hmac.ctx);
-# else
-		HMAC_CTX_free(sess->auth.hmac.ctx);
-# endif
+		free_hmac_ctx(sess->auth.hmac.ctx);
 		break;
 	case OPENSSL_AUTH_AS_CMAC:
-# if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		EVP_MAC_CTX_free(sess->auth.cmac.ctx);
-# else
-		CMAC_CTX_free(sess->auth.cmac.ctx);
-# endif
-		break;
-	default:
+		free_cmac_ctx(sess->auth.cmac.ctx);
 		break;
 	}
+
+	if (sess->chain_order == OPENSSL_CHAIN_CIPHER_BPI)
+		EVP_CIPHER_CTX_free(sess->cipher.bpi_ctx);
 }
 
 /** Provide session for operation */
@@ -1471,6 +1476,9 @@ process_openssl_auth_mac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 	if (m == 0)
 		goto process_auth_err;
 
+	if (EVP_MAC_init(ctx, NULL, 0, NULL) <= 0)
+		goto process_auth_err;
+
 	src = rte_pktmbuf_mtod_offset(m, uint8_t *, offset);
 
 	l = rte_pktmbuf_data_len(m) - offset;
@@ -1497,11 +1505,9 @@ process_openssl_auth_mac(struct rte_mbuf *mbuf_src, uint8_t *dst, int offset,
 	if (EVP_MAC_final(ctx, dst, &dstlen, DIGEST_LENGTH_MAX) != 1)
 		goto process_auth_err;
 
-	EVP_MAC_CTX_free(ctx);
 	return 0;
 
 process_auth_err:
-	EVP_MAC_CTX_free(ctx);
 	OPENSSL_LOG(ERR, "Process openssl auth failed");
 	return -EINVAL;
 }
@@ -1620,7 +1626,7 @@ get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
 	if (sess->ctx_copies_len == 0)
 		return sess->cipher.ctx;
 
-	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id];
+	EVP_CIPHER_CTX **lctx = &sess->qp_ctx[qp->id].cipher;
 
 	if (unlikely(*lctx == NULL)) {
 #if OPENSSL_VERSION_NUMBER >= 0x30200000L
@@ -1647,6 +1653,112 @@ get_local_cipher_ctx(struct openssl_session *sess, struct openssl_qp *qp)
 	return *lctx;
 }
 
+static inline EVP_MD_CTX *
+get_local_auth_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+	/* If the array is not being used, just return the main context. */
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.auth.ctx;
+
+	EVP_MD_CTX **lctx = &sess->qp_ctx[qp->id].auth;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30100000L
+		/* EVP_MD_CTX_dup() added in OSSL 3.1 */
+		*lctx = EVP_MD_CTX_dup(sess->auth.auth.ctx);
+#else
+		*lctx = EVP_MD_CTX_new();
+		EVP_MD_CTX_copy(*lctx, sess->auth.auth.ctx);
+#endif
+	}
+
+	return *lctx;
+}
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+static inline EVP_MAC_CTX *
+#else
+static inline HMAC_CTX *
+#endif
+get_local_hmac_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+#if (OPENSSL_VERSION_NUMBER >= 0x30000000L && OPENSSL_VERSION_NUMBER < 0x30003000L)
+	/* For OpenSSL versions 3.0.0 <= v < 3.0.3, re-initing of
+	 * EVP_MAC_CTXs is broken, and doesn't actually reset their
+	 * state. This was fixed in OSSL commit c9ddc5af5199 ("Avoid
+	 * undefined behavior of provided macs on EVP_MAC
+	 * reinitialization"). In cases where the fix is not present,
+	 * fall back to duplicating the context every buffer as a
+	 * workaround, at the cost of performance.
+	 */
+	RTE_SET_USED(qp);
+	return EVP_MAC_CTX_dup(sess->auth.hmac.ctx);
+#else
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.hmac.ctx;
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+	EVP_MAC_CTX **lctx =
+#else
+	HMAC_CTX **lctx =
+#endif
+		&sess->qp_ctx[qp->id].hmac;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		*lctx = EVP_MAC_CTX_dup(sess->auth.hmac.ctx);
+#else
+		*lctx = HMAC_CTX_new();
+		HMAC_CTX_copy(*lctx, sess->auth.hmac.ctx);
+#endif
+	}
+
+	return *lctx;
+#endif
+}
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+static inline EVP_MAC_CTX *
+#else
+static inline CMAC_CTX *
+#endif
+get_local_cmac_ctx(struct openssl_session *sess, struct openssl_qp *qp)
+{
+#if (OPENSSL_VERSION_NUMBER >= 0x30000000L && OPENSSL_VERSION_NUMBER < 0x30003000L)
+	/* For OpenSSL versions 3.0.0 <= v < 3.0.3, re-initing of
+	 * EVP_MAC_CTXs is broken, and doesn't actually reset their
+	 * state. This was fixed in OSSL commit c9ddc5af5199 ("Avoid
+	 * undefined behavior of provided macs on EVP_MAC
+	 * reinitialization"). In cases where the fix is not present,
+	 * fall back to duplicating the context every buffer as a
+	 * workaround, at the cost of performance.
+	 */
+	RTE_SET_USED(qp);
+	return EVP_MAC_CTX_dup(sess->auth.cmac.ctx);
+#else
+	if (sess->ctx_copies_len == 0)
+		return sess->auth.cmac.ctx;
+
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+	EVP_MAC_CTX **lctx =
+#else
+	CMAC_CTX **lctx =
+#endif
+		&sess->qp_ctx[qp->id].cmac;
+
+	if (unlikely(*lctx == NULL)) {
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+		*lctx = EVP_MAC_CTX_dup(sess->auth.cmac.ctx);
+#else
+		*lctx = CMAC_CTX_new();
+		CMAC_CTX_copy(*lctx, sess->auth.cmac.ctx);
+#endif
+	}
+
+	return *lctx;
+#endif
+}
+
 /** Process auth/cipher combined operation */
 static void
 process_openssl_combined_op(struct openssl_qp *qp, struct rte_crypto_op *op,
@@ -1895,42 +2007,40 @@ process_openssl_auth_op(struct openssl_qp *qp, struct rte_crypto_op *op,
 
 	switch (sess->auth.mode) {
 	case OPENSSL_AUTH_AS_AUTH:
-		ctx_a = EVP_MD_CTX_create();
-		EVP_MD_CTX_copy_ex(ctx_a, sess->auth.auth.ctx);
+		ctx_a = get_local_auth_ctx(sess, qp);
 		status = process_openssl_auth(mbuf_src, dst,
 				op->sym->auth.data.offset, NULL, NULL, srclen,
 				ctx_a, sess->auth.auth.evp_algo);
-		EVP_MD_CTX_destroy(ctx_a);
 		break;
 	case OPENSSL_AUTH_AS_HMAC:
+		ctx_h = get_local_hmac_ctx(sess, qp);
 # if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		ctx_h = EVP_MAC_CTX_dup(sess->auth.hmac.ctx);
 		status = process_openssl_auth_mac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_h);
 # else
-		ctx_h = HMAC_CTX_new();
-		HMAC_CTX_copy(ctx_h, sess->auth.hmac.ctx);
 		status = process_openssl_auth_hmac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_h);
-		HMAC_CTX_free(ctx_h);
 # endif
+#if (OPENSSL_VERSION_NUMBER >= 0x30000000L && OPENSSL_VERSION_NUMBER < 0x30003000L)
+		EVP_MAC_CTX_free(ctx_h);
+#endif
 		break;
 	case OPENSSL_AUTH_AS_CMAC:
+		ctx_c = get_local_cmac_ctx(sess, qp);
 # if OPENSSL_VERSION_NUMBER >= 0x30000000L
-		ctx_c = EVP_MAC_CTX_dup(sess->auth.cmac.ctx);
 		status = process_openssl_auth_mac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_c);
 # else
-		ctx_c = CMAC_CTX_new();
-		CMAC_CTX_copy(ctx_c, sess->auth.cmac.ctx);
 		status = process_openssl_auth_cmac(mbuf_src, dst,
 				op->sym->auth.data.offset, srclen,
 				ctx_c);
-		CMAC_CTX_free(ctx_c);
 # endif
+#if (OPENSSL_VERSION_NUMBER >= 0x30000000L && OPENSSL_VERSION_NUMBER < 0x30003000L)
+		EVP_MAC_CTX_free(ctx_c);
+#endif
 		break;
 	default:
 		status = -1;
diff --git a/drivers/crypto/openssl/rte_openssl_pmd_ops.c b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
index 4209c6ab6f..1bbb855a59 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd_ops.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd_ops.c
@@ -805,7 +805,7 @@ openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 		unsigned int max_nb_qps = ((struct openssl_private *)
 				dev->data->dev_private)->max_nb_qpairs;
 		return sizeof(struct openssl_session) +
-				(sizeof(void *) * max_nb_qps);
+				(sizeof(struct evp_ctx_pair) * max_nb_qps);
 	}
 
 	/*
@@ -818,10 +818,11 @@ openssl_pmd_sym_session_get_size(struct rte_cryptodev *dev)
 
 	/*
 	 * Otherwise, the size of the flexible array member should be enough to
-	 * fit pointers to per-qp contexts.
+	 * fit pointers to per-qp contexts. This is twice the number of queue
+	 * pairs, to allow for auth and cipher contexts.
 	 */
 	return sizeof(struct openssl_session) +
-		(sizeof(void *) * dev->data->nb_queue_pairs);
+		(sizeof(struct evp_ctx_pair) * dev->data->nb_queue_pairs);
 }
 
 /** Returns the size of the asymmetric session structure */
-- 
2.34.1



* [PATCH v4 5/5] crypto/openssl: only set cipher padding once
  2024-06-07 12:47 ` [PATCH v4 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                     ` (3 preceding siblings ...)
  2024-06-07 12:47   ` [PATCH v4 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
@ 2024-06-07 12:47   ` Jack Bond-Preston
  4 siblings, 0 replies; 34+ messages in thread
From: Jack Bond-Preston @ 2024-06-07 12:47 UTC (permalink / raw)
  To: Kai Ji; +Cc: dev, Wathsala Vithanage

Setting the cipher padding has a noticeable performance footprint, and
it doesn't need to be done for every call to
process_openssl_cipher_{en,de}crypt(). Once set, OpenSSL re-applies the
padding setting on every future context re-init, so for every buffer
after the first one the padding ends up being set twice.

Instead, just set the cipher padding once - when configuring the session
parameters - avoiding the unnecessary double setting behaviour. This is
skipped for AEAD ciphers, where disabling padding is not necessary.
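
Sketch of the resulting flow (the actual change is the small diff
below):

	/* once, when the session's cipher parameters are configured */
	EVP_CIPHER_CTX_set_padding(sess->cipher.ctx, 0);

	/* per-buffer encrypt/decrypt path: no EVP_CIPHER_CTX_set_padding()
	 * call any more - OpenSSL re-applies the stored setting on each
	 * context re-init, so nothing needs doing per buffer. */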

Throughput performance uplift measurements for AES-CBC-128 encrypt on
Ampere Altra Max platform:
1 worker lcore
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.97 |               3.72 |    25.2% |
|             256 |          8.10 |               9.42 |    16.3% |
|            1024 |         14.22 |              15.18 |     6.8% |
|            2048 |         16.28 |              16.93 |     4.0% |
|            4096 |         17.58 |              17.97 |     2.2% |

8 worker lcores
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         21.27 |              29.85 |    40.3% |
|             256 |         60.05 |              75.53 |    25.8% |
|            1024 |        110.11 |             121.56 |    10.4% |
|            2048 |        128.05 |             135.40 |     5.7% |
|            4096 |        139.45 |             143.76 |     3.1% |

Signed-off-by: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Reviewed-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
---
 drivers/crypto/openssl/rte_openssl_pmd.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/openssl/rte_openssl_pmd.c b/drivers/crypto/openssl/rte_openssl_pmd.c
index 7e2e505222..101111e85b 100644
--- a/drivers/crypto/openssl/rte_openssl_pmd.c
+++ b/drivers/crypto/openssl/rte_openssl_pmd.c
@@ -619,6 +619,8 @@ openssl_set_session_cipher_parameters(struct openssl_session *sess,
 		return -ENOTSUP;
 	}
 
+	EVP_CIPHER_CTX_set_padding(sess->cipher.ctx, 0);
+
 	return 0;
 }
 
@@ -1124,8 +1126,6 @@ process_openssl_cipher_encrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	if (EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, iv) <= 0)
 		goto process_cipher_encrypt_err;
 
-	EVP_CIPHER_CTX_set_padding(ctx, 0);
-
 	if (process_openssl_encryption_update(mbuf_src, offset, &dst,
 			srclen, ctx, inplace))
 		goto process_cipher_encrypt_err;
@@ -1174,8 +1174,6 @@ process_openssl_cipher_decrypt(struct rte_mbuf *mbuf_src, uint8_t *dst,
 	if (EVP_DecryptInit_ex(ctx, NULL, NULL, NULL, iv) <= 0)
 		goto process_cipher_decrypt_err;
 
-	EVP_CIPHER_CTX_set_padding(ctx, 0);
-
 	if (process_openssl_decryption_update(mbuf_src, offset, &dst,
 			srclen, ctx, inplace))
 		goto process_cipher_decrypt_err;
-- 
2.34.1



* Re: [PATCH 0/5] OpenSSL PMD Optimisations
  2024-06-03 16:01 [PATCH 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
                   ` (8 preceding siblings ...)
  2024-06-07 12:47 ` [PATCH v4 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
@ 2024-06-24 16:14 ` Ji, Kai
  9 siblings, 0 replies; 34+ messages in thread
From: Ji, Kai @ 2024-06-24 16:14 UTC (permalink / raw)
  To: Jack Bond-Preston; +Cc: dev


Series-acked-by: Kai Ji <kai.ji@intel.com>
________________________________
From: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
Sent: 03 June 2024 17:01
Cc: dev@dpdk.org <dev@dpdk.org>
Subject: [PATCH 0/5] OpenSSL PMD Optimisations

The current implementation of the OpenSSL PMD has numerous performance issues.
These revolve around certain operations being performed on a per buffer/packet
basis, when they in fact could be performed less often - usually just during
initialisation.


[1/5]: fix GCM and CCM thread unsafe ctxs
=========================================
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is
implemented in the same naive (and inefficient) way as existing fixes for other
ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
===========================================
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR.


[5/5]: only set cipher padding once
=====================================
Fixes an inefficient usage of the OpenSSL API when disabling padding for
ciphers. This behaviour was introduced in commit 6b283a03216e ("crypto/openssl:
fix extra bytes written at end of data"), which fixes a bug - however, the
EVP_CIPHER_CTX_set_padding() call was placed in a suboptimal location.

This patch fixes this, preventing the padding being disabled for the cipher
twice per buffer (with the second essentially being a wasteful no-op).


[3/5] and [4/5]: per-queue-pair context clones
==============================================
[3/5] and [4/5] aim to fix the key issue that was identified with the
performance of the OpenSSL PMD - cloning of OpenSSL CTX structures on a
per-buffer basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session contexts
>
>     Session contexts are used for temporary storage when processing a
>     packet.
>     If packets for the same session are to be processed simultaneously on
>     multiple cores, separate contexts must be used.
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no longer be defined as a
>     variable on the stack: it must be allocated. This in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be used
from multiple threads simultaneously, so this patch is required for correctness
(assuming the need to support using the same openssl_session across multiple
lcores). The downside here is that, as the commit message notes, this does
reduce performance quite significantly.

It is worth noting that while ciphers were already correctly cloned for cipher
ops and auth ops, this behaviour was actually absent for combined ops (AES-GCM
and AES-CCM), due to this part of the fix being reverted in 75adf1eae44f
("crypto/openssl: update HMAC routine with 3.0 EVP API"). [1/5] addressed this
issue of correctness, and [3/5] implements a more performant fix on top of this.

These two patches aim to remedy the performance loss caused by the introduction
of cipher context cloning. An approach of maintaining an array of pointers,
inside the OpenSSL session structure, to per-queue-pair clones of the OpenSSL
CTXs is used. Consequently, there is no need to perform cloning of the context
for every buffer - whilst keeping the guarantee that one context is not being
used on multiple lcores simultaneously. The cloning of the main context into the
array's per-qp context entries is performed lazily/as-needed. There are some
trade-offs/judgement calls that were made:
 - The first op that a given queue pair processes for a given openssl_session
   will cost roughly the same as an op in the existing implementation. However, all
   subsequent calls for the same openssl_session on the same queue pair will not
   incur this extra work. Thus, whilst the first op on a session on a queue pair
   will be slower than subsequent ones, this slower first op is still equivalent
   to *every* op without these patches. The alternative would be pre-populating
   this array when the openssl_session is initialised, but this would waste
   memory and processing time if not all queue pairs end up doing work from this
   openssl_session.
 - Each pointer inside the array of per-queue-pair pointers has not been cache
   aligned, because updates only occur on the first buffer per-queue-pair
   per-session, making the impact of false sharing negligible compared to the
   extra memory usage of the alignment.

[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and [4/5]
for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).

Compared to before, this approach comes with a drawback of extra memory usage -
the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, with a length
  equal to the number of qps in use multiplied by 2 (to allow auth and cipher
  contexts), per openssl_session structure. openssl_pmd_sym_session_get_size()
  is modified to return a size large enough to support this. At the time this
  function is called (before the user creates the session mempool), the PMD may
  not yet be configured with the requested number of queue pairs. In this case,
  the maximum number of queue pairs allowed by the PMD (current default is 8) is
  used, to ensure the allocations will be large enough. Thus, the user may be
  able to slightly reduce the memory used by OpenSSL sessions by first
  configuring the PMD's queue pair count, then requesting the size of the
  sessions and creating the session mempool. There is also a special case where
  the number of queue pairs is 1, in which case the array is not allocated or
  used at all. Overall, this memory usage by the session structure itself is
  worst-case 128 bytes per session (the default maximum number of queue pairs
  allowed by the OpenSSL PMD is 8, so 8 qps * 8 bytes * 2 ctxs), plus the extra
  space to store the length of the array and auth context offset, resulting in
  an increase in total size from 152 bytes to 280 bytes (worked through in the
  short sketch after this list).
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously, the
  clones were allocated and freed per-operation, meaning the lifetime of the
  allocations was only the duration of the operation. Now, these allocations are
  lifted out to share the lifetime of the session. As a result, workloads with
  many long-lived sessions shared across many queue pairs will see an
  increase in total memory usage.
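
For the first point, the worst-case arithmetic works out as follows
(assuming 8-byte pointers, so each struct evp_ctx_pair - one cipher
pointer plus one auth/hmac/cmac pointer - is 16 bytes):

	per-qp array (worst case):  8 qps * 16 bytes = 128 bytes
	session size:               152 bytes (before) -> 280 bytes (after)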


Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
 - The version of OpenSSL used was 3.3.0
 - The hardware used for the benchmarks was the following two machine configs:
     * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
     * x86    : Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
 - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024, 2048,
   4096, 8192.
 - The worker lcore counts tested were: 1, 2, 4, 8
 - The algorithms and associated operations tested were:
     * Cipher-only       AES-CBC-128           (Encrypt and Decrypt)
     * Cipher-only       3DES-CTR-128          (Encrypt only)
     * Auth-only         SHA1-HMAC             (Generate only)
     * Auth-only         AES-CMAC              (Generate only)
     * AESNI             AES-GCM-128           (Encrypt and Decrypt)
     * Cipher-then-Auth  AES-CBC-128-HMAC-SHA1 (Encrypt only)
 - EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.
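
For illustration, a representative invocation for the AES-CBC-128
encrypt case looks roughly like the following (the exact option set
used for the published numbers isn't recorded here, so treat this as
an assumption to be checked against the dpdk-test-crypto-perf docs):

	./dpdk-test-crypto-perf -l 1,2 --vdev crypto_openssl --legacy-mem -- \
		--ptest throughput --devtype crypto_openssl \
		--optype cipher-only --cipher-algo aes-cbc --cipher-op encrypt \
		--cipher-key-sz 16 --cipher-iv-sz 16 \
		--buffer-sz 32,64,128,256,512,1024,2048,4096,8192 \
		--burst-sz 32 --total-ops 10000000 --silent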

The sets of patches applied for benchmarks were:
 - No patches applied (HEAD of upstream main)
 -   [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
 - [1-2/5] applied (adds 3DES-CTR fix)
 - [1-3/5] applied (adds per-qp cipher contexts)
 - [1-4/5] applied (adds per-qp auth contexts)
 - [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm platform,
with all patches applied. Very similar results were achieved on the Intel
platform, and the full set of results, including the Intel ones, is available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput speedup achieved between the base (main branch
HEAD) and optimised (all patches applied) versions of the PMD was carried out,
with the varying worker lcore counts.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.84 |               2.04 |   144.6% |
|              64 |          1.61 |               3.72 |   131.3% |
|             128 |          2.97 |               6.24 |   110.2% |
|             256 |          5.14 |               9.42 |    83.2% |
|             512 |          8.10 |              12.62 |    55.7% |
|            1024 |         11.37 |              15.18 |    33.5% |
|            2048 |         14.26 |              16.93 |    18.7% |
|            4096 |         16.35 |              17.97 |     9.9% |
|            8192 |         17.61 |              18.51 |     5.1% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.53 |              16.49 |   974.8% |
|              64 |          3.04 |              29.85 |   881.3% |
|             128 |          5.96 |              50.07 |   739.8% |
|             256 |         10.54 |              75.53 |   616.5% |
|             512 |         21.60 |             101.14 |   368.2% |
|            1024 |         41.27 |             121.56 |   194.6% |
|            2048 |         72.99 |             135.40 |    85.5% |
|            4096 |        103.39 |             143.76 |    39.0% |
|            8192 |        125.48 |             148.06 |    18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so profiling of the existing
PMD implementation with multiple lcores was performed. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing this context. OpenSSL holds only one instance of each
EVP_CIPHER, and uses a reference counter to track freeing them. This means that
the original implementation spends a very high amount of time incrementing and
decrementing this reference counter in EVP_CIPHER_CTX_copy and
EVP_CIPHER_CTX_free, respectively. For small buffer sizes, and with more lcores,
this reference count modification happens extremely frequently - thrashing this
refcount on all lcores and causing a huge slowdown. The optimised version avoids
this by not performing the copy and free (and thus associated refcount
modifications) on every buffer.
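
In code terms, the per-buffer pattern the profiling highlighted in the
original implementation is

	EVP_CIPHER_CTX *ctx_copy = EVP_CIPHER_CTX_new();
	/* bumps the shared EVP_CIPHER's refcount */
	EVP_CIPHER_CTX_copy(ctx_copy, sess->cipher.ctx);
	/* ... process one buffer ... */
	EVP_CIPHER_CTX_free(ctx_copy);  /* drops the refcount again */

whereas with [3/5] applied each buffer just does

	EVP_CIPHER_CTX *ctx = get_local_cipher_ctx(sess, qp);
	/* ... process one buffer; no copy/free, so no refcount traffic ... */

with the clone happening only on the first buffer per session per queue
pair (a simplified sketch, not the literal driver code).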

SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.32 |               0.76 |   135.9% |
|              64 |          0.63 |               1.43 |   126.9% |
|             128 |          1.21 |               2.60 |   115.4% |
|             256 |          2.23 |               4.42 |    98.1% |
|             512 |          3.88 |               6.80 |    75.5% |
|            1024 |          6.13 |               9.30 |    51.8% |
|            2048 |          8.65 |              11.39 |    31.7% |
|            4096 |         10.90 |              12.85 |    17.9% |
|            8192 |         12.54 |              13.74 |     9.5% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.49 |               5.99 |  1110.3% |
|              64 |          0.98 |              11.30 |  1051.8% |
|             128 |          1.95 |              20.67 |   960.3% |
|             256 |          3.90 |              35.18 |   802.4% |
|             512 |          7.83 |              54.13 |   590.9% |
|            1024 |         15.80 |              74.11 |   369.2% |
|            2048 |         31.30 |              90.97 |   190.6% |
|            4096 |         58.59 |             102.70 |    75.3% |
|            8192 |         85.93 |             109.88 |    27.9% |

We can see the results are similar to those for AES-CBC-128 cipher operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] causes a slowdown in AES-GCM, as the fix for
the concurrency bug introduces a large overhead.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

However, applying [3/5] rectifies most of this performance drop, as shown by the
following results with it applied.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.39 |               1.28 |    -7.8% |
|              64 |          2.60 |               2.44 |    -6.2% |
|             128 |          4.77 |               4.45 |    -6.8% |
|             256 |          7.69 |               7.22 |    -6.1% |
|             512 |         11.31 |              10.97 |    -3.0% |
|            1024 |         15.33 |              15.07 |    -1.7% |
|            2048 |         18.74 |              18.51 |    -1.2% |
|            4096 |         21.11 |              20.96 |    -0.7% |
|            8192 |         22.55 |              22.50 |    -0.2% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |         10.59 |              10.35 |    -2.3% |
|              64 |         19.94 |              19.46 |    -2.4% |
|             128 |         36.32 |              35.64 |    -1.9% |
|             256 |         58.84 |              57.80 |    -1.8% |
|             512 |         87.38 |              87.37 |    -0.0% |
|            1024 |        119.71 |             120.22 |     0.4% |
|            2048 |        147.69 |             147.93 |     0.2% |
|            4096 |        167.39 |             167.48 |     0.1% |
|            8192 |        179.80 |             179.87 |     0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small slowdown
at smaller buffer sizes. This represents the overhead required to make AES-GCM
thread safe. These patches have rectified this lack of safety without causing a
significant performance impact, especially compared to naive per-buffer cipher
context cloning.

3DES-CTR Encrypt
----------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.12 |               0.22 |    89.7% |
|              64 |          0.16 |               0.22 |    43.6% |
|             128 |          0.18 |               0.23 |    22.3% |
|             256 |          0.20 |               0.23 |    10.8% |
|             512 |          0.21 |               0.23 |     5.1% |
|            1024 |          0.22 |               0.23 |     2.7% |
|            2048 |          0.22 |               0.23 |     1.3% |
|            4096 |          0.23 |               0.23 |     0.4% |
|            8192 |          0.23 |               0.23 |     0.4% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.68 |               1.77 |   160.1% |
|              64 |          1.00 |               1.78 |    78.3% |
|             128 |          1.29 |               1.80 |    39.6% |
|             256 |          1.50 |               1.80 |    19.8% |
|             512 |          1.64 |               1.80 |    10.0% |
|            1024 |          1.72 |               1.81 |     5.1% |
|            2048 |          1.76 |               1.81 |     2.7% |
|            4096 |          1.78 |               1.81 |     1.5% |
|            8192 |          1.80 |               1.81 |     0.7% |

[2/5] yields good results - the performance increase is high for lower buffer
sizes, where the cost of re-initialising the extra parameters is more
significant relative to the cost of the cipher operation.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated
speedup tables, plus additional bar charts showing the throughput comparison
across different sets of applied patches) - for both Intel and Arm platforms -
are available. However, I'm not sure of the etiquette regarding attachments of
such files, so I haven't attached them for now. If you are interested in
reviewing them, please reach out and I will find a way to get them to you.

Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 244 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 ++-
 4 files changed, 260 insertions(+), 71 deletions(-)

--
2.34.1




Thread overview: 34+ messages
2024-06-03 16:01 [PATCH 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
2024-06-03 16:01 ` [PATCH 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
2024-06-03 16:12   ` Jack Bond-Preston
2024-06-03 16:01 ` [PATCH 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
2024-06-03 16:01 ` [PATCH 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
2024-06-03 16:01 ` [PATCH 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
2024-06-03 16:30   ` Jack Bond-Preston
2024-06-03 16:01 ` [PATCH 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
2024-06-03 18:43 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
2024-06-03 18:43   ` [PATCH v2 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
2024-06-03 18:43   ` [PATCH v2 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
2024-06-03 18:43   ` [PATCH v2 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
2024-06-03 18:43   ` [PATCH v2 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
2024-06-03 18:43   ` [PATCH v2 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
2024-06-03 18:59 ` [PATCH v2 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
2024-06-03 18:59   ` [PATCH v2 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
2024-06-03 18:59   ` [PATCH v2 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
2024-06-03 18:59   ` [PATCH v2 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
2024-06-03 18:59   ` [PATCH v2 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
2024-06-03 18:59   ` [PATCH v2 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
2024-06-06 10:20 ` [PATCH v3 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
2024-06-06 10:20   ` [PATCH v3 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
2024-06-06 10:44     ` [EXTERNAL] " Akhil Goyal
2024-06-06 10:20   ` [PATCH v3 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
2024-06-06 10:20   ` [PATCH v3 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
2024-06-06 10:20   ` [PATCH v3 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
2024-06-06 10:20   ` [PATCH v3 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
2024-06-07 12:47 ` [PATCH v4 0/5] OpenSSL PMD Optimisations Jack Bond-Preston
2024-06-07 12:47   ` [PATCH v4 1/5] crypto/openssl: fix GCM and CCM thread unsafe ctxs Jack Bond-Preston
2024-06-07 12:47   ` [PATCH v4 2/5] crypto/openssl: only init 3DES-CTR key + impl once Jack Bond-Preston
2024-06-07 12:47   ` [PATCH v4 3/5] crypto/openssl: per-qp cipher context clones Jack Bond-Preston
2024-06-07 12:47   ` [PATCH v4 4/5] crypto/openssl: per-qp auth " Jack Bond-Preston
2024-06-07 12:47   ` [PATCH v4 5/5] crypto/openssl: only set cipher padding once Jack Bond-Preston
2024-06-24 16:14 ` [PATCH 0/5] OpenSSL PMD Optimisations Ji, Kai
