From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 9BEC94414A;
	Mon,  3 Jun 2024 18:01:28 +0200 (CEST)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id 2E2BC42E38;
	Mon,  3 Jun 2024 18:01:28 +0200 (CEST)
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by mails.dpdk.org (Postfix) with ESMTP id 2BF694067B
 for <dev@dpdk.org>; Mon,  3 Jun 2024 18:01:26 +0200 (CEST)
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id CEBE01042
 for <dev@dpdk.org>; Mon,  3 Jun 2024 09:01:49 -0700 (PDT)
Received: from cesw-amp-gbt-1s-m12830-01.lab.cambridge.arm.com
 (cesw-amp-gbt-1s-m12830-01.lab.cambridge.arm.com [10.7.10.57])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 16AEE3F64C
 for <dev@dpdk.org>; Mon,  3 Jun 2024 09:01:24 -0700 (PDT)
From: Jack Bond-Preston <jack.bond-preston@foss.arm.com>
To: 
Cc: dev@dpdk.org
Subject: [PATCH 0/5] OpenSSL PMD Optimisations
Date: Mon,  3 Jun 2024 16:01:14 +0000
Message-Id: <20240603160119.1279476-1-jack.bond-preston@foss.arm.com>
X-Mailer: git-send-email 2.34.1
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

The current implementation of the OpenSSL PMD has numerous performance issues.
These revolve around certain operations being performed on a per buffer/packet
basis, when they in fact could be performed less often - usually just during
initialisation.


[1/5]: fix GCM and CCM thread unsafe ctxs
=========================================
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is
implemented in the same naive (and inefficient) way as existing fixes for other
ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
===========================================
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR.


[5/5]: only set cipher padding once
=====================================
Fixes an inefficient usage of the OpenSSL API when disabling padding for
ciphers. This behaviour was introduced in commit 6b283a03216e ("crypto/openssl:
fix extra bytes written at end of data"), which fixes a bug - however, the
EVP_CIPHER_CTX_set_padding() call was placed in a suboptimal location.

This patch fixes this, preventing the padding being disabled for the cipher
twice per buffer (with the second essentially being a wasteful no-op).


[3/5] and [4/5]: per-queue-pair context clones
==============================================
[3/5] and [4/5] aim to fix the key issue that was identified with the
performance of the OpenSSL PMD - cloning of OpenSSL CTX structures on a
per-buffer basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session contexts
>
>     Session contexts are used for temporary storage when processing a
>     packet.
>     If packets for the same session are to be processed simultaneously on
>     multiple cores, separate contexts must be used.
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no longer be defined as a
>     variable on the stack: it must be allocated. This in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be used
from multiple threads simultaneously, so this patch is required for correctness
(assuming the need to support using the same openssl_session across multiple
lcores). The downside here is that, as the commit message notes, this does
reduce performance quite significantly.

It is worth noting that while ciphers were already correctly cloned for cipher
ops and auth ops, this behaviour was actually absent for combined ops (AES-GCM
and AES-CCM), due to this part of the fix being reverted in 75adf1eae44f
("crypto/openssl: update HMAC routine with 3.0 EVP API"). [1/5] addressed this
issue of correctness, and [3/5] implements a more performant fix on top of this.

These two patches aim to remedy the performance loss caused by the introduction
of cipher context cloning. An approach of maintaining an array of pointers,
inside the OpenSSL session structure, to per-queue-pair clones of the OpenSSL
CTXs is used. Consequently, there is no need to perform cloning of the context
for every buffer - whilst keeping the guarantee that one context is not being
used on multiple lcores simultaneously. The cloning of the main context into the
array's per-qp context entries is performed lazily/as-needed. There are some
trade-offs/judgement calls that were made:
 - The first call for a queue pair for an op from a given openssl_session will
   be roughly equivalent to an op from the existing implementation. However, all
   subsequent calls for the same openssl_session on the same queue pair will not
   incur this extra work. Thus, whilst the first op on a session on a queue pair
   will be slower than subsequent ones, this slower first op is still equivalent
   to *every* op without these patches. The alternative would be pre-populating
   this array when the openssl_session is initialised, but this would waste
   memory and processing time if not all queue pairs end up doing work from this
   openssl_session.
 - Each pointer inside the array of per-queue-pair pointers has not been cache
   aligned, because updates only occur on the first buffer per-queue-pair
   per-session, making the impact of false sharing negligible compared to the
   extra memory usage of the alignment.

[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and [4/5]
for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).

Compared to before, this approach comes with a drawback of extra memory usage -
the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, with a length
  equal to the number of qps in use multiplied by 2 (to allow auth and cipher
  contexts), per openssl_session structure. openssl_pmd_sym_session_get_size()
  is modified to return a size large enough to support this. At the time this
  function is called (before the user creates the session mempool), the PMD may
  not yet be configured with the requested number of queue pairs. In this case,
  the maximum number of queue pairs allowed by the PMD (current default is 8) is
  used, to ensure the allocations will be large enough. Thus, the user may be
  able to slightly reduce the memory used by OpenSSL sessions by first
  configuring the PMD's queue pair count, then requesting the size of the
  sessions and creating the session mempool. There is also a special case where
  the number of queue pairs is 1, in which case the array is not allocated or
  used at all. Overall, this memory usage by the session structure itself is
  worst-case 128 bytes per session (the default maximum number of queue pairs
  allowed by the OpenSSL PMD is 8, so 8qps * 8bytes * 2ctxs), plus the extra
  space to store the length of the array and auth context offset, resulting in
  an increase in total size from 152 bytes to 280 bytes.
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously, the
  clones were allocated and freed per-operation, meaning the lifetime of the
  allocations was only the duration of the operation. Now, these allocations are
  lifted out to share the lifetime of the session. This results in situations
  with many long-lived sessions shared across many queue pairs causing an
  increase in total memory usage.


Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
 - The version of OpenSSL used was 3.3.0
 - The hardware used for the benchmarks was the following two machine configs:
     * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
     * x86    : Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
 - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024, 2048,
   4096, 8192.
 - The worker lcore counts tested were: 1, 2, 4, 8
 - The algorithms and associated operations tested were:
     * Cipher-only       AES-CBC-128           (Encrypt and Decrypt)
     * Cipher-only       3DES-CTR-128          (Encrypt only)
     * Auth-only         SHA1-HMAC             (Generate only)
     * Auth-only         AES-CMAC              (Generate only)
     * AESNI             AES-GCM-128           (Encrypt and Decrypt)
     * Cipher-then-Auth  AES-CBC-128-HMAC-SHA1 (Encrypt only)
  - EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.

The sets of patches applied for benchmarks were:
 - No patches applied (HEAD of upstream main)
 -   [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
 - [1-2/5] applied (adds 3DES-CTR fix)
 - [1-3/5] applied (adds per-qp cipher contexts)
 - [1-4/5] applied (adds per-qp auth contexts)
 - [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm platform,
with all patches applied. Very similar results were achieved on the Intel
platform, and the full set of results, including the Intel ones, is available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput speedup achieved between the base (main branch
HEAD) and optimised (all patches applied) versions of the PMD was carried out,
with the varying worker lcore counts.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.84 |               2.04 |   144.6% |
|              64 |          1.61 |               3.72 |   131.3% |
|             128 |          2.97 |               6.24 |   110.2% |
|             256 |          5.14 |               9.42 |    83.2% |
|             512 |          8.10 |              12.62 |    55.7% |
|            1024 |         11.37 |              15.18 |    33.5% |
|            2048 |         14.26 |              16.93 |    18.7% |
|            4096 |         16.35 |              17.97 |     9.9% |
|            8192 |         17.61 |              18.51 |     5.1% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.53 |              16.49 |   974.8% |
|              64 |          3.04 |              29.85 |   881.3% |
|             128 |          5.96 |              50.07 |   739.8% |
|             256 |         10.54 |              75.53 |   616.5% |
|             512 |         21.60 |             101.14 |   368.2% |
|            1024 |         41.27 |             121.56 |   194.6% |
|            2048 |         72.99 |             135.40 |    85.5% |
|            4096 |        103.39 |             143.76 |    39.0% |
|            8192 |        125.48 |             148.06 |    18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so profiling of the existing
PMD implementation with multiple lcores was performed. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing this context. OpenSSL holds only one instance of each
EVP_CIPHER, and uses a reference counter to track freeing them. This means that
the original implementation spends a very high amount of time incrementing and
decrementing this reference counter in EVP_CIPHER_CTX_copy and
EVP_CIPHER_CTX_free, respectively. For small buffer sizes, and with more lcores,
this reference count modification happens extremely frequently - thrashing this
refcount on all lcores and causing a huge slowdown. The optimised version avoids
this by not performing the copy and free (and thus associated refcount
modifications) on every buffer.

SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.32 |               0.76 |   135.9% |
|              64 |          0.63 |               1.43 |   126.9% |
|             128 |          1.21 |               2.60 |   115.4% |
|             256 |          2.23 |               4.42 |    98.1% |
|             512 |          3.88 |               6.80 |    75.5% |
|            1024 |          6.13 |               9.30 |    51.8% |
|            2048 |          8.65 |              11.39 |    31.7% |
|            4096 |         10.90 |              12.85 |    17.9% |
|            8192 |         12.54 |              13.74 |     9.5% |
8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.49 |               5.99 |  1110.3% |
|              64 |          0.98 |              11.30 |  1051.8% |
|             128 |          1.95 |              20.67 |   960.3% |
|             256 |          3.90 |              35.18 |   802.4% |
|             512 |          7.83 |              54.13 |   590.9% |
|            1024 |         15.80 |              74.11 |   369.2% |
|            2048 |         31.30 |              90.97 |   190.6% |
|            4096 |         58.59 |             102.70 |    75.3% |
|            8192 |         85.93 |             109.88 |    27.9% |

We can see the results are similar as for AES-CBC-128 cipher operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] causes a slowdown in AES-GCM, as the fix for
the concurrency bug introduces a large overhead.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |          2.60 |               1.31 |   -49.5% |
|             256 |          7.69 |               4.45 |   -42.1% |
|            1024 |         15.33 |              11.30 |   -26.3% |
|            2048 |         18.74 |              15.37 |   -18.0% |
|            4096 |         21.11 |              18.80 |   -10.9% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              64 |         19.94 |               2.83 |   -85.8% |
|             256 |         58.84 |              11.00 |   -81.3% |
|            1024 |        119.71 |              42.46 |   -64.5% |
|            2048 |        147.69 |              80.91 |   -45.2% |
|            4096 |        167.39 |             121.25 |   -27.6% |

However, applying [3/5] rectifies most of this performance drop, as shown by the
following results with it applied.

1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          1.39 |               1.28 |    -7.8% |
|              64 |          2.60 |               2.44 |    -6.2% |
|             128 |          4.77 |               4.45 |    -6.8% |
|             256 |          7.69 |               7.22 |    -6.1% |
|             512 |         11.31 |              10.97 |    -3.0% |
|            1024 |         15.33 |              15.07 |    -1.7% |
|            2048 |         18.74 |              18.51 |    -1.2% |
|            4096 |         21.11 |              20.96 |    -0.7% |
|            8192 |         22.55 |              22.50 |    -0.2% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |         10.59 |              10.35 |    -2.3% |
|              64 |         19.94 |              19.46 |    -2.4% |
|             128 |         36.32 |              35.64 |    -1.9% |
|             256 |         58.84 |              57.80 |    -1.8% |
|             512 |         87.38 |              87.37 |    -0.0% |
|            1024 |        119.71 |             120.22 |     0.4% |
|            2048 |        147.69 |             147.93 |     0.2% |
|            4096 |        167.39 |             167.48 |     0.1% |
|            8192 |        179.80 |             179.87 |     0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small slowdown
at smaller buffer sizes. This represents the overhead required to make AES-GCM
thread safe. These patches have rectified this lack of safety without causing a
significant performance impact, especially compared to naive per-buffer cipher
context cloning.

3DES-CTR Encrypt
----------------
1 worker lcore:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.12 |               0.22 |    89.7% |
|              64 |          0.16 |               0.22 |    43.6% |
|             128 |          0.18 |               0.23 |    22.3% |
|             256 |          0.20 |               0.23 |    10.8% |
|             512 |          0.21 |               0.23 |     5.1% |
|            1024 |          0.22 |               0.23 |     2.7% |
|            2048 |          0.22 |               0.23 |     1.3% |
|            4096 |          0.23 |               0.23 |     0.4% |
|            8192 |          0.23 |               0.23 |     0.4% |

8 worker lcores:
|   buffer sz (B) |   prev (Gbps) |   optimised (Gbps) |   uplift |
|-----------------+---------------+--------------------+----------|
|              32 |          0.68 |               1.77 |   160.1% |
|              64 |          1.00 |               1.78 |    78.3% |
|             128 |          1.29 |               1.80 |    39.6% |
|             256 |          1.50 |               1.80 |    19.8% |
|             512 |          1.64 |               1.80 |    10.0% |
|            1024 |          1.72 |               1.81 |     5.1% |
|            2048 |          1.76 |               1.81 |     2.7% |
|            4096 |          1.78 |               1.81 |     1.5% |
|            8192 |          1.80 |               1.81 |     0.7% |

[1/4] yields good results - the performance increase is high for lower buffer
sizes, where the cost of re-initialising the extra parameters is more
significant compared to the cost of the cipher operation.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated
speedup tables, plus additional bar charts showing the throughput comparison
across different sets of applied patches) - for both Intel and Arm platforms -
are available. However, I'm not sure of the ettiquette regarding attachments of
such files, so I haven't attached them for now. If you are interested in
reviewing them, please reach out and I will find a way to get them to you.

Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 244 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 ++-
 4 files changed, 260 insertions(+), 71 deletions(-)

-- 
2.34.1