DPDK patches and discussions
 help / color / Atom feed
* [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
@ 2020-02-04 14:45 David Coyle
  2020-02-04 19:52 ` Jerin Jacob
  0 siblings, 1 reply; 24+ messages in thread
From: David Coyle @ 2020-02-04 14:45 UTC (permalink / raw)
  To: dev; +Cc: declan.doherty, fiona.trahe

Introduction
============

This RFC introduces a new DPDK library, rte_accelerator.

The main aim of this library is to provide a flexible and extensible way of combining one or more packet-processing functions into a single operation, thereby allowing these to be performed in parallel in optimized software libraries or in a hardware accelerator. These functions can include cryptography, compression and CRC/checksum calculation, while others can potentially be added in the future. Performing these functions in parallel as a single operation can enable a significant performance improvement.


Background
==========

There are a number of byte-wise operations which are present and common across many access network data-plane pipelines, such as Cipher, Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some prototyping has been done at Intel in relation to the 01.org access-network-dataplanes project to prove that a significant performance improvement is possible when such byte-wise operations are combined into a single pass of packet data processing. This performance boost has been prototyped for both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.

The prototypes used some protocol-specific modifications to the DPDK cryptodev library. In order to make this performance improvement consumable by network access equipment vendors, a more extensible and correct solution is required that can be upstreamed into DPDK.

Hence, the introduction of rte_accelerator.


Use Cases
=========

The primary use cases for this new library have already been mentioned. These are:

- DOCSIS MAC: Crypto-CRC
	- Order:
		- Downstream: CRC, Encrypt
		- Upstream: Decrypt, CRC
	- Specifications:
		- Crypto: 128-bit AES-CFB encryption variant for DOCSIS as described in section 11.1 of DOCSIS 3.1 Security Specification (https://apps.cablelabs.com/specification/CM-SP-SECv3.1)
		- CRC: Ethernet 32-bit CRC as defined in Ethernet/[ISO/IEC 8802-3]

- XGS-PON MAC: Crypto-CRC-BIP
	- Order:
		- Downstream: CRC, Encrypt, BIP
		- Upstream: BIP, Decrypt, CRC
	- Specifications:
		- Crypto: AES-128 [NIST FIPS-197] cipher, used in counter mode (AES-CTR), as described in [NIST SP800-38A].
		- CRC: Ethernet 32-bit CRC as defined in Ethernet/[ISO/IEC 8802-3]
		- BIP: 4-byte bit-interleaved even parity (BIP) field computed over the entire FS frame, refer to  ITU-T G.989.3, sections 8.1.1.5 and 8.1.2.3 (https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-G.989.3-201510-I!!PDF-E)

Note that support for both these chained operations is already available in the Intel IPSec Multi-Buffer library.

However, it is not limited to these. The following are some of the other possible use-cases, which rte_accelerator will allow for:

- Storage:
	- Compression followed by Encryption
- IPSec over UDP:
	- UDP Checksum calculation followed by Encryption

While DPDK's rte_cryptodev and rte_compressdev allow many cryptographic and compression algorithms to be chained together in one operation, there is no way to chain these with any error detection or checksum algorithms. And there is no way to chain crypto and compression algorithms together. rte_accelerator will allow these chains to be created, and also allow any future type of operation to be easily added.


Architecture
============

The following diagram shows where rte_accelerator fits in an overall application architecture.

As can be seen from the diagram, the rte_accelerator API will depend on existing DPDK device libraries (i.e. rte_cryptodev and rte_compressdev) for existing features like crypto and compression. However, any new services/functions not covered in the existing libraries, such as CRC and BIP, will be provided directly by the rte_accelerator API:

    +-------------------------------------------------------------------------------------+
    |                                                                                     |
    |                                   Application                                       |
    |                     (e.g. vCMTS (DOCSIS), vOLT (XGS-PON), etc.)                     |
    |                                                                                     |
    +-------------------------------------------------------------------------------------+
                      |                |                                 |
    +-----------------|----------------|---------------------------------|----------------+
    |                 |                |   DPDK                          |                |
    |                 |                |                                 |                |
    |                 |                | Session creation/deletion       |                |
    |                 |                ~ Operation creation/deletion     |                |
    |    Device       |                | Operation enqueue/dequeue       |    Device      |
    | initialization, |                |                                 | initialization,|
    | configuration,  ~     +--------------------------------------+     ~ configuration, |
    |    reset        |     |  rte_accelerator                     |     |    reset       |
    |                 |     |   |-- err_detect xform/op            |     |                |
    |                 |     |   |-- Session creation/deletion fns  |     |                |
    |                 |     |   `-- Operation enqueue/dequeue fns  |     |                |
    |                 |     +--------------------------------------+     |                |
    |                 |         /       /      \             \           |                |
    |                 |        /       /        \             \          |                |
    |                 |       /       |          |             \         |                |
    |  +------------------------+     |          |     +------------------------+         |
    |  |  rte_cryptodev         |     |          |     |  rte_compressdev       |         |
    |  |   |-- sym xforms/ops   |     |          |     |   `-- comp xforms/ops  |   ...   |
    |  |   `-- asym xforms/ops  |     |          |     |                        |         |
    |  +------------------------+     |          |     +------------------------+         |
    |                  \     \        /          |             /                          |
    |                   \     \______/_________   \      _____/                           |
    |                    \          /          \   \    /                                 |
    |                   +------------+       +------------+                               |
    |                   |            |       |            |                               |
    |                   |  AESNI-MB  |       |    QAT     |      ...                      |
    |                   |    PMD     |       |    PMD     |                               |
    |                   |            |       |            |                               |
    |                   +------------+       +------------+                               |
    |                          |                    |                                     |
    +--------------------------|--------------------|-------------------------------------+
                               |                    |
                        +------------+       +------------+
                        |            |       |            |
                        |  AESNI-MB  |       |   QAT HW   |
                        |   SW LIB   |       |            |
                        |            |       |            |
                        +------------+       +------------+

Some key points are:
1) rte_accelerator will use xform and operation related definitions (i.e. structs, enums, defines) from existing the device libraries, such as rte_cryptodev and rte_compressdev, to create accelerator xform chains, sessions and operation chains
   a) this allows as much re-use of existing definitions as possible
2) The application code will use the existing device library functions to initialize, configure, start, stop and reset the device
3) The application code will call rte_accelerator to create/delete accelerator sessions and enqueue/dequeue accelerator operations
   a) Each device PMD will register functions with rte_accelerator to perform these tasks
4) rte_accelerator will provide definitions/structs/etc. for the error detection xform and operation
   a) the error detection 'algorithms' initially supported will include CRC32 and BIP32, but others can be added in the future as needed
   b) other xform and operation types which do not fit in any existing device libraries, such as rte_cryptodev or rte_compressdev, but need to be accelerated can also be added under rte_accelerator in the future
   c) note, however, that error detection and other xform and operation types under rte_accelerator could be moved to an independent library in the future, if required
5) rte_accelerator will not provide a capability check feature. Instead, rte_accelerator will return an error code on session creation if the xform chain specified is not supported by the underlying device
6) rte_accelerator does not support session-less mode of operation
   a) sessions MUST always be created and attached to the operations being enqueued
7) Initially, support will be added to existing AESNI-MB and QAT PMDs for rte_accelerator functionality. Support can be added to other PMDs in the future as needed
   a) To add support for rte_accelerator functionality, the PMD must implement and register the required callback functions mentioned in 3a) above


Key API Definitions
===================

The full proposed rte_accelerator API is provided at the end of the RFC.
Here, some of the key structures and functions are described.

The following structure defines an accelerator xform
- Accelerator xforms are chained together through the next field
- The accelerator xform can contain a crypto (symmetric or asymmetric), compression or error detection xform. Others can be added in the future as needed
- The order of xforms in the chain specifies the order in which those operations should be performed on the packet data

/**
 * Accelerator transform setup data
 *
 * This structure is used to specify the accelerator transforms required.
 * Multiple transforms can be chained together to specify a chain of transforms
 * such as symmetric crypto followed by error detection, or compression followed
 * by symmetric crypto. Each transform structure holds a single transform, with
 * the type field specifying which transform is contained within the union.
 */
struct rte_accelerator_xform {
	struct rte_accelerator_xform *next;
	/**<
	 * Next transform in the chain
	 * - the last transform in the chain MUST set this to NULL
	 */
	enum rte_accelerator_xform_type type;
	/**< Transform type */

	RTE_STD_C11
	union {
		struct rte_crypto_sym_xform crypto_sym;
		/**< Symmetric crypto transform */
		struct rte_crypto_asym_xform crypto_asym;
		/**< Asymmetric crypto transform */
		struct rte_comp_xform comp;
		/**< Compression transform */
		struct rte_err_detect_xform err_detect;
		/**< Error detection transform */
	};
};

The following structure defines an accelerator operation
- Accelerator operations are chained together through the next field
- The order of operations in this chain MUST match the order of xforms in the session's xform chain
- Any additional operation data (e.g. IV data) must follow immediately after this struct in memory
- The following fields MUST be set in the FIRST operation of the chain before enqueuing. These fields are ignored in the inner op structures and any subsequent rte_accelerator_op chain elements:
	- sess
	- m_src
	- m_dst
- The following fields MUST be set in ALL operations in a chain before enqueuing:
	- next
	- mempool
	- type
- After dequeuing, only the first operation in the chain will contain the overall status in the overall_status field and each chain element will contain it's individual status in the op_status field

/**
 * Accelerator operation data
 *
 * This structure is used to specify the operations for a particular session.
 * This includes specifying the source and, if required, destination mbufs and
 * the lengths and offsets of the data within these mbufs on which the
 * operations should be done. Multiple operations are chained together to
 * specify the full set of operations to be performed
 */
struct rte_accelerator_op {
	struct rte_accelerator_op *next;
	/**<
	 * Next operation in the chain
	 * - the last operation in the chain MUST set this to NULL
	 */
	struct rte_accelerator_session *sess;
	/**< Handle for the associated accelerator session */

	struct rte_mempool *mempool;
	/**< Mempool from which the operation is allocated */

	struct rte_mbuf *m_src; /**< Source mbuf */
	struct rte_mbuf *m_dst; /**< Destination mbuf */

	enum rte_accelerator_op_status overall_status;
	/**<
	 * Overall operation status
	 * - indicates if all the operations in the chain succeeded or if any
	 *   one of them failed
	 */

	uint8_t op_status;
	/**<
	 * Individual operation status
	 * - indicates the status of the individual operation in the chain
	 */

	enum rte_accelerator_op_type type;
	/**< Operation type */

	RTE_STD_C11
	union {
		struct rte_crypto_sym_op crypto_sym;
		/**< Symmetric crypto operation */
		struct rte_crypto_asym_op crypto_asym;
		/**< Asymmetric crypto operation */
		struct rte_comp_op comp;
		/**< Compression operation */
		struct rte_err_detect_op err_detect;
		/**< Error detection operation */
	};
};

The accelerator API is mainly a device-based API, with existing device PMDs extended to provide support for the API. When a device that supports the accelerator API initialises, it must setup it's accelerator context in it's private device data. The following structure defines an accelerator context - it contains a pointer to the device itself, a pointer to a structure containing the device's callback functions for session creation/deletion, operation enqueue/dequeue etc. and a count of the number of sessions attached to this context.

/**
 * Accelerator context for a device
 *
 * Accelerator instance for each driver to register their accelerator
 * operations. The application can get the accelerator context from the
 * underlying device using the API functions provided by the device's main API
 * library (e.g. rte_cryptodev_get_accelerator_ctx() for crypto devices,
 * rte_compressdev_get_accelerator_ctx() for compress devices, etc.).
 */
struct rte_accelerator_ctx {
	void *device;
	/**< Pointer to the device */
	const struct rte_accelerator_ops *ops;
	/**< Pointer to accelerator ops for the device */
	uint16_t sess_cnt;
	/**< Number of sessions attached to this context */
};

As the comment on struct rte_accelerator_ctx mentions, rte_cryptodev, rte_compressdev and any other device library which supports the rte_accelerator API in the future must provide API functions to get the accelerator context for a device based on a device id. In the cases of rte_cryptodev and rte_compressdev, these functions will have the following prototypes:

void *
rte_cryptodev_get_accelerator_ctx(uint8_t dev_id);

void *
rte_compressdev_get_accelerator_ctx(uint8_t dev_id);

An application will call these functions to get the device accelerator context, and then use this context when it wants to interact with the device for accelerator functionality. The following are some of the main accelerator API functions provided. For full details of the parameters and return values, see the full API at the end of the RFC

	- The following functions are used to create and destroy an accelerator session on a device:

		struct rte_accelerator_session *
		rte_accelerator_session_create(struct rte_accelerator_ctx *ctx,
                		               struct rte_accelerator_xform *xform,
		                               int socket_id);

		int
		rte_accelerator_session_destroy(struct rte_accelerator_ctx *ctx,
		                                struct rte_accelerator_session *sess);

	- The following functions are used to enqueue and dequeue accelerator operations to/from a device
		- The qp_id parameter specifies the queue pair of the device on which to enqueue/dequeue
		- It is the responsibility of the application to manage the queue pair assignments within the application (e.g. the same queue pair should not be used for accelerator enqueue/dequeue and cryptodev enqueue/dequeue)

		uint16_t
		rte_accelerator_ops_enqueue(struct rte_accelerator_ctx *ctx,
		                            uint16_t qp_id,
		                            struct rte_accelerator_op **ops,
		                            uint16_t nb_ops);

		uint16_t
		rte_accelerator_ops_dequeue(struct rte_accelerator_ctx *ctx,
                		            uint16_t qp_id,
		                            struct rte_accelerator_op **ops,
                		            uint16_t nb_ops);


Full API
========

The following is the full proposed API for the rte_accelerator library. There are some minor updates to the rte_cryptodev and rte_compressdev API which are also listed.

diff --git a/lib/librte_accelerator/rte_accelerator.h b/lib/librte_accelerator/rte_accelerator.h
new file mode 100644
index 0000000..dcf292b
--- /dev/null
+++ b/lib/librte_accelerator/rte_accelerator.h
@@ -0,0 +1,336 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation.
+ */
+
+#ifndef _RTE_ACCELERATOR_H_
+#define _RTE_ACCELERATOR_H_
+
+/**
+ * @file rte_accelerator.h
+ *
+ * RTE Accelerator Common Definitions
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_compat.h>
+#include <rte_common.h>
+#include <rte_mbuf.h>
+#include <rte_memory.h>
+#include <rte_mempool.h>
+#include <rte_comp.h>
+#include <rte_crypto.h>
+
+#include "rte_err_detect.h"
+
+/**
+ * Accelerator transform types
+ */
+enum rte_accelerator_xform_type {
+	RTE_ACC_XFORM_TYPE_CRYPTO_SYM,
+	/**< Symmetric crypto transform type */
+	RTE_ACC_XFORM_TYPE_CRYPTO_ASYM,
+	/**< Asymmetric crypto transform type */
+	RTE_ACC_XFORM_TYPE_COMP,
+	/**< Compression transform type */
+	RTE_ACC_XFORM_TYPE_ERR_DETECT,
+	/**< Error detection transform type */
+};
+
+/**
+ * Accelerator transform setup data
+ *
+ * This structure is used to specify the accelerator transforms required.
+ * Multiple transforms can be chained together to specify a chain of transforms
+ * such as symmetric crypto followed by error detection, or compression followed
+ * by symmetric crypto. Each transform structure holds a single transform, with
+ * the type field specifying which transform is contained within the union.
+ */
+struct rte_accelerator_xform {
+	struct rte_accelerator_xform *next;
+	/**<
+	 * Next transform in the chain
+	 * - the last transform in the chain MUST set this to NULL
+	 */
+	enum rte_accelerator_xform_type type;
+	/**< Transform type */
+
+	RTE_STD_C11
+	union {
+		struct rte_crypto_sym_xform crypto_sym;
+		/**< Symmetric crypto transform */
+		struct rte_crypto_asym_xform crypto_asym;
+		/**< Asymmetric crypto transform */
+		struct rte_comp_xform comp;
+		/**< Compression transform */
+		struct rte_err_detect_xform err_detect;
+		/**< Error detection transform */
+	};
+};
+
+/**
+ * Accelerator operation types
+ *
+ * Each value must be a power of 2 so that the operations can be combined into
+ * a bitmask (see rte_accelerator_op_pool_create())
+ */
+enum rte_accelerator_op_type {
+	RTE_ACCELERATOR_OP_TYPE_CRYPTO_SYM  = (0x1 << 0),
+	/**< Symmetric crypto operation type */
+	RTE_ACCELERATOR_OP_TYPE_CRYPTO_ASYM = (0x1 << 1),
+	/**< Asymmetric crypto operation type */
+	RTE_ACCELERATOR_OP_TYPE_COMP        = (0x1 << 2),
+	/**< Compression operation type */
+	RTE_ACCELERATOR_OP_TYPE_ERR_DETECT  = (0x1 << 3),
+	/**< Error detection operation type */
+};
+
+/**
+ * Accelerator operation status
+ */
+enum rte_accelerator_op_status {
+	RTE_ACCELERATOR_OP_STATUS_NOT_PROCESSED = 0,
+	/**< Operation has not yet been processed by a device */
+	RTE_ERR_DETECT_OP_STATUS_SUCCESS,
+	/**< Operation completed successfully */
+	RTE_ERR_DETECT_OP_STATUS_FAILURE,
+	/**< Operation completed with failure */
+};
+
+/**
+ * Accelerator operation data
+ *
+ * This structure is used to specify the operations for a particular session.
+ * This includes specifying the source and, if required, destination mbufs and
+ * the lengths and offsets of the data within these mbufs on which the
+ * operations should be done. Multiple operations are chained together to
+ * specify the full set of operations to be performed
+ *
+ * @note The rte_accelerator_op chain MUST match the session's xform
+ * chain exactly
+ * @note The first rte_accelerator_op element in the chain is the parent
+ * operation. The following fields MUST be set in this first operation before
+ * enqueuing and are ignored in the inner operations and any subsequent
+ * rte_accelerator_op chain elements:
+ * - *sess*
+ * - *m_src*
+ * - *m_dst* (if required)
+ * @note If *sess* or *m_src* is not set in the first rte_accelerator_op, this
+ * operation is invalid and will cause an error when attempting to enqueue.
+ * @note The following fields MUST be set in ALL rte_accelerator_op chain
+ * elements:
+ * - *next*
+ * - *mempool*
+ * - *type*
+ * @note After the operation has been dequeued, only the FIRST (i.e. the parent)
+ * rte_accelerator_op in the chain will contain the *overall_status*. Each
+ * chain element will contain it's individual *op_status*, the value of which is
+ * relevant to operation type (e.g. an ::rte_crypto_op_status,
+ * ::rte_comp_op_status or ::rte_err_detect_op_status)
+ */
+struct rte_accelerator_op {
+	struct rte_accelerator_op *next;
+	/**<
+	 * Next operation in the chain
+	 * - the last operation in the chain MUST set this to NULL
+	 */
+	struct rte_accelerator_session *sess;
+	/**< Handle for the associated accelerator session */
+
+	struct rte_mempool *mempool;
+	/**< Mempool from which the operation is allocated */
+
+	struct rte_mbuf *m_src; /**< Source mbuf */
+	struct rte_mbuf *m_dst; /**< Destination mbuf */
+
+	enum rte_accelerator_op_status overall_status;
+	/**<
+	 * Overall operation status
+	 * - indicates if all the operations in the chain succeeded or if any
+	 *   one of them failed
+	 */
+
+	uint8_t op_status;
+	/**<
+	 * Individual operation status
+	 * - indicates the status of the individual operation in the chain
+	 */
+
+	enum rte_accelerator_op_type type;
+	/**< Operation type */
+
+	RTE_STD_C11
+	union {
+		struct rte_crypto_sym_op crypto_sym;
+		/**< Symmetric crypto operation */
+		struct rte_crypto_asym_op crypto_asym;
+		/**< Asymmetric crypto operation */
+		struct rte_comp_op comp;
+		/**< Compression operation */
+		struct rte_err_detect_op err_detect;
+		/**< Error detection operation */
+	};
+};
+
+/**
+ * Accelerator context for a device
+ *
+ * Accelerator instance for each device driver to register their accelerator
+ * operations. The application can get the accelerator context from the
+ * underlying device using the API functions provided by the device's main API
+ * library (e.g. rte_cryptodev_get_accelerator_ctx() for crypto devices,
+ * rte_compressdev_get_accelerator_ctx() for compress devices, etc.).
+ */
+struct rte_accelerator_ctx;
+
+/**
+ * Accelerator session data
+ */
+struct rte_accelerator_session;
+
+/**
+ * Create accelerator session as specified by the transform chain
+ *
+ * @param   ctx		Accelerator device instance
+ * @param   xform	Pointer to the first element of the session transform
+ * 			chain
+ * @param   socket_id	Socket to allocate the session on
+ *
+ * @return
+ *  - Pointer to session, if successful
+ *  - NULL, on failure
+ */
+__rte_experimental
+struct rte_accelerator_session *
+rte_accelerator_session_create(struct rte_accelerator_ctx *ctx,
+			       struct rte_accelerator_xform *xform,
+			       int socket_id);
+
+/**
+ * Free accelerator session header and the session private data and return it
+ * to its original mempool
+ *
+ * @param   ctx		Accelerator device instance
+ * @param   sess	Accelerator session to be freed
+ *
+ * @return
+ *  - 0, if successful
+ *  - -EINVAL, if session is NULL
+ *  - -EBUSY, if not all device private data has been freed
+ */
+__rte_experimental
+int
+rte_accelerator_session_destroy(struct rte_accelerator_ctx *ctx,
+				struct rte_accelerator_session *sess);
+
+/**
+ * Creates an accelerator operation pool
+ *
+ * @param   name	Pool name
+ * @param   op_types	Bitmask of operations which this pool must support. This
+ * 			bitmask allows this function determine the maximum size
+ * 			of operation which must be accommodated in the mempool
+ * 			elements. See ::rte_accelerator_op_type for possible
+ * 			bitmask values
+ * @param   nb_elts	Number of elements in the pool
+ * @param   cache_size	Number of elements to cache on lcore, see
+ * 			rte_mempool_create() for further details about cache
+ * 			size
+ * @param   priv_size	Size of private data to allocate with each operation
+ * @param   socket_id	Socket to allocate the mempool on
+ *
+ * @return
+ *  - Pointer to mempool, if successful
+ *  - NULL, on failure
+ */
+__rte_experimental
+struct rte_mempool *
+rte_accelerator_op_pool_create(const char *name,
+			       uint32_t op_types,
+			       unsigned nb_elts,
+			       unsigned cache_size,
+			       uint16_t priv_size,
+			       int socket_id);
+
+/**
+ * Bulk allocate accelerator operations from a mempool
+ *
+ * @param   mempool	Accelerator operation mempool
+ * @param   ops		Array in which to place allocated accelerator operations
+ * @param   nb_ops	Number of accelerator operations to allocate
+ *
+ * @returns
+ *  - *nb_ops*, if the number of operations requested were allocated
+ *  - 0, if the requested number of ops are not available. None are allocated in
+ *    this case
+ */
+__rte_experimental
+uint16_t
+rte_accelerator_op_bulk_alloc(struct rte_mempool *mempool,
+			      struct rte_accelerator_op **ops,
+			      uint16_t nb_ops);
+
+/**
+ * Free accelerator operation back to it's mempool
+ *
+ * @param   op		Accelerator operation
+ */
+__rte_experimental
+void
+rte_accelerator_op_free(struct rte_accelerator_op *op);
+
+/**
+ * Enqueue a burst of operations for processing on the specified queue pair
+ * of a device for processing
+ *
+ * @param   ctx		Accelerator device instance
+ * @param   qp_id	Index of the device's queue pair to which operations
+ *			are to be enqueued for processing
+ * @param   ops		Array of *nb_ops* pointers to accelerator operations
+ *			to be processed
+ * @param   nb_ops	Number of operations to process
+ *
+ * @return
+ *  The number of operations actually enqueued on the device. A return value
+ *  equal to *nb_ops* means that all operations have been enqueued. The return
+ *  value can be less than *nb_ops* when the device's queue is full or if
+ *  invalid parameters are specified in an rte_accelerator_op
+ */
+__rte_experimental
+uint16_t
+rte_accelerator_ops_enqueue(struct rte_accelerator_ctx *ctx,
+			    uint16_t qp_id,
+			    struct rte_accelerator_op **ops,
+			    uint16_t nb_ops);
+
+/**
+ * Dequeue a burst of processed operations from a queue on the specified device.
+ * The dequeued operation are stored in rte_accelerator_op structures whose
+ * pointers are supplied in the *ops* array
+ *
+ * @param   ctx		Accelerator device instance
+ * @param   qp_id	Index of the device's queue pair from which processed
+ *			operation should be retrieved
+ * @param   ops		Array of pointers to rte_accelerator_op structures
+ *			that must be large enough to store *nb_ops* pointers in
+ *			it
+ * @param   nb_ops	Maximum number of operations to dequeue
+ *
+ * @return
+ *  The number of operations actually dequeued, which is the number of pointers
+ *  to rte_accelerator_op structures effectively supplied in the *ops* array
+ */
+__rte_experimental
+uint16_t
+rte_accelerator_ops_dequeue(struct rte_accelerator_ctx *ctx,
+			    uint16_t qp_id,
+			    struct rte_accelerator_op **ops,
+			    uint16_t nb_ops);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_ACCELERATOR_H_ */
diff --git a/lib/librte_accelerator/rte_accelerator_driver.h b/lib/librte_accelerator/rte_accelerator_driver.h
new file mode 100644
index 0000000..49cb902
--- /dev/null
+++ b/lib/librte_accelerator/rte_accelerator_driver.h
@@ -0,0 +1,146 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation.
+ */
+
+#ifndef _RTE_ACCELERATOR_DRIVER_H_
+#define _RTE_ACCELERATOR_DRIVER_H_
+
+/**
+ * @file rte_accelerator_driver.h
+ *
+ * RTE Accelerator Driver Definitions
+ *
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include "rte_accelerator.h"
+
+/**
+ * Accelerator context for a device
+ *
+ * Accelerator instance for each driver to register their accelerator
+ * operations. The application can get the accelerator context from the
+ * underlying device using the API functions provided by the device's main API
+ * library (e.g. rte_cryptodev_get_accelerator_ctx() for crypto devices,
+ * rte_compressdev_get_accelerator_ctx() for compress devices, etc.).
+ */
+struct rte_accelerator_ctx {
+	void *device;
+	/**< Pointer to the device */
+	const struct rte_accelerator_ops *ops;
+	/**< Pointer to accelerator ops for the device */
+	uint16_t sess_cnt;
+	/**< Number of sessions attached to this context */
+};
+
+/**
+ * Accelerator session data
+ */
+struct rte_accelerator_session {
+	void *sess_private_data;
+	/**< Device's private session data */
+};
+
+/**
+ * Configure private accelerator session data on a device
+ *
+ * @param   device	Device pointer
+ * @param   xform	Pointer to the first element of the session transform
+ *			chain
+ * @param   sess	Accelerator session structure
+ * @param   socket_id	Socket to allocate the session on
+ *
+ * @return
+ *  - 0, if the private session structure have been created successfully
+ *  - -EINVAL, if the input parameters are invalid
+ *  - -ENOTSUP, if the device does not support the session configuration
+ *  - -ENOMEM, if memory for the session could not be allocated
+ */
+typedef int
+(*accelerator_session_create_t)(void *device,
+				struct rte_accelerator_xform *xform,
+				struct rte_accelerator_session *sess,
+				int socket_id);
+
+/**
+ * Free private accelerator session data
+ *
+ * @param   device	Device pointer
+ * @param   sess	Accelerator session structure
+ */
+typedef void
+(*accelerator_session_destroy_t)(void *device,
+				 struct rte_accelerator_session *sess);
+
+/**
+ * Get the size of private accelerator session data
+ *
+ * @param   device	Device pointer
+ *
+ * @return
+ *  - Size of the private session structure for device, if successful
+ *  - 0, if failure
+ */
+typedef unsigned int
+(*accelerator_session_private_size_get_t)(void *device);
+
+/**
+ * Enqueue operations on queue pair of a device for processing
+ *
+ * @param   device	Device pointer
+ * @param   qp_id	Index of the device's queue pair to which operations
+ *			are to be enqueued for processing
+ * @param   ops		Array of *nb_ops* pointers to operations to be enqueued
+ * @param   nb_ops	Number of operations to enqueue
+ *
+ * @return
+ *  The number of operations actually enqueued on the device
+ */
+typedef uint16_t
+(*accelerator_ops_enqueue_t)(void *device,
+			     uint16_t qp_id,
+			     struct rte_accelerator_op **ops,
+			     uint16_t nb_ops);
+
+/**
+ * Dequeue processed operations from a queue pair of a device
+ *
+ * @param   device	Device pointer
+ * @param   qp_id	Index of the device's queue pair from which processed
+ *			operations should be dequeued
+ * @param   ops		Array of pointers to rte_accelerator_op structures
+ *			that must be large enough to store *nb_ops* pointers in
+ *			it
+ * @param   nb_ops	Maximum number of operations to dequeue
+ *
+ * @return
+ *  The number of operations actually dequeued from the device
+ */
+typedef uint16_t
+(*accelerator_ops_dequeue_t)(void *device,
+			     uint16_t qp_id,
+			     struct rte_accelerator_op **ops,
+			     uint16_t nb_ops);
+
+/** Accelerator device operations function pointer table */
+struct rte_accelerator_ops {
+	accelerator_session_create_t session_create;
+	/**< Configure an accelerator device's private session data */
+	accelerator_session_destroy_t session_destroy;
+	/**< Free an accelerator device's private session data */
+	accelerator_session_private_size_get_t session_private_size_get;
+	/**< Get the size of an accelerator device's private session data */
+	accelerator_ops_enqueue_t ops_enqueue;
+	/**< Enqueue a burst of operations to a device */
+	accelerator_ops_dequeue_t ops_dequeue;
+	/**< Dequeue a burst of operations from a device */
+};
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_ACCELERATOR_DRIVER_H_ */
diff --git a/lib/librte_accelerator/rte_err_detect.h b/lib/librte_accelerator/rte_err_detect.h
new file mode 100644
index 0000000..f54ebfb
--- /dev/null
+++ b/lib/librte_accelerator/rte_err_detect.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation.
+ */
+
+#ifndef _RTE_ERR_DETECT_H_
+#define _RTE_ERR_DETECT_H_
+
+/**
+ * @file rte_err_detect.h
+ *
+ * RTE Error Detection Definitions
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdint.h>
+
+/** Error Detection Algorithms */
+enum rte_err_detect_algorithm {
+	RTE_ERR_DETECT_CRC32_ETH,
+	/**< CRC32 Ethernet */
+	RTE_ERR_DETECT_BIP32
+	/**< BIP32 */
+};
+
+/** Error Detection Operation Types */
+enum rte_err_detect_operation {
+	RTE_ERR_DETECT_OP_VERIFY,
+	/**< Verify error detection result */
+	RTE_ERR_DETECY_OP_GENERATE
+	/**< Generate error detection result */
+};
+
+/** Error Detection Status */
+enum rte_err_detect_op_status {
+	RTE_ERR_DETECT_OP_STATUS_NOT_PROCESSED,
+	/**< Operation has not yet been processed by a device */
+	RTE_ERR_DETECT_OP_STATUS_SUCCESS,
+	/**< Operation completed successfully */
+	RTE_ERR_DETECT_OP_STATUS_VERIFY_FAILED,
+	/**< Verification failed */
+	RTE_ERR_DETECT_OP_STATUS_ERROR
+	/**< Error handling operation */
+};
+
+/**
+ * Error Detection Transform Data
+ *
+ * This structure contains data relating to an error detection transform. The
+ * fields *op* and *algo* are common to all error detection transforms and
+ * MUST be set
+ */
+struct rte_err_detect_xform {
+	enum rte_err_detect_operation op;
+	/**< Error detection operation type */
+	enum rte_err_detect_algorithm algo;
+	/**< Error detection algorithm */
+};
+
+/** Error Detection Operation */
+struct rte_err_detect_op {
+	struct rte_mbuf *m_src; /**< Source mbuf */
+
+	enum rte_err_detect_op_status status; /**< Operation status */
+
+	struct {
+		uint16_t offset;
+		/**<
+		 * Starting point for error detection processing, specified
+		 * as the number of bytes from start of the packet in the
+		 * source mbuf
+		 */
+		uint16_t length;
+		/**<
+		 * The length, in bytes, of the source mbuf on which the error
+		 * detection operation will be computed
+		 */
+	} data; /**< Data offset and length for error detection */
+
+	struct {
+		uint8_t *data;
+		/**<
+		 * This points to the location where the error detection
+		 * result should be written (in the case of generation) or
+		 * where the purported result exists (in the case of
+		 * verification)
+		 *
+		 * The caller must ensure the required length of physically
+		 * contiguous memory is available at this address
+		 *
+		 * For a CRC, this may point into the mbuf packet data. For
+		 * an operation such as a BIP, this may point to a memory
+		 * location after the op
+		 *
+		 * For generation, the result will overwrite any data at this
+		 * location
+		 */
+		rte_iova_t phys_addr;
+		/**< Physical address of output data */
+	} output; /**< Output location */
+};
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_ERR_DETECT_H_ */
diff --git a/lib/librte_compressdev/rte_compressdev.h b/lib/librte_compressdev/rte_compressdev.h
index 8052efe..9e81bb9 100644
--- a/lib/librte_compressdev/rte_compressdev.h
+++ b/lib/librte_compressdev/rte_compressdev.h
@@ -568,6 +568,21 @@ __rte_experimental
 int
 rte_compressdev_private_xform_free(uint8_t dev_id, void *private_xform);
 
+/**
+ * Get accelerator context for a device.
+ *
+ * @param dev_id
+ *   Compress device identifier
+ *
+ * @return
+ *  - Pointer to the device's accelerator context, if the device supports
+ *    the accelerator API
+ *  - NULL, otherwise
+ */
+__rte_experimental
+void *
+rte_compressdev_get_accelerator_ctx(uint8_t dev_id);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_compressdev/rte_compressdev_internal.h b/lib/librte_compressdev/rte_compressdev_internal.h
index 22ceac6..549526e 100644
--- a/lib/librte_compressdev/rte_compressdev_internal.h
+++ b/lib/librte_compressdev/rte_compressdev_internal.h
@@ -79,6 +79,9 @@ struct rte_compressdev {
 	struct rte_device *device;
 	/**< Backing device */
 
+	void *accelerator_ctx;
+        /**< Context for accelerator ops */
+
 	__extension__
 	uint8_t attached : 1;
 	/**< Flag indicating the device is attached */
diff --git a/lib/librte_cryptodev/rte_cryptodev.h b/lib/librte_cryptodev/rte_cryptodev.h
index c6ffa3b..7279f12 100644
--- a/lib/librte_cryptodev/rte_cryptodev.h
+++ b/lib/librte_cryptodev/rte_cryptodev.h
@@ -838,6 +838,9 @@ struct rte_cryptodev {
 	void *security_ctx;
 	/**< Context for security ops */
 
+	void *accelerator_ctx;
+	/**< Context for accelerator ops */
+
 	__extension__
 	uint8_t attached : 1;
 	/**< Flag indicating the device is attached */
@@ -847,6 +850,20 @@ void *
 rte_cryptodev_get_sec_ctx(uint8_t dev_id);
 
 /**
+ * Get accelerator context for a device
+ *
+ * @param	dev_id		Device id.
+ *
+ * @return
+ *  - Pointer to the device's accelerator context, if the device supports
+ *    the accelerator API
+ *  - NULL, otherwise
+ */
+__rte_experimental
+void *
+rte_cryptodev_get_accelerator_ctx(uint8_t dev_id);
+
+/**
  *
  * The data part, with no function pointers, associated with each device.
  *

--------------------------------------------------------------
Intel Research and Development Ireland Limited
Registered in Ireland
Registered Office: Collinstown Industrial Park, Leixlip, County Kildare
Registered Number: 308263


This e-mail and any attachments may contain confidential material for the sole
use of the intended recipient(s). Any review or distribution by others is
strictly prohibited. If you are not the intended recipient, please contact the
sender and delete all copies.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-04 14:45 [dpdk-dev] [RFC] Accelerator API to chain packet processing functions David Coyle
@ 2020-02-04 19:52 ` Jerin Jacob
  2020-02-06 10:04   ` Coyle, David
  0 siblings, 1 reply; 24+ messages in thread
From: Jerin Jacob @ 2020-02-04 19:52 UTC (permalink / raw)
  To: David Coyle; +Cc: dpdk-dev, Doherty, Declan, Fiona Trahe

On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com> wrote:
>
> Introduction
> ============
>
> This RFC introduces a new DPDK library, rte_accelerator.
>
> The main aim of this library is to provide a flexible and extensible way of combining one or more packet-processing functions into a single operation, thereby allowing these to be performed in parallel in optimized software libraries or in a hardware accelerator. These functions can include cryptography, compression and CRC/checksum calculation, while others can potentially be added in the future. Performing these functions in parallel as a single operation can enable a significant performance improvement.
>
>
> Background
> ==========
>
> There are a number of byte-wise operations which are present and common across many access network data-plane pipelines, such as Cipher, Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some prototyping has been done at Intel in relation to the 01.org access-network-dataplanes project to prove that a significant performance improvement is possible when such byte-wise operations are combined into a single pass of packet data processing. This performance boost has been prototyped for both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.


Could you share the relative performance numbers to show the gain?



>
> The prototypes used some protocol-specific modifications to the DPDK cryptodev library. In order to make this performance improvement consumable by network access equipment vendors, a more extensible and correct solution is required that can be upstreamed into DPDK.
>
> Hence, the introduction of rte_accelerator.
>
>
> Use Cases
> =========
>
> The primary use cases for this new library have already been mentioned. These are:
>
> - DOCSIS MAC: Crypto-CRC
>         - Order:
>                 - Downstream: CRC, Encrypt
>                 - Upstream: Decrypt, CRC
>         - Specifications:
>                 - Crypto: 128-bit AES-CFB encryption variant for DOCSIS as described in section 11.1 of DOCSIS 3.1 Security Specification (https://apps.cablelabs.com/specification/CM-SP-SECv3.1)
>                 - CRC: Ethernet 32-bit CRC as defined in Ethernet/[ISO/IEC 8802-3]
>
> - XGS-PON MAC: Crypto-CRC-BIP
>         - Order:
>                 - Downstream: CRC, Encrypt, BIP

I understand if the chain has two operations then it may possible to
have handcrafted SW code to do both operations in one pass.
I understand the spec is agnostic on a number of passes it does
require to enable the xfrom but To understand the SW/HW capability,
In the above case, "CRC, Encrypt, BIP", It is done in one pass in SW
or three passes in SW or one pass using HW?



>                 - Upstream: BIP, Decrypt, CRC
>         - Specifications:
>                 - Crypto: AES-128 [NIST FIPS-197] cipher, used in counter mode (AES-CTR), as described in [NIST SP800-38A].
>                 - CRC: Ethernet 32-bit CRC as defined in Ethernet/[ISO/IEC 8802-3]
>                 - BIP: 4-byte bit-interleaved even parity (BIP) field computed over the entire FS frame, refer to  ITU-T G.989.3, sections 8.1.1.5 and 8.1.2.3 (https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-G.989.3-201510-I!!PDF-E)
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-04 19:52 ` Jerin Jacob
@ 2020-02-06 10:04   ` Coyle, David
  2020-02-06 10:54     ` Jerin Jacob
  0 siblings, 1 reply; 24+ messages in thread
From: Coyle, David @ 2020-02-06 10:04 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dpdk-dev, Doherty, Declan, Trahe, Fiona

Hi Jerin,
Thanks for the comments. Please see replies below.

Kind Regards,
David

> On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com> wrote:
> >
> > Introduction
> > ============
> >
> > This RFC introduces a new DPDK library, rte_accelerator.
> >
> > The main aim of this library is to provide a flexible and extensible way of
> combining one or more packet-processing functions into a single operation,
> thereby allowing these to be performed in parallel in optimized software
> libraries or in a hardware accelerator. These functions can include
> cryptography, compression and CRC/checksum calculation, while others can
> potentially be added in the future. Performing these functions in parallel as a
> single operation can enable a significant performance improvement.
> >
> >
> > Background
> > ==========
> >
> > There are a number of byte-wise operations which are present and
> common across many access network data-plane pipelines, such as Cipher,
> Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some
> prototyping has been done at Intel in relation to the 01.org access-network-
> dataplanes project to prove that a significant performance improvement is
> possible when such byte-wise operations are combined into a single pass of
> packet data processing. This performance boost has been prototyped for
> both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
> 
> 
> Could you share the relative performance numbers to show the gain?

[DC] As mentioned above, the main performance gains are when the packet processing operations can be combined into a single pass of the packet.
Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have been implemented in the AESNI MB library as single pass operation chains.

We have modified the dpdk-crypto-perf-tester as part of our prototyping to test the cases where:
1) each packet processing function is done as an independent stage (e.g. calling rte_net_crc for CRC,  AESNI MB through rte_cryptodev for cipher, and a C function to calculate the BIP)
2) all packet processing functions done as a single-pass operation in AESNI MB through rte_cryptodev

We see the following results for 1024 byte input frames from dpdk-crypto-perf-tester:
	- XGS-PON MAC (Crypto-CRC-BIP):
		- 3 independent stages: 1429 cycles/buf (13.75Gbps)
		- 1 single-pass stage: 896 cycles/buf (21.9Gbps)
		37% cycle reduction

	- DOCSIS MAC (Crypto-CRC):
		- 2 independent stages: 1421 cycles/buf (13.84Gbps)
		- 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
		20% cycle reduction

Adding the accelerator API will allow vendors gain the benefits of these cycle savings

> 
> >
> > The prototypes used some protocol-specific modifications to the DPDK
> cryptodev library. In order to make this performance improvement
> consumable by network access equipment vendors, a more extensible and
> correct solution is required that can be upstreamed into DPDK.
> >
> > Hence, the introduction of rte_accelerator.
> >
> >
> > Use Cases
> > =========
> >
> > The primary use cases for this new library have already been mentioned.
> These are:
> >
> > - DOCSIS MAC: Crypto-CRC
> >         - Order:
> >                 - Downstream: CRC, Encrypt
> >                 - Upstream: Decrypt, CRC
> >         - Specifications:
> >                 - Crypto: 128-bit AES-CFB encryption variant for DOCSIS as
> described in section 11.1 of DOCSIS 3.1 Security Specification
> (https://apps.cablelabs.com/specification/CM-SP-SECv3.1)
> >                 - CRC: Ethernet 32-bit CRC as defined in
> > Ethernet/[ISO/IEC 8802-3]
> >
> > - XGS-PON MAC: Crypto-CRC-BIP
> >         - Order:
> >                 - Downstream: CRC, Encrypt, BIP
> 
> I understand if the chain has two operations then it may possible to have
> handcrafted SW code to do both operations in one pass.
> I understand the spec is agnostic on a number of passes it does require to
> enable the xfrom but To understand the SW/HW capability, In the above
> case, "CRC, Encrypt, BIP", It is done in one pass in SW or three passes in SW
> or one pass using HW?

[DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB library SW.
However, this could also be performed as a single pass in a HW accelerator

> 
> 
> 
> >                 - Upstream: BIP, Decrypt, CRC
> >         - Specifications:
> >                 - Crypto: AES-128 [NIST FIPS-197] cipher, used in counter mode
> (AES-CTR), as described in [NIST SP800-38A].
> >                 - CRC: Ethernet 32-bit CRC as defined in Ethernet/[ISO/IEC 8802-3]
> >                 - BIP: 4-byte bit-interleaved even parity (BIP) field
> > computed over the entire FS frame, refer to  ITU-T G.989.3, sections 8.1.1.5
> and 8.1.2.3 (https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-
> G.989.3-201510-I!!PDF-E)
> >
--------------------------------------------------------------
Intel Research and Development Ireland Limited
Registered in Ireland
Registered Office: Collinstown Industrial Park, Leixlip, County Kildare
Registered Number: 308263


This e-mail and any attachments may contain confidential material for the sole
use of the intended recipient(s). Any review or distribution by others is
strictly prohibited. If you are not the intended recipient, please contact the
sender and delete all copies.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-06 10:04   ` Coyle, David
@ 2020-02-06 10:54     ` Jerin Jacob
  2020-02-06 16:31       ` Coyle, David
  2020-02-13 11:31       ` Doherty, Declan
  0 siblings, 2 replies; 24+ messages in thread
From: Jerin Jacob @ 2020-02-06 10:54 UTC (permalink / raw)
  To: Coyle, David; +Cc: dpdk-dev, Doherty, Declan, Trahe, Fiona

On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.coyle@intel.com> wrote:
>
> Hi Jerin,

Hi David,

> Thanks for the comments. Please see replies below.
>
> Kind Regards,
> David
>
> > On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com> wrote:
> > >
> > > Introduction
> > > ============
> > >
> > > This RFC introduces a new DPDK library, rte_accelerator.
> > >
> > > The main aim of this library is to provide a flexible and extensible way of
> > combining one or more packet-processing functions into a single operation,
> > thereby allowing these to be performed in parallel in optimized software
> > libraries or in a hardware accelerator. These functions can include
> > cryptography, compression and CRC/checksum calculation, while others can
> > potentially be added in the future. Performing these functions in parallel as a
> > single operation can enable a significant performance improvement.
> > >
> > >
> > > Background
> > > ==========
> > >
> > > There are a number of byte-wise operations which are present and
> > common across many access network data-plane pipelines, such as Cipher,
> > Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some
> > prototyping has been done at Intel in relation to the 01.org access-network-
> > dataplanes project to prove that a significant performance improvement is
> > possible when such byte-wise operations are combined into a single pass of
> > packet data processing. This performance boost has been prototyped for
> > both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
> >
> >
> > Could you share the relative performance numbers to show the gain?
>
> [DC] As mentioned above, the main performance gains are when the packet processing operations can be combined into a single pass of the packet.
> Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have been implemented in the AESNI MB library as single pass operation chains.
>
> We have modified the dpdk-crypto-perf-tester as part of our prototyping to test the cases where:
> 1) each packet processing function is done as an independent stage (e.g. calling rte_net_crc for CRC,  AESNI MB through rte_cryptodev for cipher, and a C function to calculate the BIP)
> 2) all packet processing functions done as a single-pass operation in AESNI MB through rte_cryptodev
>
> We see the following results for 1024 byte input frames from dpdk-crypto-perf-tester:
>         - XGS-PON MAC (Crypto-CRC-BIP):
>                 - 3 independent stages: 1429 cycles/buf (13.75Gbps)
>                 - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
>                 37% cycle reduction
>
>         - DOCSIS MAC (Crypto-CRC):
>                 - 2 independent stages: 1421 cycles/buf (13.84Gbps)
>                 - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
>                 20% cycle reduction
>
> Adding the accelerator API will allow vendors gain the benefits of these cycle savings

Numbers make sense. I have seen a similar performance improvement
doing in one pass with CPU instructions.


> > > - XGS-PON MAC: Crypto-CRC-BIP
> > >         - Order:
> > >                 - Downstream: CRC, Encrypt, BIP
> >
> > I understand if the chain has two operations then it may possible to have
> > handcrafted SW code to do both operations in one pass.
> > I understand the spec is agnostic on a number of passes it does require to
> > enable the xfrom but To understand the SW/HW capability, In the above
> > case, "CRC, Encrypt, BIP", It is done in one pass in SW or three passes in SW
> > or one pass using HW?
>
> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB library SW.
> However, this could also be performed as a single pass in a HW accelerator

As a specification, cascading the xform chains make sense.
Do we have any HW that does support chaining the xforms more than
"two" in one pass?
i.e real chaining function where two blocks of HWs work hand in hand
for chaining.
If none, it may be better to abstract as synonymous API(No dequeue, no
enqueue) for the CPU use case.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-06 10:54     ` Jerin Jacob
@ 2020-02-06 16:31       ` Coyle, David
  2020-02-06 17:13         ` Jerin Jacob
  2020-02-13 11:31       ` Doherty, Declan
  1 sibling, 1 reply; 24+ messages in thread
From: Coyle, David @ 2020-02-06 16:31 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dpdk-dev, Doherty, Declan, Trahe, Fiona

Hi Jerin, see reply below

> On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.coyle@intel.com> wrote:
> >
> > Hi Jerin,
> 
> Hi David,
> 
> > Thanks for the comments. Please see replies below.
> >
> > Kind Regards,
> > David
> >
> > > On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com>
> wrote:
> > > >
> > > > Introduction
> > > > ============
> > > >
> > > > This RFC introduces a new DPDK library, rte_accelerator.
> > > >
> > > > The main aim of this library is to provide a flexible and
> > > > extensible way of
> > > combining one or more packet-processing functions into a single
> > > operation, thereby allowing these to be performed in parallel in
> > > optimized software libraries or in a hardware accelerator. These
> > > functions can include cryptography, compression and CRC/checksum
> > > calculation, while others can potentially be added in the future.
> > > Performing these functions in parallel as a single operation can enable a
> significant performance improvement.
> > > >
> > > >
> > > > Background
> > > > ==========
> > > >
> > > > There are a number of byte-wise operations which are present and
> > > common across many access network data-plane pipelines, such as
> > > Cipher, Authentication, CRC, Bit-Interleaved-Parity (BIP), other
> > > checksums etc. Some prototyping has been done at Intel in relation
> > > to the 01.org access-network- dataplanes project to prove that a
> > > significant performance improvement is possible when such byte-wise
> > > operations are combined into a single pass of packet data
> > > processing. This performance boost has been prototyped for both XGS-
> PON MAC data-plane and DOCSIS MAC data-plane pipelines.
> > >
> > >
> > > Could you share the relative performance numbers to show the gain?
> >
> > [DC] As mentioned above, the main performance gains are when the
> packet processing operations can be combined into a single pass of the
> packet.
> > Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS
> MAC) have been implemented in the AESNI MB library as single pass
> operation chains.
> >
> > We have modified the dpdk-crypto-perf-tester as part of our prototyping
> to test the cases where:
> > 1) each packet processing function is done as an independent stage
> > (e.g. calling rte_net_crc for CRC,  AESNI MB through rte_cryptodev for
> > cipher, and a C function to calculate the BIP)
> > 2) all packet processing functions done as a single-pass operation in
> > AESNI MB through rte_cryptodev
> >
> > We see the following results for 1024 byte input frames from dpdk-crypto-
> perf-tester:
> >         - XGS-PON MAC (Crypto-CRC-BIP):
> >                 - 3 independent stages: 1429 cycles/buf (13.75Gbps)
> >                 - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
> >                 37% cycle reduction
> >
> >         - DOCSIS MAC (Crypto-CRC):
> >                 - 2 independent stages: 1421 cycles/buf (13.84Gbps)
> >                 - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
> >                 20% cycle reduction
> >
> > Adding the accelerator API will allow vendors gain the benefits of
> > these cycle savings
> 
> Numbers make sense. I have seen a similar performance improvement doing
> in one pass with CPU instructions.
> 
> 
> > > > - XGS-PON MAC: Crypto-CRC-BIP
> > > >         - Order:
> > > >                 - Downstream: CRC, Encrypt, BIP
> > >
> > > I understand if the chain has two operations then it may possible to
> > > have handcrafted SW code to do both operations in one pass.
> > > I understand the spec is agnostic on a number of passes it does
> > > require to enable the xfrom but To understand the SW/HW capability,
> > > In the above case, "CRC, Encrypt, BIP", It is done in one pass in SW
> > > or three passes in SW or one pass using HW?
> >
> > [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB
> library SW.
> > However, this could also be performed as a single pass in a HW
> > accelerator
> 
> As a specification, cascading the xform chains make sense.
> Do we have any HW that does support chaining the xforms more than "two"
> in one pass?
> i.e real chaining function where two blocks of HWs work hand in hand for
> chaining.
> If none, it may be better to abstract as synonymous API(No dequeue, no
> enqueue) for the CPU use case.

[DC] I'm not aware of any HW that supports this at the moment, but that's not to say it couldn't in the future - if anyone else has any examples though, please feel free to share.
Regardless, I don't see why we would introduce a different API for SW devices and HW devices.
It would be up to each underlying PMD to decide if/how it supports a particular accelerator xform chain, but from an application's point of view, the accelerator API is always the same



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-06 16:31       ` Coyle, David
@ 2020-02-06 17:13         ` Jerin Jacob
  2020-02-07 12:38           ` Coyle, David
  2020-02-13 11:44           ` Doherty, Declan
  0 siblings, 2 replies; 24+ messages in thread
From: Jerin Jacob @ 2020-02-06 17:13 UTC (permalink / raw)
  To: Coyle, David; +Cc: dpdk-dev, Doherty, Declan, Trahe, Fiona

On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com> wrote:

Hi David,

> >
> >
> > > > > - XGS-PON MAC: Crypto-CRC-BIP
> > > > >         - Order:
> > > > >                 - Downstream: CRC, Encrypt, BIP
> > > >
> > > > I understand if the chain has two operations then it may possible to
> > > > have handcrafted SW code to do both operations in one pass.
> > > > I understand the spec is agnostic on a number of passes it does
> > > > require to enable the xfrom but To understand the SW/HW capability,
> > > > In the above case, "CRC, Encrypt, BIP", It is done in one pass in SW
> > > > or three passes in SW or one pass using HW?
> > >
> > > [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB
> > library SW.
> > > However, this could also be performed as a single pass in a HW
> > > accelerator
> >
> > As a specification, cascading the xform chains make sense.
> > Do we have any HW that does support chaining the xforms more than "two"
> > in one pass?
> > i.e real chaining function where two blocks of HWs work hand in hand for
> > chaining.
> > If none, it may be better to abstract as synonymous API(No dequeue, no
> > enqueue) for the CPU use case.
>
> [DC] I'm not aware of any HW that supports this at the moment, but that's not to say it couldn't in the future - if anyone else has any examples though, please feel free to share.
> Regardless, I don't see why we would introduce a different API for SW devices and HW devices.

There is a risk in drafting API that meant for HW without any HW
exists. Because there could be inefficiency on the metadata and fast
path API for both models.
For example, In the case of CPU based scheme, it will be pure overhead
emulate the "queue"(the enqueue and dequeue) for the sake of
abstraction where
CPU works better in the synchronous model and I have doubt that the
session-based scheme will work for HW or not as both difference  HW
needs to work hand in hand(IOMMU aspects for two PCI device)

Having said that, I agree with the need for use case and API for CPU
case. Till we find a HW spec, we need to make the solution as CPU
specific and latter extend based on HW metadata required.
Accelerator API sounds like HW accelerator and there is no HW support
then it may not good. We can change the API that works for the use
cases that we know how it works efficiently.







> It would be up to each underlying PMD to decide if/how it supports a particular accelerator xform chain, but from an application's point of view, the accelerator API is always the same
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-06 17:13         ` Jerin Jacob
@ 2020-02-07 12:38           ` Coyle, David
  2020-02-07 14:18             ` Jerin Jacob
  2020-02-13 11:44           ` Doherty, Declan
  1 sibling, 1 reply; 24+ messages in thread
From: Coyle, David @ 2020-02-07 12:38 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dpdk-dev, Doherty, Declan, Trahe, Fiona

Hi Jerin, see below

> 
> On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
> wrote:
> 
> Hi David,
> 
> > >
> > >
> > > > > > - XGS-PON MAC: Crypto-CRC-BIP
> > > > > >         - Order:
> > > > > >                 - Downstream: CRC, Encrypt, BIP
> > > > >
> > > > > I understand if the chain has two operations then it may
> > > > > possible to have handcrafted SW code to do both operations in one
> pass.
> > > > > I understand the spec is agnostic on a number of passes it does
> > > > > require to enable the xfrom but To understand the SW/HW
> > > > > capability, In the above case, "CRC, Encrypt, BIP", It is done
> > > > > in one pass in SW or three passes in SW or one pass using HW?
> > > >
> > > > [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in
> > > > AESNI MB
> > > library SW.
> > > > However, this could also be performed as a single pass in a HW
> > > > accelerator
> > >
> > > As a specification, cascading the xform chains make sense.
> > > Do we have any HW that does support chaining the xforms more than
> "two"
> > > in one pass?
> > > i.e real chaining function where two blocks of HWs work hand in hand
> > > for chaining.
> > > If none, it may be better to abstract as synonymous API(No dequeue,
> > > no
> > > enqueue) for the CPU use case.
> >
> > [DC] I'm not aware of any HW that supports this at the moment, but that's
> not to say it couldn't in the future - if anyone else has any examples though,
> please feel free to share.
> > Regardless, I don't see why we would introduce a different API for SW
> devices and HW devices.
> 
> There is a risk in drafting API that meant for HW without any HW exists.
> Because there could be inefficiency on the metadata and fast path API for
> both models.
> For example, In the case of CPU based scheme, it will be pure overhead
> emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
> where CPU works better in the synchronous model and I have doubt that the
> session-based scheme will work for HW or not as both difference  HW needs
> to work hand in hand(IOMMU aspects for two PCI device)

[DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.

From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability -  the application doesn't need to know
the device type as such and therefore doesn't need to make different API calls. 

The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.

We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.

Finally, I'm unsure what you mean by this line:

	"I have doubt that the session-based scheme will work for HW or not as both difference  HW needs to work hand in hand(IOMMU aspects for two PCI device)"

What do mean by different HW working "hand in hand" and "two PCI device"?
The intention is that 1 HW device (or it's PMD) would have to support the accel xform chain

> 
> Having said that, I agree with the need for use case and API for CPU case. Till
> we find a HW spec, we need to make the solution as CPU specific and latter
> extend based on HW metadata required.
> Accelerator API sounds like HW accelerator and there is no HW support then
> it may not good. We can change the API that works for the use cases that we
> know how it works efficiently.
> 
> 
> 
> 
> 
> 
> 
> > It would be up to each underlying PMD to decide if/how it supports a
> > particular accelerator xform chain, but from an application's point of
> > view, the accelerator API is always the same
> >
> >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-07 12:38           ` Coyle, David
@ 2020-02-07 14:18             ` Jerin Jacob
  2020-02-07 20:34               ` Stephen Hemminger
  2020-02-13 11:50               ` Doherty, Declan
  0 siblings, 2 replies; 24+ messages in thread
From: Jerin Jacob @ 2020-02-07 14:18 UTC (permalink / raw)
  To: Coyle, David; +Cc: dpdk-dev, Doherty, Declan, Trahe, Fiona

On Fri, Feb 7, 2020 at 6:08 PM Coyle, David <david.coyle@intel.com> wrote:
>
> Hi Jerin, see below

Hi David,

> >
> > On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
> > wrote:
> >

> >
> > There is a risk in drafting API that meant for HW without any HW exists.
> > Because there could be inefficiency on the metadata and fast path API for
> > both models.
> > For example, In the case of CPU based scheme, it will be pure overhead
> > emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
> > where CPU works better in the synchronous model and I have doubt that the
> > session-based scheme will work for HW or not as both difference  HW needs
> > to work hand in hand(IOMMU aspects for two PCI device)
>
> [DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
> In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
> The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.

In this release, we added CPU crypto support in cryptodev to support
the synchronous model to fix the overhead.

>
> From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability -  the application doesn't need to know
> the device type as such and therefore doesn't need to make different API calls.
>
> The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
> As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
> is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
> pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.
>
> We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.
>
> Finally, I'm unsure what you mean by this line:
>
>         "I have doubt that the session-based scheme will work for HW or not as both difference  HW needs to work hand in hand(IOMMU aspects for two PCI device)"
>
> What do mean by different HW working "hand in hand" and "two PCI device"?
> The intention is that 1 HW device (or it's PMD) would have to support the accel xform chain

I was thinking, it will be N PCIe devices that create the chain. Each
distinct PCI device does the fixed-function and chains them together.

I do understand the usage of QAT HW and CRC in SW.
So If I understand it correctly, in rte_security, we are combining
rte_ethdev and rte_cryptodev. With this spec, we are trying to
combine,
rte_cryptodev and rte_compressdev. So it looks good to me. My only
remaining concern is the name of this API, accelerator too generic
name. IMO, like rte_security, we may need to give more meaningful name
for the use case where crytodev and compressdev can work together.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-07 14:18             ` Jerin Jacob
@ 2020-02-07 20:34               ` Stephen Hemminger
  2020-02-08  7:22                 ` Jerin Jacob
  2020-02-13 11:50               ` Doherty, Declan
  1 sibling, 1 reply; 24+ messages in thread
From: Stephen Hemminger @ 2020-02-07 20:34 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Coyle, David, dpdk-dev, Doherty, Declan, Trahe, Fiona

On Fri, 7 Feb 2020 19:48:17 +0530
Jerin Jacob <jerinjacobk@gmail.com> wrote:

> On Fri, Feb 7, 2020 at 6:08 PM Coyle, David <david.coyle@intel.com> wrote:
> >
> > Hi Jerin, see below  
> 
> Hi David,
> 
> > >
> > > On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
> > > wrote:
> > >  
> 
> > >
> > > There is a risk in drafting API that meant for HW without any HW exists.
> > > Because there could be inefficiency on the metadata and fast path API for
> > > both models.
> > > For example, In the case of CPU based scheme, it will be pure overhead
> > > emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
> > > where CPU works better in the synchronous model and I have doubt that the
> > > session-based scheme will work for HW or not as both difference  HW needs
> > > to work hand in hand(IOMMU aspects for two PCI device)  
> >
> > [DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
> > In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
> > The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.  
> 
> In this release, we added CPU crypto support in cryptodev to support
> the synchronous model to fix the overhead.
> 
> >
> > From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability -  the application doesn't need to know
> > the device type as such and therefore doesn't need to make different API calls.
> >
> > The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
> > As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
> > is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
> > pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.
> >
> > We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.
> >
> > Finally, I'm unsure what you mean by this line:
> >
> >         "I have doubt that the session-based scheme will work for HW or not as both difference  HW needs to work hand in hand(IOMMU aspects for two PCI device)"
> >
> > What do mean by different HW working "hand in hand" and "two PCI device"?
> > The intention is that 1 HW device (or it's PMD) would have to support the accel xform chain  
> 
> I was thinking, it will be N PCIe devices that create the chain. Each
> distinct PCI device does the fixed-function and chains them together.
> 
> I do understand the usage of QAT HW and CRC in SW.
> So If I understand it correctly, in rte_security, we are combining
> rte_ethdev and rte_cryptodev. With this spec, we are trying to
> combine,
> rte_cryptodev and rte_compressdev. So it looks good to me. My only
> remaining concern is the name of this API, accelerator too generic
> name. IMO, like rte_security, we may need to give more meaningful name
> for the use case where crytodev and compressdev can work together.

Having an API that could be used by parallel hardware does make sense,
but the DPDK already has multiple packet processing infrastructure pieces.

I would rather the DPDK converge on one widely used, robust and tested packet
method. Rather than the current "choose your poison or roll your own" which is
what we have now. The proposed graph seems to be the best so far.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-07 20:34               ` Stephen Hemminger
@ 2020-02-08  7:22                 ` Jerin Jacob
  2020-03-05 17:01                   ` Coyle, David
  0 siblings, 1 reply; 24+ messages in thread
From: Jerin Jacob @ 2020-02-08  7:22 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Coyle, David, dpdk-dev, Doherty, Declan, Trahe, Fiona, Prasun Kapoor

On Sat, Feb 8, 2020 at 2:04 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Fri, 7 Feb 2020 19:48:17 +0530
> Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
> > On Fri, Feb 7, 2020 at 6:08 PM Coyle, David <david.coyle@intel.com> wrote:
> > >
> > > Hi Jerin, see below
> >
> > Hi David,
> >
> > > >
> > > > On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
> > > > wrote:
> > > >
> >
> > > >
> > > > There is a risk in drafting API that meant for HW without any HW exists.
> > > > Because there could be inefficiency on the metadata and fast path API for
> > > > both models.
> > > > For example, In the case of CPU based scheme, it will be pure overhead
> > > > emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
> > > > where CPU works better in the synchronous model and I have doubt that the
> > > > session-based scheme will work for HW or not as both difference  HW needs
> > > > to work hand in hand(IOMMU aspects for two PCI device)
> > >
> > > [DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
> > > In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
> > > The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.
> >
> > In this release, we added CPU crypto support in cryptodev to support
> > the synchronous model to fix the overhead.
> >
> > >
> > > From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability -  the application doesn't need to know
> > > the device type as such and therefore doesn't need to make different API calls.
> > >
> > > The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
> > > As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
> > > is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
> > > pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.
> > >
> > > We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.
> > >
> > > Finally, I'm unsure what you mean by this line:
> > >
> > >         "I have doubt that the session-based scheme will work for HW or not as both difference  HW needs to work hand in hand(IOMMU aspects for two PCI device)"
> > >
> > > What do mean by different HW working "hand in hand" and "two PCI device"?
> > > The intention is that 1 HW device (or it's PMD) would have to support the accel xform chain
> >
> > I was thinking, it will be N PCIe devices that create the chain. Each
> > distinct PCI device does the fixed-function and chains them together.
> >
> > I do understand the usage of QAT HW and CRC in SW.
> > So If I understand it correctly, in rte_security, we are combining
> > rte_ethdev and rte_cryptodev. With this spec, we are trying to
> > combine,
> > rte_cryptodev and rte_compressdev. So it looks good to me. My only
> > remaining concern is the name of this API, accelerator too generic
> > name. IMO, like rte_security, we may need to give more meaningful name
> > for the use case where crytodev and compressdev can work together.
>
> Having an API that could be used by parallel hardware does make sense,
> but the DPDK already has multiple packet processing infrastructure pieces.
>
> I would rather the DPDK converge on one widely used, robust and tested packet
> method. Rather than the current "choose your poison or roll your own" which is
> what we have now. The proposed graph seems to be the best so far.

I agree. Even I thought of saying graph can do this, as, it has higher
abstraction and runtime chaining support, but then I thought it will
be self markering.
David could you check https://www.mail-archive.com/dev@dpdk.org/msg156318.html
If this one only focusing crypto dev + compressdev, What if we have
ethdev + compressdev + security device in the future.
graph has higher abstraction so it can accommodate ANY chaining
requirements. i.e AESNI-MB + QAT will go as a separate node

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-06 10:54     ` Jerin Jacob
  2020-02-06 16:31       ` Coyle, David
@ 2020-02-13 11:31       ` Doherty, Declan
  2020-02-18  5:12         ` Jerin Jacob
  1 sibling, 1 reply; 24+ messages in thread
From: Doherty, Declan @ 2020-02-13 11:31 UTC (permalink / raw)
  To: Jerin Jacob, Coyle, David; +Cc: dpdk-dev, Trahe, Fiona

On 06/02/2020 10:54 AM, Jerin Jacob wrote:
> On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.coyle@intel.com> wrote:
>>
>> Hi Jerin,
> 
> Hi David,
> 
>> Thanks for the comments. Please see replies below.
>>
>> Kind Regards,
>> David
>>
>>> On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com> wrote:
>>>>
>>>> Introduction
>>>> ============
>>>>
>>>> This RFC introduces a new DPDK library, rte_accelerator.
>>>>
>>>> The main aim of this library is to provide a flexible and extensible way of
>>> combining one or more packet-processing functions into a single operation,
>>> thereby allowing these to be performed in parallel in optimized software
>>> libraries or in a hardware accelerator. These functions can include
>>> cryptography, compression and CRC/checksum calculation, while others can
>>> potentially be added in the future. Performing these functions in parallel as a
>>> single operation can enable a significant performance improvement.
>>>>
>>>>
>>>> Background
>>>> ==========
>>>>
>>>> There are a number of byte-wise operations which are present and
>>> common across many access network data-plane pipelines, such as Cipher,
>>> Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some
>>> prototyping has been done at Intel in relation to the 01.org access-network-
>>> dataplanes project to prove that a significant performance improvement is
>>> possible when such byte-wise operations are combined into a single pass of
>>> packet data processing. This performance boost has been prototyped for
>>> both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
>>>
>>>
>>> Could you share the relative performance numbers to show the gain?
>>
>> [DC] As mentioned above, the main performance gains are when the packet processing operations can be combined into a single pass of the packet.
>> Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have been implemented in the AESNI MB library as single pass operation chains.
>>
>> We have modified the dpdk-crypto-perf-tester as part of our prototyping to test the cases where:
>> 1) each packet processing function is done as an independent stage (e.g. calling rte_net_crc for CRC,  AESNI MB through rte_cryptodev for cipher, and a C function to calculate the BIP)
>> 2) all packet processing functions done as a single-pass operation in AESNI MB through rte_cryptodev
>>
>> We see the following results for 1024 byte input frames from dpdk-crypto-perf-tester:
>>          - XGS-PON MAC (Crypto-CRC-BIP):
>>                  - 3 independent stages: 1429 cycles/buf (13.75Gbps)
>>                  - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
>>                  37% cycle reduction
>>
>>          - DOCSIS MAC (Crypto-CRC):
>>                  - 2 independent stages: 1421 cycles/buf (13.84Gbps)
>>                  - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
>>                  20% cycle reduction
>>
>> Adding the accelerator API will allow vendors gain the benefits of these cycle savings
> 
> Numbers make sense. I have seen a similar performance improvement
> doing in one pass with CPU instructions.
> 
> 
>>>> - XGS-PON MAC: Crypto-CRC-BIP
>>>>          - Order:
>>>>                  - Downstream: CRC, Encrypt, BIP
>>>
>>> I understand if the chain has two operations then it may possible to have
>>> handcrafted SW code to do both operations in one pass.
>>> I understand the spec is agnostic on a number of passes it does require to
>>> enable the xfrom but To understand the SW/HW capability, In the above
>>> case, "CRC, Encrypt, BIP", It is done in one pass in SW or three passes in SW
>>> or one pass using HW?
>>
>> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB library SW.
>> However, this could also be performed as a single pass in a HW accelerator
> 
> As a specification, cascading the xform chains make sense.
> Do we have any HW that does support chaining the xforms more than
> "two" in one pass?
> i.e real chaining function where two blocks of HWs work hand in hand
> for chaining.
> If none, it may be better to abstract as synonymous API(No dequeue, no
> enqueue) for the CPU use case.
> 

Where you thinking along the lines of a synchronous API option like that 
just introduced to crytodev? i.e something like

uint16_t rte_accelerator_process(struct rte_accelerator_ctx *ctx,
				 struct rte_accelerator_op ops[],
				 uint16_t nb_ops);



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-06 17:13         ` Jerin Jacob
  2020-02-07 12:38           ` Coyle, David
@ 2020-02-13 11:44           ` Doherty, Declan
  2020-02-18  5:30             ` Jerin Jacob
  1 sibling, 1 reply; 24+ messages in thread
From: Doherty, Declan @ 2020-02-13 11:44 UTC (permalink / raw)
  To: Jerin Jacob, Coyle, David; +Cc: dpdk-dev, Trahe, Fiona

On 06/02/2020 5:13 PM, Jerin Jacob wrote:
> On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com> wrote:
> 
> Hi David,
> 
>>>
>>>
>>>>>> - XGS-PON MAC: Crypto-CRC-BIP
>>>>>>          - Order:
>>>>>>                  - Downstream: CRC, Encrypt, BIP
>>>>>
>>>>> I understand if the chain has two operations then it may possible to
>>>>> have handcrafted SW code to do both operations in one pass.
>>>>> I understand the spec is agnostic on a number of passes it does
>>>>> require to enable the xfrom but To understand the SW/HW capability,
>>>>> In the above case, "CRC, Encrypt, BIP", It is done in one pass in SW
>>>>> or three passes in SW or one pass using HW?
>>>>
>>>> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB
>>> library SW.
>>>> However, this could also be performed as a single pass in a HW
>>>> accelerator
>>>
>>> As a specification, cascading the xform chains make sense.
>>> Do we have any HW that does support chaining the xforms more than "two"
>>> in one pass?
>>> i.e real chaining function where two blocks of HWs work hand in hand for
>>> chaining.
>>> If none, it may be better to abstract as synonymous API(No dequeue, no
>>> enqueue) for the CPU use case.
>>
>> [DC] I'm not aware of any HW that supports this at the moment, but that's not to say it couldn't in the future - if anyone else has any examples though, please feel free to share.
>> Regardless, I don't see why we would introduce a different API for SW devices and HW devices.
> 
> There is a risk in drafting API that meant for HW without any HW
> exists. Because there could be inefficiency on the metadata and fast
> path API for both models.
> For example, In the case of CPU based scheme, it will be pure overhead
> emulate the "queue"(the enqueue and dequeue) for the sake of
> abstraction where
> CPU works better in the synchronous model and I have doubt that the
> session-based scheme will work for HW or not as both difference  HW
> needs to work hand in hand(IOMMU aspects for two PCI device)

We do have some proto-types in hardware which can do operation chaining 
but in the case we have looked at, it is a single accelerator device 
with multi-function which means the orchestration (order, passing of 
data etc) of the chained operations is handled within the device itself, 
meaning that we didn't see issues with shared session data or handling 
moving data along discrete independent stage of a hardware pipeline 
wasn't an issue.

Although if you wanted to offer this type of chained offload, I think we 
would need the driver to handle this for the user, rather than the 
application needing to understand how the hardware pipeline is interacting.

> 
> Having said that, I agree with the need for use case and API for CPU
> case. Till we find a HW spec, we need to make the solution as CPU
> specific and latter extend based on HW metadata required.
> Accelerator API sounds like HW accelerator and there is no HW support
> then it may not good. We can change the API that works for the use
> cases that we know how it works efficiently.
> 
> 
> 
> 
> 
> 
> 
>> It would be up to each underlying PMD to decide if/how it supports a particular accelerator xform chain, but from an application's point of view, the accelerator API is always the same
>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-07 14:18             ` Jerin Jacob
  2020-02-07 20:34               ` Stephen Hemminger
@ 2020-02-13 11:50               ` Doherty, Declan
  2020-02-18  5:15                 ` Jerin Jacob
  1 sibling, 1 reply; 24+ messages in thread
From: Doherty, Declan @ 2020-02-13 11:50 UTC (permalink / raw)
  To: Jerin Jacob, Coyle, David; +Cc: dpdk-dev, Trahe, Fiona

On 07/02/2020 2:18 PM, Jerin Jacob wrote:
> On Fri, Feb 7, 2020 at 6:08 PM Coyle, David <david.coyle@intel.com> wrote:
>>
>> Hi Jerin, see below
> 
> Hi David,
> 
>>>
>>> On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
>>> wrote:
>>>
> 
>>>
>>> There is a risk in drafting API that meant for HW without any HW exists.
>>> Because there could be inefficiency on the metadata and fast path API for
>>> both models.
>>> For example, In the case of CPU based scheme, it will be pure overhead
>>> emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
>>> where CPU works better in the synchronous model and I have doubt that the
>>> session-based scheme will work for HW or not as both difference  HW needs
>>> to work hand in hand(IOMMU aspects for two PCI device)
>>
>> [DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
>> In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
>> The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.
> 
> In this release, we added CPU crypto support in cryptodev to support
> the synchronous model to fix the overhead.
> 
>>
>>  From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability -  the application doesn't need to know
>> the device type as such and therefore doesn't need to make different API calls.
>>
>> The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
>> As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
>> is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
>> pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.
>>
>> We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.
>>
>> Finally, I'm unsure what you mean by this line:
>>
>>          "I have doubt that the session-based scheme will work for HW or not as both difference  HW needs to work hand in hand(IOMMU aspects for two PCI device)"
>>
>> What do mean by different HW working "hand in hand" and "two PCI device"?
>> The intention is that 1 HW device (or it's PMD) would have to support the accel xform chain
> 
> I was thinking, it will be N PCIe devices that create the chain. Each
> distinct PCI device does the fixed-function and chains them together.
> 

The case we were looking at is more focused on a single  discrete 
(multi-function) device (from the perspective of the host) providing a 
number of transforms (operations) in a single pass rather than the case 
of N discrete hardware devices (from the perspective of the host) 
chained together to achieve the same transforms set.


> I do understand the usage of QAT HW and CRC in SW.
> So If I understand it correctly, in rte_security, we are combining
> rte_ethdev and rte_cryptodev. With this spec, we are trying to
> combine,
> rte_cryptodev and rte_compressdev. So it looks good to me. My only
> remaining concern is the name of this API, accelerator too generic
> name. IMO, like rte_security, we may need to give more meaningful name
> for the use case where crytodev and compressdev can work together.
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-13 11:31       ` Doherty, Declan
@ 2020-02-18  5:12         ` Jerin Jacob
  0 siblings, 0 replies; 24+ messages in thread
From: Jerin Jacob @ 2020-02-18  5:12 UTC (permalink / raw)
  To: Doherty, Declan; +Cc: Coyle, David, dpdk-dev, Trahe, Fiona

On Thu, Feb 13, 2020 at 5:01 PM Doherty, Declan
<declan.doherty@intel.com> wrote:
>
> On 06/02/2020 10:54 AM, Jerin Jacob wrote:
> > On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.coyle@intel.com> wrote:
> >>
> >> Hi Jerin,
> >
> > Hi David,
> >
> >> Thanks for the comments. Please see replies below.
> >>
> >> Kind Regards,
> >> David
> >>
> >>> On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com> wrote:
> >>>>
> >>>> Introduction
> >>>> ============
> >>>>
> >>>> This RFC introduces a new DPDK library, rte_accelerator.
> >>>>
> >>>> The main aim of this library is to provide a flexible and extensible way of
> >>> combining one or more packet-processing functions into a single operation,
> >>> thereby allowing these to be performed in parallel in optimized software
> >>> libraries or in a hardware accelerator. These functions can include
> >>> cryptography, compression and CRC/checksum calculation, while others can
> >>> potentially be added in the future. Performing these functions in parallel as a
> >>> single operation can enable a significant performance improvement.
> >>>>
> >>>>
> >>>> Background
> >>>> ==========
> >>>>
> >>>> There are a number of byte-wise operations which are present and
> >>> common across many access network data-plane pipelines, such as Cipher,
> >>> Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some
> >>> prototyping has been done at Intel in relation to the 01.org access-network-
> >>> dataplanes project to prove that a significant performance improvement is
> >>> possible when such byte-wise operations are combined into a single pass of
> >>> packet data processing. This performance boost has been prototyped for
> >>> both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
> >>>
> >>>
> >>> Could you share the relative performance numbers to show the gain?
> >>
> >> [DC] As mentioned above, the main performance gains are when the packet processing operations can be combined into a single pass of the packet.
> >> Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have been implemented in the AESNI MB library as single pass operation chains.
> >>
> >> We have modified the dpdk-crypto-perf-tester as part of our prototyping to test the cases where:
> >> 1) each packet processing function is done as an independent stage (e.g. calling rte_net_crc for CRC,  AESNI MB through rte_cryptodev for cipher, and a C function to calculate the BIP)
> >> 2) all packet processing functions done as a single-pass operation in AESNI MB through rte_cryptodev
> >>
> >> We see the following results for 1024 byte input frames from dpdk-crypto-perf-tester:
> >>          - XGS-PON MAC (Crypto-CRC-BIP):
> >>                  - 3 independent stages: 1429 cycles/buf (13.75Gbps)
> >>                  - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
> >>                  37% cycle reduction
> >>
> >>          - DOCSIS MAC (Crypto-CRC):
> >>                  - 2 independent stages: 1421 cycles/buf (13.84Gbps)
> >>                  - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
> >>                  20% cycle reduction
> >>
> >> Adding the accelerator API will allow vendors gain the benefits of these cycle savings
> >
> > Numbers make sense. I have seen a similar performance improvement
> > doing in one pass with CPU instructions.
> >
> >
> >>>> - XGS-PON MAC: Crypto-CRC-BIP
> >>>>          - Order:
> >>>>                  - Downstream: CRC, Encrypt, BIP
> >>>
> >>> I understand if the chain has two operations then it may possible to have
> >>> handcrafted SW code to do both operations in one pass.
> >>> I understand the spec is agnostic on a number of passes it does require to
> >>> enable the xfrom but To understand the SW/HW capability, In the above
> >>> case, "CRC, Encrypt, BIP", It is done in one pass in SW or three passes in SW
> >>> or one pass using HW?
> >>
> >> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB library SW.
> >> However, this could also be performed as a single pass in a HW accelerator
> >
> > As a specification, cascading the xform chains make sense.
> > Do we have any HW that does support chaining the xforms more than
> > "two" in one pass?
> > i.e real chaining function where two blocks of HWs work hand in hand
> > for chaining.
> > If none, it may be better to abstract as synonymous API(No dequeue, no
> > enqueue) for the CPU use case.
> >
>
> Where you thinking along the lines of a synchronous API option like that
> just introduced to crytodev? i.e something like
>
> uint16_t rte_accelerator_process(struct rte_accelerator_ctx *ctx,
>                                  struct rte_accelerator_op ops[],
>                                  uint16_t nb_ops);

Yes. May be with capability or preference to denote application for
the preferred usage model.

>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-13 11:50               ` Doherty, Declan
@ 2020-02-18  5:15                 ` Jerin Jacob
  0 siblings, 0 replies; 24+ messages in thread
From: Jerin Jacob @ 2020-02-18  5:15 UTC (permalink / raw)
  To: Doherty, Declan; +Cc: Coyle, David, dpdk-dev, Trahe, Fiona

On Thu, Feb 13, 2020 at 5:20 PM Doherty, Declan
<declan.doherty@intel.com> wrote:
>
> On 07/02/2020 2:18 PM, Jerin Jacob wrote:
> > On Fri, Feb 7, 2020 at 6:08 PM Coyle, David <david.coyle@intel.com> wrote:
> >>
> >> Hi Jerin, see below
> >
> > Hi David,
> >
> >>>
> >>> On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
> >>> wrote:
> >>>
> >
> >>>
> >>> There is a risk in drafting API that meant for HW without any HW exists.
> >>> Because there could be inefficiency on the metadata and fast path API for
> >>> both models.
> >>> For example, In the case of CPU based scheme, it will be pure overhead
> >>> emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
> >>> where CPU works better in the synchronous model and I have doubt that the
> >>> session-based scheme will work for HW or not as both difference  HW needs
> >>> to work hand in hand(IOMMU aspects for two PCI device)
> >>
> >> [DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
> >> In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
> >> The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.
> >
> > In this release, we added CPU crypto support in cryptodev to support
> > the synchronous model to fix the overhead.
> >
> >>
> >>  From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability -  the application doesn't need to know
> >> the device type as such and therefore doesn't need to make different API calls.
> >>
> >> The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
> >> As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
> >> is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
> >> pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.
> >>
> >> We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.
> >>
> >> Finally, I'm unsure what you mean by this line:
> >>
> >>          "I have doubt that the session-based scheme will work for HW or not as both difference  HW needs to work hand in hand(IOMMU aspects for two PCI device)"
> >>
> >> What do mean by different HW working "hand in hand" and "two PCI device"?
> >> The intention is that 1 HW device (or it's PMD) would have to support the accel xform chain
> >
> > I was thinking, it will be N PCIe devices that create the chain. Each
> > distinct PCI device does the fixed-function and chains them together.
> >
>
> The case we were looking at is more focused on a single  discrete
> (multi-function) device (from the perspective of the host) providing a
> number of transforms (operations) in a single pass rather than the case
> of N discrete hardware devices (from the perspective of the host)
> chained together to achieve the same transforms set.

Yes. For multifunction PCI device, it will be easy as it has a single
DMA stream aka single IOMMU context.
DMA between Two discrete hardware devices may need additional semantics.



>
>
> > I do understand the usage of QAT HW and CRC in SW.
> > So If I understand it correctly, in rte_security, we are combining
> > rte_ethdev and rte_cryptodev. With this spec, we are trying to
> > combine,
> > rte_cryptodev and rte_compressdev. So it looks good to me. My only
> > remaining concern is the name of this API, accelerator too generic
> > name. IMO, like rte_security, we may need to give more meaningful name
> > for the use case where crytodev and compressdev can work together.
> >
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-13 11:44           ` Doherty, Declan
@ 2020-02-18  5:30             ` Jerin Jacob
  0 siblings, 0 replies; 24+ messages in thread
From: Jerin Jacob @ 2020-02-18  5:30 UTC (permalink / raw)
  To: Doherty, Declan; +Cc: Coyle, David, dpdk-dev, Trahe, Fiona

On Thu, Feb 13, 2020 at 5:14 PM Doherty, Declan
<declan.doherty@intel.com> wrote:
>
> On 06/02/2020 5:13 PM, Jerin Jacob wrote:
> > On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com> wrote:
> >
> > Hi David,
> >
> >>>
> >>>
> >>>>>> - XGS-PON MAC: Crypto-CRC-BIP
> >>>>>>          - Order:
> >>>>>>                  - Downstream: CRC, Encrypt, BIP
> >>>>>
> >>>>> I understand if the chain has two operations then it may possible to
> >>>>> have handcrafted SW code to do both operations in one pass.
> >>>>> I understand the spec is agnostic on a number of passes it does
> >>>>> require to enable the xfrom but To understand the SW/HW capability,
> >>>>> In the above case, "CRC, Encrypt, BIP", It is done in one pass in SW
> >>>>> or three passes in SW or one pass using HW?
> >>>>
> >>>> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB
> >>> library SW.
> >>>> However, this could also be performed as a single pass in a HW
> >>>> accelerator
> >>>
> >>> As a specification, cascading the xform chains make sense.
> >>> Do we have any HW that does support chaining the xforms more than "two"
> >>> in one pass?
> >>> i.e real chaining function where two blocks of HWs work hand in hand for
> >>> chaining.
> >>> If none, it may be better to abstract as synonymous API(No dequeue, no
> >>> enqueue) for the CPU use case.
> >>
> >> [DC] I'm not aware of any HW that supports this at the moment, but that's not to say it couldn't in the future - if anyone else has any examples though, please feel free to share.
> >> Regardless, I don't see why we would introduce a different API for SW devices and HW devices.
> >
> > There is a risk in drafting API that meant for HW without any HW
> > exists. Because there could be inefficiency on the metadata and fast
> > path API for both models.
> > For example, In the case of CPU based scheme, it will be pure overhead
> > emulate the "queue"(the enqueue and dequeue) for the sake of
> > abstraction where
> > CPU works better in the synchronous model and I have doubt that the
> > session-based scheme will work for HW or not as both difference  HW
> > needs to work hand in hand(IOMMU aspects for two PCI device)
>
> We do have some proto-types in hardware which can do operation chaining
> but in the case we have looked at, it is a single accelerator device
> with multi-function which means the orchestration (order, passing of
> data etc) of the chained operations is handled within the device itself,
> meaning that we didn't see issues with shared session data or handling
> moving data along discrete independent stage of a hardware pipeline
> wasn't an issue.
>
> Although if you wanted to offer this type of chained offload, I think we
> would need the driver to handle this for the user, rather than the
> application needing to understand how the hardware pipeline is interacting.

Yes. The application should not understand the specifics.

The only question how to make this generic so that any hardware/SW
pipeline can work.

Currently, we have rte_security, which works on ethdev and cryptodev.
This  new spec
is going to work on rte_cryptodev and rte_compressdev. If so, we need another
pipeline which needs to work with rte_cryptodev, rte_compressdev and ethdev then
we need to invent a new library.

I agree with the need for the hardware/SW pipeline.

As Stephen suggested, Why not look for general abstraction for HW/SW
based pipeline.
Marvell had a similar problem in abstracting various HW/SW pipeline,
Here is a proposal
for a generic HW/SW pipeline.

http://mails.dpdk.org/archives/dev/2020-January/156765.html

If the focus only for a specific case, say "CRC + something else",
better to have API for that and
better to not call the accelerator for the packet processing pipeline
as it has a big scope.

Just my 2c.

>
> >
> > Having said that, I agree with the need for use case and API for CPU
> > case. Till we find a HW spec, we need to make the solution as CPU
> > specific and latter extend based on HW metadata required.
> > Accelerator API sounds like HW accelerator and there is no HW support
> > then it may not good. We can change the API that works for the use
> > cases that we know how it works efficiently.
> >
> >
> >
> >
> >
> >
> >
> >> It would be up to each underlying PMD to decide if/how it supports a particular accelerator xform chain, but from an application's point of view, the accelerator API is always the same
> >>
> >>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-02-08  7:22                 ` Jerin Jacob
@ 2020-03-05 17:01                   ` Coyle, David
  2020-03-06  8:43                     ` Jerin Jacob
  0 siblings, 1 reply; 24+ messages in thread
From: Coyle, David @ 2020-03-05 17:01 UTC (permalink / raw)
  To: Jerin Jacob, Stephen Hemminger
  Cc: dpdk-dev, Doherty, Declan, Trahe, Fiona, Prasun Kapoor

> >
> > Having an API that could be used by parallel hardware does make sense,
> > but the DPDK already has multiple packet processing infrastructure pieces.
> >
> > I would rather the DPDK converge on one widely used, robust and tested
> > packet method. Rather than the current "choose your poison or roll
> > your own" which is what we have now. The proposed graph seems to be
> the best so far.
> 
> I agree. Even I thought of saying graph can do this, as, it has higher
> abstraction and runtime chaining support, but then I thought it will be self
> markering.
> David could you check https://www.mail-
> archive.com/dev@dpdk.org/msg156318.html
> If this one only focusing crypto dev + compressdev, What if we have ethdev
> + compressdev + security device in the future.
> graph has higher abstraction so it can accommodate ANY chaining
> requirements. i.e AESNI-MB + QAT will go as a separate node

[DC] We have looked at the graph node library and we don't feel that using graph is the correct solution for what we are trying to solve here.
We want to combine 2 or more packet processing functions on a packet into a single operation on a single device, be that an optimized software library such as AESNI MB or a hardware accelerator such as QAT
So yes, these 2 packet processing functions could be a node (or nodes) within a graph.
However they would still need to be combined together at some point to be processed on the device as a single operation.

Our new proposal is to use rte_rawdev to access the devices and we propose to add a "multi-function" interface which the application and rawdev PMDs will use to create the xform chains, sessions and op chains.
The full details on this new proposal have been sent to you in a separate post and we feel it addresses the concerns of the original rte_accelerator API

In the future, rawdev enqueue/dequeue calls using this multi-function interface could potentially be configured as a node within a packet processing graph


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-03-05 17:01                   ` Coyle, David
@ 2020-03-06  8:43                     ` Jerin Jacob
  0 siblings, 0 replies; 24+ messages in thread
From: Jerin Jacob @ 2020-03-06  8:43 UTC (permalink / raw)
  To: Coyle, David
  Cc: Stephen Hemminger, dpdk-dev, Doherty, Declan, Trahe, Fiona,
	Prasun Kapoor

On Thu, Mar 5, 2020 at 10:32 PM Coyle, David <david.coyle@intel.com> wrote:
>
> > >
> > > Having an API that could be used by parallel hardware does make sense,
> > > but the DPDK already has multiple packet processing infrastructure pieces.
> > >
> > > I would rather the DPDK converge on one widely used, robust and tested
> > > packet method. Rather than the current "choose your poison or roll
> > > your own" which is what we have now. The proposed graph seems to be
> > the best so far.
> >
> > I agree. Even I thought of saying graph can do this, as, it has higher
> > abstraction and runtime chaining support, but then I thought it will be self
> > markering.
> > David could you check https://www.mail-
> > archive.com/dev@dpdk.org/msg156318.html
> > If this one only focusing crypto dev + compressdev, What if we have ethdev
> > + compressdev + security device in the future.
> > graph has higher abstraction so it can accommodate ANY chaining
> > requirements. i.e AESNI-MB + QAT will go as a separate node
>
> [DC] We have looked at the graph node library and we don't feel that using graph is the correct solution for what we are trying to solve here.
> We want to combine 2 or more packet processing functions on a packet into a single operation on a single device, be that an optimized software library such as AESNI MB or a hardware accelerator such as QAT
> So yes, these 2 packet processing functions could be a node (or nodes) within a graph.
> However they would still need to be combined together at some point to be processed on the device as a single operation.

It is possible with graph architecture where, when application gives
two nodes and they are connected then it will be possible to replace
to an optimized node at runtime.
Having said that, It is  NOT good to dedicate the graph as THE MODEL
for pipeline. i,e user should need to get both options rather sticking
to one model.
In that way, it makes sense to not abstract as a graph.

>
> Our new proposal is to use rte_rawdev to access the devices and we propose to add a "multi-function" interface which the application and rawdev PMDs will use to create the xform chains, sessions and op chains.
> The full details on this new proposal have been sent to you in a separate post and we feel it addresses the concerns of the original rte_accelerator API
>
> In the future, rawdev enqueue/dequeue calls using this multi-function interface could potentially be configured as a node within a packet processing graph

Yes. Makes sense.

>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-03-13 18:00       ` Coyle, David
@ 2020-03-13 18:03         ` Jerin Jacob
  0 siblings, 0 replies; 24+ messages in thread
From: Jerin Jacob @ 2020-03-13 18:03 UTC (permalink / raw)
  To: Coyle, David
  Cc: Shreyansh Jain, Hemant Agrawal, dev, stephen, Doherty, Declan,
	Trahe, Fiona, Ryan, Brendan, O'loingsigh, Mairtin

On Fri, Mar 13, 2020 at 11:30 PM Coyle, David <david.coyle@intel.com> wrote:
>
> >
> > On Fri, Mar 6, 2020 at 8:25 PM Coyle, David <david.coyle@intel.com> wrote:
> > >
> > > > >
> > > > > /** Error Detection Algorithms */
> > > > > enum rte_rawdev_multi_fn_err_detect_algorithm {
> > > > >         RTE_RAWDEV_MULTI_FN_ERR_DETECT_CRC32_ETH,
> > > >
> > > > IMO, It does not make sense to add protocol specific stuff in rawdev
> > > > symbols.
> > > >
> > > > IMO, It is better to have a separate library for CRC and BIP32
> > > > acceleration like the rte_security library and underneath still it
> > > > can use rawdev or anydev if required.
> > >
> > > [DC] This protocol stuff is only in the rawdev interface definition, which is
> > known only to the application and the rawdev PMDs which will use this
> > interface.
> > > So these defines/enums/structs etc for CRC and BIP are completely
> > opaque to rte_rawdev itself.
> > >
> > > This is how all existing rawdev PMDs interfaces are defined, where the
> > interface is very specific to the job(s) the PMD is implementing.
> >
> > If you see .map file in driver/raw/. None of the drivers are exposing any API
> > with rte_rawdev_*.
> > This addition will be exposing new rte_rawdev_* APIs from driver/rawdev/.
> > That's is not correct.
> >
> >  $ find drivers/raw/ -name *.map
> > drivers/raw/skeleton/rte_rawdev_skeleton_version.map
> > drivers/raw/octeontx2_ep/rte_rawdev_octeontx2_ep_version.map
> > drivers/raw/ntb/rte_rawdev_ntb_version.map
> > drivers/raw/dpaa2_qdma/rte_rawdev_dpaa2_qdma_version.map
> > drivers/raw/dpaa2_cmdif/rte_rawdev_dpaa2_cmdif_version.map
> > drivers/raw/ioat/rte_rawdev_ioat_version.map
> > drivers/raw/octeontx2_dma/rte_rawdev_octeontx2_dma_version.map
> > drivers/raw/ifpga/rte_rawdev_ifpga_version.map
> >
> > IMO, Correct thing to do will be,
> >
> > Either of
> >
> > 1) As mentioned below, If you would like to limit the scope only to a new
> > rawdev driver then
> > a) Create a new driver at driver/raw/<new driver>/
> > b) expose the drier specific customer API as
> > rte_<new-driver>_...(example:
> > drivers/raw/dpaa2_qdma/rte_rawdev_dpaa2_qdma_version.map
> >
> > 2) If we would like to have public API then create a subsystem like libsecurity
> > to have features. Let the API exposed from lib/...
> >
>
> [DC] Yes you are right here, it was incorrect to include rawdev in the interface filename and in the symbols within... rawdev will be removed from all these
> And we are going with option 1 above, to limit this to the new rawdev drivers.
> As I mentioned in the original post, if it is found that this interface could be useful to other drivers/applications in the future, then it can be moved to the public API under lib as a new library or an extension of an existing one possibly

+1

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-03-06 16:22     ` Jerin Jacob
@ 2020-03-13 18:00       ` Coyle, David
  2020-03-13 18:03         ` Jerin Jacob
  0 siblings, 1 reply; 24+ messages in thread
From: Coyle, David @ 2020-03-13 18:00 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Shreyansh Jain, Hemant Agrawal, dev, stephen, Doherty, Declan,
	Trahe, Fiona, Ryan, Brendan, O'loingsigh, Mairtin

> 
> On Fri, Mar 6, 2020 at 8:25 PM Coyle, David <david.coyle@intel.com> wrote:
> >
> > > >
> > > > /** Error Detection Algorithms */
> > > > enum rte_rawdev_multi_fn_err_detect_algorithm {
> > > >         RTE_RAWDEV_MULTI_FN_ERR_DETECT_CRC32_ETH,
> > >
> > > IMO, It does not make sense to add protocol specific stuff in rawdev
> > > symbols.
> > >
> > > IMO, It is better to have a separate library for CRC and BIP32
> > > acceleration like the rte_security library and underneath still it
> > > can use rawdev or anydev if required.
> >
> > [DC] This protocol stuff is only in the rawdev interface definition, which is
> known only to the application and the rawdev PMDs which will use this
> interface.
> > So these defines/enums/structs etc for CRC and BIP are completely
> opaque to rte_rawdev itself.
> >
> > This is how all existing rawdev PMDs interfaces are defined, where the
> interface is very specific to the job(s) the PMD is implementing.
> 
> If you see .map file in driver/raw/. None of the drivers are exposing any API
> with rte_rawdev_*.
> This addition will be exposing new rte_rawdev_* APIs from driver/rawdev/.
> That's is not correct.
> 
>  $ find drivers/raw/ -name *.map
> drivers/raw/skeleton/rte_rawdev_skeleton_version.map
> drivers/raw/octeontx2_ep/rte_rawdev_octeontx2_ep_version.map
> drivers/raw/ntb/rte_rawdev_ntb_version.map
> drivers/raw/dpaa2_qdma/rte_rawdev_dpaa2_qdma_version.map
> drivers/raw/dpaa2_cmdif/rte_rawdev_dpaa2_cmdif_version.map
> drivers/raw/ioat/rte_rawdev_ioat_version.map
> drivers/raw/octeontx2_dma/rte_rawdev_octeontx2_dma_version.map
> drivers/raw/ifpga/rte_rawdev_ifpga_version.map
> 
> IMO, Correct thing to do will be,
> 
> Either of
> 
> 1) As mentioned below, If you would like to limit the scope only to a new
> rawdev driver then
> a) Create a new driver at driver/raw/<new driver>/
> b) expose the drier specific customer API as
> rte_<new-driver>_...(example:
> drivers/raw/dpaa2_qdma/rte_rawdev_dpaa2_qdma_version.map
> 
> 2) If we would like to have public API then create a subsystem like libsecurity
> to have features. Let the API exposed from lib/...
> 

[DC] Yes you are right here, it was incorrect to include rawdev in the interface filename and in the symbols within... rawdev will be removed from all these
And we are going with option 1 above, to limit this to the new rawdev drivers.
As I mentioned in the original post, if it is found that this interface could be useful to other drivers/applications in the future, then it can be moved to the public API under lib as a new library or an extension of an existing one possibly 

> >
> > Also, these particular defines/enums/structs for CRC and BIP are only for
> defining xform and op chains containing these particular operations.
> > The actual code to do the CRC and BIP is already in the AESNI-MB
> > library or DPDK rte_net_crc library, which our aesni_mb and qat rawdev
> > PMDs will call/use
> >
> > >
> > > IMO, Exposing the public API in
> > > drivers/raw/common/rte_rawdev_multi_fn.h is a shortcut.
> > > IMO, public API should be in lib/..
> >
> > [DC] To be honest, I tend to agree. I don't like that public APIs are exposed
> from the drivers directory.
> > But as I mentioned above, this is how all rawdev PMD interfaces are
> > defined, where the interface definition is within the PMD directory
> > (e.g. drivers/raw/dpaa2_cmdif/rte_pmd_dpaa2_cmdif.h)
> > Our's is slightly different in that we have 2 PMDs which will use the
> > same interface, which is why we have added it in drivers/raw/common So
> > by keeping our interface under drivers, we are trying to be consistent
> > with all existing rawdev PMDs
> >
> > As I mentioned in my previous post though, this could potentially be
> > moved under lib in the future if other PMDs would find it useful
> 
> See above. Point (1).
> 
> >
> > We could possibly rename our interface file to rte_pmd_multi_fn.h to be a
> bit more consistent with the majority of the existing PMDs and take away the
> idea for now that this is some kind of extension to the main rte_rawdev API.
> > But unfortunately there is no full consistency in the rawdev PMD
> > interface filenames (e.g. dpaa2_cmdif uses the "rte_pmd_" prefix -
> > rte_pmd_dpaa2_cmdif.h, octeontx2_dma uses the "_rawdev" suffix -
> > otx2_dpi_rawdev.h)
> >
> > >
> > > Just my 2c.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-03-06 14:55   ` Coyle, David
@ 2020-03-06 16:22     ` Jerin Jacob
  2020-03-13 18:00       ` Coyle, David
  0 siblings, 1 reply; 24+ messages in thread
From: Jerin Jacob @ 2020-03-06 16:22 UTC (permalink / raw)
  To: Coyle, David
  Cc: Shreyansh Jain, Hemant Agrawal, dev, stephen, Doherty, Declan,
	Trahe, Fiona, Ryan, Brendan, O'loingsigh, Mairtin

On Fri, Mar 6, 2020 at 8:25 PM Coyle, David <david.coyle@intel.com> wrote:
>
> > >
> > > /** Error Detection Algorithms */
> > > enum rte_rawdev_multi_fn_err_detect_algorithm {
> > >         RTE_RAWDEV_MULTI_FN_ERR_DETECT_CRC32_ETH,
> >
> > IMO, It does not make sense to add protocol specific stuff in rawdev
> > symbols.
> >
> > IMO, It is better to have a separate library for CRC and BIP32 acceleration like
> > the rte_security library and underneath still it can use rawdev or anydev if
> > required.
>
> [DC] This protocol stuff is only in the rawdev interface definition, which is known only to the application and the rawdev PMDs which will use this interface.
> So these defines/enums/structs etc for CRC and BIP are completely opaque to rte_rawdev itself.
>
> This is how all existing rawdev PMDs interfaces are defined, where the interface is very specific to the job(s) the PMD is implementing.

If you see .map file in driver/raw/. None of the drivers are exposing
any API with rte_rawdev_*.
This addition will be exposing new rte_rawdev_* APIs from
driver/rawdev/. That's is not correct.

 $ find drivers/raw/ -name *.map
drivers/raw/skeleton/rte_rawdev_skeleton_version.map
drivers/raw/octeontx2_ep/rte_rawdev_octeontx2_ep_version.map
drivers/raw/ntb/rte_rawdev_ntb_version.map
drivers/raw/dpaa2_qdma/rte_rawdev_dpaa2_qdma_version.map
drivers/raw/dpaa2_cmdif/rte_rawdev_dpaa2_cmdif_version.map
drivers/raw/ioat/rte_rawdev_ioat_version.map
drivers/raw/octeontx2_dma/rte_rawdev_octeontx2_dma_version.map
drivers/raw/ifpga/rte_rawdev_ifpga_version.map

IMO, Correct thing to do will be,

Either of

1) As mentioned below, If you would like to limit the scope only to a
new rawdev driver then
a) Create a new driver at driver/raw/<new driver>/
b) expose the drier specific customer API as
rte_<new-driver>_...(example:
drivers/raw/dpaa2_qdma/rte_rawdev_dpaa2_qdma_version.map

2) If we would like to have public API then create a subsystem like
libsecurity to have features. Let the API exposed from lib/...

>
> Also, these particular defines/enums/structs for CRC and BIP are only for defining xform and op chains containing these particular operations.
> The actual code to do the CRC and BIP is already in the AESNI-MB library or DPDK rte_net_crc library, which our aesni_mb and qat rawdev PMDs will call/use
>
> >
> > IMO, Exposing the public API in
> > drivers/raw/common/rte_rawdev_multi_fn.h is a shortcut.
> > IMO, public API should be in lib/..
>
> [DC] To be honest, I tend to agree. I don't like that public APIs are exposed from the drivers directory.
> But as I mentioned above, this is how all rawdev PMD interfaces are defined, where the interface definition is within the PMD directory (e.g. drivers/raw/dpaa2_cmdif/rte_pmd_dpaa2_cmdif.h)
> Our's is slightly different in that we have 2 PMDs which will use the same interface, which is why we have added it in drivers/raw/common
> So by keeping our interface under drivers, we are trying to be consistent with all existing rawdev PMDs
>
> As I mentioned in my previous post though, this could potentially be moved under lib in the future if other PMDs would find it useful

See above. Point (1).

>
> We could possibly rename our interface file to rte_pmd_multi_fn.h to be a bit more consistent with the majority of the existing PMDs and take away the idea for now that this is some kind of extension to the main rte_rawdev API.
> But unfortunately there is no full consistency in the rawdev PMD interface filenames (e.g. dpaa2_cmdif uses the "rte_pmd_" prefix - rte_pmd_dpaa2_cmdif.h, octeontx2_dma uses the "_rawdev" suffix - otx2_dpi_rawdev.h)
>
> >
> > Just my 2c.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-03-06  9:06 ` Jerin Jacob
@ 2020-03-06 14:55   ` Coyle, David
  2020-03-06 16:22     ` Jerin Jacob
  0 siblings, 1 reply; 24+ messages in thread
From: Coyle, David @ 2020-03-06 14:55 UTC (permalink / raw)
  To: Jerin Jacob, Shreyansh Jain, Hemant Agrawal
  Cc: dev, stephen, Doherty, Declan, Trahe, Fiona, Ryan, Brendan,
	O'loingsigh, Mairtin

> >
> > /** Error Detection Algorithms */
> > enum rte_rawdev_multi_fn_err_detect_algorithm {
> >         RTE_RAWDEV_MULTI_FN_ERR_DETECT_CRC32_ETH,
> 
> IMO, It does not make sense to add protocol specific stuff in rawdev
> symbols.
> 
> IMO, It is better to have a separate library for CRC and BIP32 acceleration like
> the rte_security library and underneath still it can use rawdev or anydev if
> required.

[DC] This protocol stuff is only in the rawdev interface definition, which is known only to the application and the rawdev PMDs which will use this interface.
So these defines/enums/structs etc for CRC and BIP are completely opaque to rte_rawdev itself.

This is how all existing rawdev PMDs interfaces are defined, where the interface is very specific to the job(s) the PMD is implementing.

Also, these particular defines/enums/structs for CRC and BIP are only for defining xform and op chains containing these particular operations.
The actual code to do the CRC and BIP is already in the AESNI-MB library or DPDK rte_net_crc library, which our aesni_mb and qat rawdev PMDs will call/use

> 
> IMO, Exposing the public API in
> drivers/raw/common/rte_rawdev_multi_fn.h is a shortcut.
> IMO, public API should be in lib/..

[DC] To be honest, I tend to agree. I don't like that public APIs are exposed from the drivers directory.
But as I mentioned above, this is how all rawdev PMD interfaces are defined, where the interface definition is within the PMD directory (e.g. drivers/raw/dpaa2_cmdif/rte_pmd_dpaa2_cmdif.h)
Our's is slightly different in that we have 2 PMDs which will use the same interface, which is why we have added it in drivers/raw/common
So by keeping our interface under drivers, we are trying to be consistent with all existing rawdev PMDs

As I mentioned in my previous post though, this could potentially be moved under lib in the future if other PMDs would find it useful

We could possibly rename our interface file to rte_pmd_multi_fn.h to be a bit more consistent with the majority of the existing PMDs and take away the idea for now that this is some kind of extension to the main rte_rawdev API.
But unfortunately there is no full consistency in the rawdev PMD interface filenames (e.g. dpaa2_cmdif uses the "rte_pmd_" prefix - rte_pmd_dpaa2_cmdif.h, octeontx2_dma uses the "_rawdev" suffix - otx2_dpi_rawdev.h)

> 
> Just my 2c.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
  2020-03-05 16:44 Coyle, David
@ 2020-03-06  9:06 ` Jerin Jacob
  2020-03-06 14:55   ` Coyle, David
  0 siblings, 1 reply; 24+ messages in thread
From: Jerin Jacob @ 2020-03-06  9:06 UTC (permalink / raw)
  To: Coyle, David, Shreyansh Jain, Hemant Agrawal
  Cc: dev, stephen, Doherty, Declan, Trahe, Fiona

On Thu, Mar 5, 2020 at 10:14 PM Coyle, David <david.coyle@intel.com> wrote:
>
> Having taken feedback from the community into account, we would like to propose some changes to our approach for combining multiple packet-processing functions into a single operation on a single device, be that an optimized software library or a hardware accelerator.

> Note that development work is already progressing well on this new approach, with the aim of upstreaming this into DPDK v20.05.
>
> Also, we may consider consolidating the multi-function API into the main DPDK library in the future if there were more devices which wanted to support these multi-function operations.
>
> The following is the proposed rawdev multi-function interface, defined in 'drivers/raw/common/rte_rawdev_multi_fn.h'
>
> /* SPDX-License-Identifier: BSD-3-Clause
>  * Copyright(c) 2020 Intel Corporation.
>  */
>
> #ifndef _RTE_RAWDEV_MULTI_FN_H_
> #define _RTE_RAWDEV_MULTI_FN_H_
>
> #ifdef __cplusplus
> extern "C" {
> #endif
>
> #include <rte_compat.h>
> #include <rte_common.h>
> #include <rte_mbuf.h>
> #include <rte_memory.h>
> #include <rte_mempool.h>
> #include <rte_comp.h>
> #include <rte_crypto.h>
> #include <rte_rawdev.h>
>
> /** Error Detection Algorithms */
> enum rte_rawdev_multi_fn_err_detect_algorithm {
>         RTE_RAWDEV_MULTI_FN_ERR_DETECT_CRC32_ETH,

IMO, It does not make sense to add protocol specific stuff in rawdev symbols.

IMO, It is better to have a separate library for CRC and BIP32
acceleration like the rte_security library and
underneath still it can use rawdev or anydev if required.

IMO, Exposing the public API in
drivers/raw/common/rte_rawdev_multi_fn.h is a shortcut.
IMO, public API should be in lib/..

Just my 2c.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [dpdk-dev] [RFC] Accelerator API to chain packet processing functions
@ 2020-03-05 16:44 Coyle, David
  2020-03-06  9:06 ` Jerin Jacob
  0 siblings, 1 reply; 24+ messages in thread
From: Coyle, David @ 2020-03-05 16:44 UTC (permalink / raw)
  To: dev, Jerin Jacob, stephen; +Cc: Doherty, Declan, Trahe, Fiona

Having taken feedback from the community into account, we would like to propose some changes to our approach for combining multiple packet-processing functions into a single operation on a single device, be that an optimized software library or a hardware accelerator.

The main feedback on the rte_accelerator API can be summarized as follows:
1) Why are we creating another new library that performs tasks similar to other existing APIs... why not try and converge on one?
2) The term "accelerator" is too broad a term, if the API is primarily focused on Crypto + CRC

We also felt that using the rte_cryptodev and rte_compressdev APIs to initialize, configure and reset the devices and then using rte_accelerator for session creation and operation enqueue/dequeue was confusing matters.
We believe the new approach addresses the above concerns and also greatly simplifies the solution.

Our new approach proposes to use the already existing rte_rawdev API with some added functionality for creating "multi-function" sessions.

At the high level, the main changes are:
	- The rte_accelerator library will no longer be added
	- The rte_rawdev API will be used to initialize, configure, reset a device
	- A new rawdev interface for "multi-function" sessions/operations will be added under the new directory 'drivers/raw/common'
		- this interface's header file will be called 'rte_rawdev_multi_fn.h' (with an accompanying C file for function implementations)
		- this header file will contain much of what was previously included in rte_accelerator.h and rte_err_detect.h, such as:
			- enums and structs for defining a multi-function chain of xforms, using xform definitions from rte_cryptodev and rte_compressdev as necessary
			- enums and structs for defining a multi-function chain of ops, again using op definitions from rte_cryptodev and rte_compressdev as necessary
			- enums and structs for defining error-detection xforms and ops
			- two API function definitions to create and destroy a session based on a xform chain
				- rte_rawdev_multi_fn_session_create()
				- rte_rawdev_multi_fn_session_destroy()
		- application code will include rte_rawdev_multi_fn.h to access the structs, enums and functions for creating xform chains, sessions and op chains
		- keeping the multi-function interface under the 'drivers' directory means that rte_rawdev itself remains completely "raw", with no knowledge of xforms, sessions or ops
		- a proposal for this header file is included at the end
	- The rte_rawdev API will be used to enqueue/dequeue the operations using the existing rte_rawdev_enqueue_buffers() and rte_rawdev_dequeue_buffers()
		- a synchronous API function could potentially be added to rte_rawdev in the future if required, to avoid the overhead of enqueue/dequeue for the optimized software library use-case (e.g. rte_rawdev_process_buffers())
	- Two new rawdev PMDs for will be added under 'drivers/raw' for QAT and AESNI-MB
		- these two rawdev PMDs will use and implement the multi-function interface defined in 'drivers/raw/common/rte_rawdev_multi_fn.h'
		- as with all other rawdev PMDs, the interface is known only to the application and the PMD itself, and is opaque to rte_rawdev itself
		- the PMDs will be added under 'drivers/raw/aesni_mb' and 'drivers/raw/qat'
		- other PMDs (existing or new) could use this multi-function interface in the future if use-cases arise
	- The rte_rawdev library will be used as is, with no changes required
	

The initial use cases for the multi-function rawdev interface remain the same as for the previously proposed rte_accelerator:
	- DOCSIS MAC: Crypto + CRC
	- XGS-PON MAC: Crypto + CRC + BIP

However, the API can still also accommodate other chained functions such as Compression + Crypto and UDP Checksum + Crypto.

The following diagram shows the new architecture:

    +-----------------------------------------------------------+
    |                                                           |
    |                      Application                          |
    |        (e.g. vCMTS (DOCSIS), vOLT (XGS-PON), etc.)        |
    |                                                           |
    +-----------------------------------------------------------+
                                |
    +---------------------------|-------------------------------+
    |                           |                       DPDK    |
    |                           |                               |
    |                 +---------------------+                   |
    |                 |                     |                   |
    |                 |     rte_rawdev      |                   |
    |                 |                     |                   |               NOTE:
    |                 +---------------------+       ____________|_______ 'RAWDEV MULTI-FUNCTION
    |                        /      \              /            |           API' is opaque to
    |                       /        \            /             |             rte_rawdev
    |                      /          \          /              |
    |           +--------------------------------+              |
    |           |   RAWDEV MULTI-FUNCTION API    |              |
    |           +--------------------------------+              |
    |           +------------+      +------------+              |
    |           |   RAWDEV   |      |   RAWDEV   |              |
    |           |  AESNI-MB  |      |    QAT     |              |
    |           |    PMD     |      |    PMD     |              |
    |           +------------+      +------------+              |
    |                  |                  |                     |
    +------------------|------------------|---------------------+
                       |                  |
                +------------+      +------------+
                |  AESNI-MB  |      |   QAT HW   |
                |   SW LIB   |      |            |
                +------------+      +------------+


Note that development work is already progressing well on this new approach, with the aim of upstreaming this into DPDK v20.05.

Also, we may consider consolidating the multi-function API into the main DPDK library in the future if there were more devices which wanted to support these multi-function operations.

The following is the proposed rawdev multi-function interface, defined in 'drivers/raw/common/rte_rawdev_multi_fn.h'

/* SPDX-License-Identifier: BSD-3-Clause
 * Copyright(c) 2020 Intel Corporation.
 */

#ifndef _RTE_RAWDEV_MULTI_FN_H_
#define _RTE_RAWDEV_MULTI_FN_H_

#ifdef __cplusplus
extern "C" {
#endif

#include <rte_compat.h>
#include <rte_common.h>
#include <rte_mbuf.h>
#include <rte_memory.h>
#include <rte_mempool.h>
#include <rte_comp.h>
#include <rte_crypto.h>
#include <rte_rawdev.h>

/** Error Detection Algorithms */
enum rte_rawdev_multi_fn_err_detect_algorithm {
	RTE_RAWDEV_MULTI_FN_ERR_DETECT_CRC32_ETH,
	/**< CRC32 Ethernet */
	RTE_RAWDEV_MULTI_FN_ERR_DETECT_BIP32
	/**< BIP32 */
};

/** Error Detection Operation Types */
enum rte_rawdev_multi_fn_err_detect_operation {
	RTE_RAWDEV_MULTI_FN_ERR_DETECT_OP_VERIFY,
	/**< Verify error detection result */
	RTE_RAWDEV_MULTI_FN_ERR_DETECT_OP_GENERATE
	/**< Generate error detection result */
};

/** Error Detection Status */
enum rte_rawdev_multi_fn_err_detect_op_status {
	RTE_RAWDEV_MULTI_FN_ERR_DETECT_OP_STATUS_NOT_PROCESSED,
	/**< Operation has not yet been processed by a device */
	RTE_RAWDEV_MULTI_FN_ERR_DETECT_OP_STATUS_SUCCESS,
	/**< Operation completed successfully */
	RTE_RAWDEV_MULTI_FN_ERR_DETECT_OP_STATUS_VERIFY_FAILED,
	/**< Verification failed */
	RTE_RAWDEV_MULTI_FN_ERR_DETECT_OP_STATUS_ERROR
	/**< Error handling operation */
};

/**
 * Error Detection Transform Data
 *
 * This structure contains data relating to an error detection transform. The
 * fields *op* and *algo* are common to all error detection transforms and
 * MUST be set
 */
struct rte_rawdev_multi_fn_err_detect_xform {
	enum rte_rawdev_multi_fn_err_detect_operation op;
	/**< Error detection operation type */
	enum rte_rawdev_multi_fn_err_detect_algorithm algo;
	/**< Error detection algorithm */
};

/** Error Detection Operation */
struct rte_rawdev_multi_fn_err_detect_op {
	struct rte_mbuf *m_src; /**< Source mbuf */
	enum rte_rawdev_multi_fn_err_detect_op_status status;
	/**< Operation status */

	struct {
		uint16_t offset;
		/**<
		 * Starting point for error detection processing, specified
		 * as the number of bytes from start of the packet in the
		 * source mbuf
		 */
		uint16_t length;
		/**<
		 * The length, in bytes, of the source mbuf on which the error
		 * detection operation will be computed
		 */
	} data; /**< Data offset and length for error detection */

	struct {
		uint8_t *data;
		/**<
		 * This points to the location where the error detection
		 * result should be written (in the case of generation) or
		 * where the purported result exists (in the case of
		 * verification)
		 *
		 * The caller must ensure the required length of physically
		 * contiguous memory is available at this address
		 *
		 * For a CRC, this may point into the mbuf packet data. For
		 * an operation such as a BIP, this may point to a memory
		 * location after the op
		 *
		 * For generation, the result will overwrite any data at this
		 * location
		 */
		rte_iova_t phys_addr;
		/**< Physical address of output data */
	} output; /**< Output location */
};

/**
 * Multi-function transform types
 */
enum rte_rawdev_multi_fn_xform_type {
	RTE_RAWDEV_MULTI_FN_XFORM_TYPE_CRYPTO_SYM,
	/**< Symmetric crypto transform type */
	RTE_RAWDEV_MULTI_FN_XFORM_TYPE_CRYPTO_ASYM,
	/**< Asymmetric crypto transform type */
	RTE_RAWDEV_MULTI_FN_XFORM_TYPE_COMP,
	/**< Compression transform type */
	RTE_RAWDEV_MULTI_FN_XFORM_TYPE_ERR_DETECT
	/**< Error detection transform type */
};

/**
 * Multi-function transform setup data
 *
 * This structure is used to specify the multi-function transforms required.
 * Multiple transforms can be chained together to specify a chain of transforms
 * such as symmetric crypto followed by error detection, or compression followed
 * by symmetric crypto. Each transform structure holds a single transform, with
 * the type field specifying which transform is contained within the union.
 */
struct rte_rawdev_multi_fn_xform {
	struct rte_rawdev_multi_fn_xform *next;
	/**<
	 * Next transform in the chain
	 * - the last transform in the chain MUST set this to NULL
	 */
	enum rte_rawdev_multi_fn_xform_type type;
	/**< Transform type */

	RTE_STD_C11
	union {
		struct rte_crypto_sym_xform crypto_sym;
		/**< Symmetric crypto transform */
		struct rte_crypto_asym_xform crypto_asym;
		/**< Asymmetric crypto transform */
		struct rte_comp_xform comp;
		/**< Compression transform */
		struct rte_rawdev_multi_fn_err_detect_xform err_detect;
		/**< Error detection transform */
	};
};

/**
 * Multi-function operation status
 */
enum rte_rawdev_multi_fn_op_status {
	RTE_RAWDEV_MULTI_FN_OP_STATUS_SUCCESS = 0,
	/**< Operation completed successfully */
	RTE_RAWDEV_MULTI_FN_OP_STATUS_FAILURE,
	/**< Operation completed with failure */
	RTE_RAWDEV_MULTI_FN_STATUS_INVALID_SESSION,
	/**< Operation failed due to invalid session arguments */
	RTE_RAWDEV_MULTI_FN_OP_STATUS_NOT_PROCESSED,
	/**< Operation has not yet been processed by a device */
};

/**
 * Multi-function session data
 */
struct rte_rawdev_multi_fn_session;

/**
 * Multi-function operation data
 *
 * This structure is used to specify the operations for a particular session.
 * This includes specifying the source and, if required, destination mbufs and
 * the lengths and offsets of the data within these mbufs on which the
 * operations should be done. Multiple operations are chained together to
 * specify the full set of operations to be performed
 *
 * @note The rte_rawdev_multi_fn_op chain MUST match the session's xform
 * chain exactly
 * @note The first rte_rawdev_multi_fn_op element in the chain is the parent
 * operation. The following fields MUST be set in this first operation before
 * enqueuing and are ignored in the inner operations and any subsequent
 * rte_rawdev_multi_fn_op chain elements:
 * - *sess*
 * - *m_src*
 * - *m_dst* (if required)
 * @note If *sess* or *m_src* is not set in the first rte_rawdev_multi_fn_op,
 * this operation is invalid and will cause an error when attempting to enqueue.
 * @note The following fields MUST be set in ALL rte_rawdev_multi_fn_op chain
 * elements:
 * - *next*
 * - *mempool*
 * - *type*
 * @note After the operation has been dequeued, only the FIRST (i.e. the parent)
 * rte_rawdev_multi_fn_op in the chain will contain the *overall_status*. Each
 * chain element will contain it's individual *op_status*, the value of which is
 * relevant to operation type (e.g. an ::rte_crypto_op_status,
 * ::rte_comp_op_status or ::rte_err_detect_op_status)
 */
struct rte_rawdev_multi_fn_op {
	struct rte_rawdev_multi_fn_op *next;
	/**<
	 * Next operation in the chain
	 * - the last operation in the chain MUST set this to NULL
	 */
	struct rte_rawdev_multi_fn_session *sess;
	/**< Handle for the associated multi-function session */

	struct rte_mempool *mempool;
	/**< Mempool from which the operation is allocated */

	struct rte_mbuf *m_src; /**< Source mbuf */
	struct rte_mbuf *m_dst; /**< Destination mbuf */

	enum rte_rawdev_multi_fn overall_status;
	/**<
	 * Overall operation status
	 * - indicates if all the operations in the chain succeeded or if any
	 *   one of them failed
	 */

	uint8_t op_status;
	/**<
	 * Individual operation status
	 * - indicates the status of the individual operation in the chain
	 */

	RTE_STD_C11
	union {
		struct rte_crypto_sym_op crypto_sym;
		/**< Symmetric crypto operation */
		struct rte_crypto_asym_op crypto_asym;
		/**< Asymmetric crypto operation */
		struct rte_comp_op comp;
		/**< Compression operation */
		struct rte_rawdev_multi_fn_err_detect_op err_detect;
		/**< Error detection operation */
	};
};

/**
 * Create multi-function session as specified by the transform chain
 *
 * @param   dev_info	Device info, obtained by calling rte_rawdev_info_get()
 * @param   xform	Pointer to the first element of the session transform
 * 			chain
 * @param   socket_id	Socket to allocate the session on
 *
 * @return
 *  - Pointer to session, if successful
 *  - NULL, on failure
 */
__rte_experimental
struct rte_rawdev_multi_fn_session *
rte_rawdev_multi_fn_session_create(struct rte_rawdev_info *dev_info,
				   struct rte_rawdev_multi_fn_xform *xform,
				   int socket_id);

/**
 * Free memory assoicated with a multi-function session
 *
 * @param   dev_info	Device info, obtained by calling rte_rawdev_info_get()
 * @param   sess	Multi-function session to be freed
 *
 * @return
 *  - 0, if successful
 *  - -EINVAL, if session is NULL
 *  - -EBUSY, if not all session data has been freed
 */
__rte_experimental
int
rte_rawdev_multi_fn_session_destroy(struct rte_rawdev_info *dev_info,
				    struct rte_rawdev_multi_fn_session *sess);

#ifdef __cplusplus
}
#endif

#endif /* _RTE_RAWDEV_MULTI_FN_H_ */

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, back to index

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-04 14:45 [dpdk-dev] [RFC] Accelerator API to chain packet processing functions David Coyle
2020-02-04 19:52 ` Jerin Jacob
2020-02-06 10:04   ` Coyle, David
2020-02-06 10:54     ` Jerin Jacob
2020-02-06 16:31       ` Coyle, David
2020-02-06 17:13         ` Jerin Jacob
2020-02-07 12:38           ` Coyle, David
2020-02-07 14:18             ` Jerin Jacob
2020-02-07 20:34               ` Stephen Hemminger
2020-02-08  7:22                 ` Jerin Jacob
2020-03-05 17:01                   ` Coyle, David
2020-03-06  8:43                     ` Jerin Jacob
2020-02-13 11:50               ` Doherty, Declan
2020-02-18  5:15                 ` Jerin Jacob
2020-02-13 11:44           ` Doherty, Declan
2020-02-18  5:30             ` Jerin Jacob
2020-02-13 11:31       ` Doherty, Declan
2020-02-18  5:12         ` Jerin Jacob
2020-03-05 16:44 Coyle, David
2020-03-06  9:06 ` Jerin Jacob
2020-03-06 14:55   ` Coyle, David
2020-03-06 16:22     ` Jerin Jacob
2020-03-13 18:00       ` Coyle, David
2020-03-13 18:03         ` Jerin Jacob

DPDK patches and discussions

Archives are clonable:
	git clone --mirror http://inbox.dpdk.org/dev/0 dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 dev dev/ http://inbox.dpdk.org/dev \
		dev@dpdk.org
	public-inbox-index dev


Newsgroup available over NNTP:
	nntp://inbox.dpdk.org/inbox.dpdk.dev


AGPL code for this site: git clone https://public-inbox.org/ public-inbox