DPDK patches and discussions
* [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf
@ 2021-03-09 23:57 Suanming Mou
  2021-03-09 23:57 ` [dpdk-dev] [PATCH 1/3] common/mlx5: add user memory registration bits Suanming Mou
                   ` (7 more replies)
  0 siblings, 8 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-09 23:57 UTC (permalink / raw)
  To: viacheslavo, matan, orika; +Cc: rasland, dev

The mlx5 RegEx driver did not support scattered mbufs. This patch set
adds that support by using the UMR WQE.

A UMR WQE can map the memory spaces of multiple mkeys to one contiguous
space. Taking advantage of this, the scattered mbuf of one operation
can be converted to an indirect mkey, so the RegEx engine, which
accepts only one mkey, can now process the whole scattered mbuf.

The maximum number of scattered mbuf segments supported in one UMR WQE
is now defined as 64. The scattered mbufs of multiple operations can be
added to one UMR WQE if there is enough space in the KLM array, since
each operation can address its own mbuf content by the mkey's address
and length. However, the scattered mbuf of one operation can't be split
across the KLM arrays of two different UMR WQEs. If the UMR WQE's KLM
array does not have enough free space for one operation, a new UMR WQE
is required.
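
The KLM packing rule above can be sketched in plain C as follows. This
is only an illustration, not the driver code: every sketch_* name is
hypothetical, the limit of 64 is taken from the description above, and
the burst is walked in submission order for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit, taken from the description above. */
#define SKETCH_MAX_KLM_NUM 64u

/* Nonzero when `need` more KLM entries fit after `used` entries. */
static inline int
sketch_klm_fits(uint32_t used, uint32_t need, uint32_t max_klm)
{
	return used + need <= max_klm;
}

/*
 * Count the UMR WQEs needed by a burst. nb_segs[i] is the segment
 * count of operation i. Single-segment operations use the mbuf's mkey
 * directly and consume no KLM entries. One operation is never split
 * across two KLM arrays.
 */
static size_t
sketch_count_umr_wqes(const uint32_t *nb_segs, size_t nb_ops,
		      uint32_t max_klm)
{
	size_t umr_wqes = 0;
	uint32_t used = 0;
	int open = 0; /* A KLM array is currently being filled. */
	size_t i;

	for (i = 0; i < nb_ops; i++) {
		if (nb_segs[i] <= 1)
			continue;
		/* Assumption: one operation always fits an empty array. */
		assert(nb_segs[i] <= max_klm);
		if (!open || !sketch_klm_fits(used, nb_segs[i], max_klm)) {
			/* Not enough free space: start a new UMR WQE. */
			umr_wqes++;
			used = 0;
			open = 1;
		}
		used += nb_segs[i];
	}
	return umr_wqes;
}
```

For a burst with segment counts {1, 40, 30, 2}, the second operation
opens one KLM array, the third does not fit next to it and opens a
second one, and the fourth shares the second array.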

To prevent the UMR WQE's indirect mkey from being overwritten when the
SQ's WQEs wrap around, the mkey index used by the UMR WQE should be the
index of the last RegEx WQE among the operations. As one operation
consumes one WQE set, building the RegEx WQEs in reverse order helps
address the mkey more efficiently. When the operations in one burst
consume multiple mkeys, each time the mkey's KLM array becomes full,
the current reverse WQE set index is always the last one of the new
mkey for the new UMR WQE.

In GGA mode, the SQ's memory layout becomes interleaved UMR/NOP and
RegEx WQEs. One UMR/NOP WQE plus one RegEx WQE is called a WQE set. The
SQ's pi and ci also advance per WQE set, not per WQE.
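
The WQE set arithmetic can be sketched as below. The sizes mirror the
patch (a 64-byte WQEBB, a 192-byte UMR/NOP slot, 4 WQEBBs per set), but
the sketch_* helpers are hypothetical names used only for illustration
and assume a power-of-two SQ size counted in WQE sets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_WQEBB_SIZE 64u      /* One send WQE basic block. */
#define SKETCH_UMR_WQE_SIZE 192u   /* UMR/NOP slot: 3 WQEBBs. */
#define SKETCH_WQEBBS_PER_SET 4u   /* UMR/NOP (3) + RegEx (1). */

/* Byte offset of a WQE set (and of its UMR/NOP WQE) in the SQ buffer. */
static inline size_t
sketch_set_offset(size_t set_pi, size_t sq_size)
{
	/* Assumption: sq_size is a power of two, counted in WQE sets. */
	assert(sq_size && (sq_size & (sq_size - 1)) == 0);
	return (set_pi & (sq_size - 1)) *
	       SKETCH_WQEBB_SIZE * SKETCH_WQEBBS_PER_SET;
}

/* Byte offset of the RegEx WQE, placed right after the UMR/NOP slot. */
static inline size_t
sketch_regex_wqe_offset(size_t set_pi, size_t sq_size)
{
	return sketch_set_offset(set_pi, sq_size) + SKETCH_UMR_WQE_SIZE;
}

/* WQEBB index written into the UMR/NOP WQE control segment. */
static inline uint16_t
sketch_umr_wqe_index(size_t set_pi)
{
	return (uint16_t)(set_pi * SKETCH_WQEBBS_PER_SET);
}

/* WQEBB index written into the RegEx WQE control segment. */
static inline uint16_t
sketch_regex_wqe_index(size_t set_pi)
{
	return (uint16_t)(set_pi * SKETCH_WQEBBS_PER_SET + 3);
}

/* Convert a completion's WQE counter back to a WQE set index. */
static inline size_t
sketch_set_index_from_counter(uint16_t wqe_counter)
{
	return wqe_counter >> 2;
}
```

With these helpers, WQE set 2 holds its NOP or UMR WQE at WQEBB index 8
and its RegEx WQE at index 11, and a completion reporting counter 11
maps back to set 2.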

For operations without a scattered mbuf, the mbuf's mkey is used
directly and the WQE set combination is NOP + RegEx.
For operations with a scattered mbuf that share the UMR WQE with
others, the WQE set combination is NOP + RegEx.
For the operation that completes the UMR WQE, the WQE set combination
is UMR + RegEx.
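
A minimal sketch of how a burst could be planned in reverse so that
each WQE set ends up as NOP + RegEx or UMR + RegEx. It follows the
description above under simplifying assumptions: it only records which
kind of first WQE each set gets, and all sketch_* names are
hypothetical rather than taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_first_wqe {
	SKETCH_NOP, /* WQE set is NOP + RegEx. */
	SKETCH_UMR, /* WQE set is UMR + RegEx. */
};

/*
 * Walk the burst from the last operation to the first. nb_segs[i] is
 * the segment count of operation i and first[i] receives the kind of
 * first WQE in its WQE set. When a scattered operation does not fit
 * the current KLM array, the UMR WQE of that array is placed in the
 * next WQE set (i + 1) and a new array is started. The array still
 * open at the end is completed in the first WQE set of the burst.
 */
static void
sketch_plan_burst(const uint32_t *nb_segs, size_t nb_ops,
		  uint32_t max_klm, enum sketch_first_wqe *first)
{
	uint32_t used = 0;
	int open = 0; /* A KLM array is currently being filled. */
	size_t i = nb_ops;

	while (i--) {
		first[i] = SKETCH_NOP; /* NOP + RegEx by default. */
		if (nb_segs[i] <= 1)
			continue; /* Uses the mbuf's mkey directly. */
		/* Assumption: one operation always fits an empty array. */
		assert(nb_segs[i] <= max_klm);
		if (!open || used + nb_segs[i] > max_klm) {
			if (open)
				first[i + 1] = SKETCH_UMR;
			open = 1;
			used = 0;
		}
		used += nb_segs[i];
	}
	if (open)
		first[0] = SKETCH_UMR;
}
```

For segment counts {1, 40, 30, 2} and a limit of 64, the plan is UMR,
NOP, UMR, NOP: the last two operations share the UMR WQE placed in set
2, and the second operation uses the UMR WQE placed in set 0, so each
UMR WQE precedes the RegEx WQEs that rely on its indirect mkey.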


Suanming Mou (3):
  common/mlx5: add user memory registration bits
  regex/mlx5: add data path scattered mbuf process
  app/test-regex: support scattered mbuf input

 app/test-regex/main.c                    | 134 ++++++--
 drivers/common/mlx5/linux/meson.build    |   2 +
 drivers/common/mlx5/mlx5_devx_cmds.c     |   5 +
 drivers/common/mlx5/mlx5_devx_cmds.h     |   3 +
 drivers/regex/mlx5/mlx5_regex.c          |   9 +
 drivers/regex/mlx5/mlx5_regex.h          |  26 +-
 drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 378 +++++++++++++++++++++--
 8 files changed, 517 insertions(+), 83 deletions(-)

-- 
2.25.1



* [dpdk-dev] [PATCH 1/3] common/mlx5: add user memory registration bits
  2021-03-09 23:57 [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Suanming Mou
@ 2021-03-09 23:57 ` Suanming Mou
  2021-03-09 23:57 ` [dpdk-dev] [PATCH 2/3] regex/mlx5: add data path scattered mbuf process Suanming Mou
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-09 23:57 UTC (permalink / raw)
  To: viacheslavo, matan, orika; +Cc: rasland, dev

This commit adds the UMR capability bits.

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
---
 drivers/common/mlx5/linux/meson.build | 2 ++
 drivers/common/mlx5/mlx5_devx_cmds.c  | 5 +++++
 drivers/common/mlx5/mlx5_devx_cmds.h  | 3 +++
 3 files changed, 10 insertions(+)

diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 220de35420..5d6a861689 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -186,6 +186,8 @@ has_sym_args = [
 	'mlx5dv_dr_action_create_aso' ],
 	[ 'HAVE_INFINIBAND_VERBS_H', 'infiniband/verbs.h',
 	'INFINIBAND_VERBS_H' ],
+        [ 'HAVE_MLX5_UMR_IMKEY', 'infiniband/mlx5dv.h',
+        'MLX5_WQE_UMR_CTRL_FLAG_INLINE' ],
 ]
 config = configuration_data()
 foreach arg:has_sym_args
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.c b/drivers/common/mlx5/mlx5_devx_cmds.c
index 0060c37fc0..b4b7a76db0 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.c
+++ b/drivers/common/mlx5/mlx5_devx_cmds.c
@@ -266,6 +266,7 @@ mlx5_devx_cmd_mkey_create(void *ctx,
 	MLX5_SET(mkc, mkc, qpn, 0xffffff);
 	MLX5_SET(mkc, mkc, pd, attr->pd);
 	MLX5_SET(mkc, mkc, mkey_7_0, attr->umem_id & 0xFF);
+	MLX5_SET(mkc, mkc, umr_en, attr->umr_en);
 	MLX5_SET(mkc, mkc, translations_octword_size, translation_size);
 	MLX5_SET(mkc, mkc, relaxed_ordering_write,
 		 attr->relaxed_ordering_write);
@@ -749,6 +750,10 @@ mlx5_devx_cmd_query_hca_attr(void *ctx,
 						mini_cqe_resp_flow_tag);
 	attr->mini_cqe_resp_l3_l4_tag = MLX5_GET(cmd_hca_cap, hcattr,
 						 mini_cqe_resp_l3_l4_tag);
+	attr->umr_indirect_mkey_disabled =
+		MLX5_GET(cmd_hca_cap, hcattr, umr_indirect_mkey_disabled);
+	attr->umr_modify_entity_size_disabled =
+		MLX5_GET(cmd_hca_cap, hcattr, umr_modify_entity_size_disabled);
 	if (attr->qos.sup) {
 		MLX5_SET(query_hca_cap_in, in, op_mod,
 			 MLX5_GET_HCA_CAP_OP_MOD_QOS_CAP |
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h b/drivers/common/mlx5/mlx5_devx_cmds.h
index bc66d28e83..e300c307e1 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.h
+++ b/drivers/common/mlx5/mlx5_devx_cmds.h
@@ -31,6 +31,7 @@ struct mlx5_devx_mkey_attr {
 	uint32_t pg_access:1;
 	uint32_t relaxed_ordering_write:1;
 	uint32_t relaxed_ordering_read:1;
+	uint32_t umr_en:1;
 	struct mlx5_klm *klm_array;
 	int klm_num;
 };
@@ -147,6 +148,8 @@ struct mlx5_hca_attr {
 	uint32_t log_max_mmo_dma:5;
 	uint32_t log_max_mmo_compress:5;
 	uint32_t log_max_mmo_decompress:5;
+	uint32_t umr_modify_entity_size_disabled:1;
+	uint32_t umr_indirect_mkey_disabled:1;
 };
 
 struct mlx5_devx_wq_attr {
-- 
2.25.1



* [dpdk-dev] [PATCH 2/3] regex/mlx5: add data path scattered mbuf process
  2021-03-09 23:57 [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Suanming Mou
  2021-03-09 23:57 ` [dpdk-dev] [PATCH 1/3] common/mlx5: add user memory registration bits Suanming Mou
@ 2021-03-09 23:57 ` Suanming Mou
  2021-03-09 23:57 ` [dpdk-dev] [PATCH 3/3] app/test-regex: support scattered mbuf input Suanming Mou
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-09 23:57 UTC (permalink / raw)
  To: viacheslavo, matan, orika; +Cc: rasland, dev

A UMR WQE can map the memory spaces of multiple mkeys to one contiguous
space. Taking advantage of this, the scattered mbuf of one operation
can be converted to an indirect mkey, so the RegEx engine, which
accepts only one mkey, can now process the whole scattered mbuf.

The maximum number of scattered mbuf segments supported in one UMR WQE
is now defined as 64. The scattered mbufs of multiple operations can be
added to one UMR WQE if there is enough space in the KLM array, since
each operation can address its own mbuf content by the mkey's address
and length. However, the scattered mbuf of one operation can't be split
across the KLM arrays of two different UMR WQEs. If the UMR WQE's KLM
array does not have enough free space for one operation, a new UMR WQE
is required.

To prevent the UMR WQE's indirect mkey from being overwritten when the
SQ's WQEs wrap around, the mkey index used by the UMR WQE should be the
index of the last RegEx WQE among the operations. As one operation
consumes one WQE set, building the RegEx WQEs in reverse order helps
address the mkey more efficiently. When the operations in one burst
consume multiple mkeys, each time the mkey's KLM array becomes full,
the current reverse WQE set index is always the last one of the new
mkey for the new UMR WQE.

In GGA mode, the SQ's memory layout becomes interleaved UMR/NOP and
RegEx WQEs. One UMR/NOP WQE plus one RegEx WQE is called a WQE set. The
SQ's pi and ci also advance per WQE set, not per WQE.

For operations without a scattered mbuf, the mbuf's mkey is used
directly and the WQE set combination is NOP + RegEx.
For operations with a scattered mbuf that share the UMR WQE with
others, the WQE set combination is NOP + RegEx.
For the operation that completes the UMR WQE, the WQE set combination
is UMR + RegEx.

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
---
 drivers/regex/mlx5/mlx5_regex.c          |   9 +
 drivers/regex/mlx5/mlx5_regex.h          |  26 +-
 drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 378 +++++++++++++++++++++--
 4 files changed, 398 insertions(+), 58 deletions(-)

diff --git a/drivers/regex/mlx5/mlx5_regex.c b/drivers/regex/mlx5/mlx5_regex.c
index f1fd911405..e2031c76a9 100644
--- a/drivers/regex/mlx5/mlx5_regex.c
+++ b/drivers/regex/mlx5/mlx5_regex.c
@@ -198,6 +198,13 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	}
 	priv->regexdev->dev_ops = &mlx5_regexdev_ops;
 	priv->regexdev->enqueue = mlx5_regexdev_enqueue;
+#ifdef HAVE_MLX5_UMR_IMKEY
+	if (!attr.umr_indirect_mkey_disabled &&
+	    !attr.umr_modify_entity_size_disabled)
+		priv->has_umr = 1;
+	if (priv->has_umr)
+		priv->regexdev->enqueue = mlx5_regexdev_enqueue_gga;
+#endif
 	priv->regexdev->dequeue = mlx5_regexdev_dequeue;
 	priv->regexdev->device = (struct rte_device *)pci_dev;
 	priv->regexdev->data->dev_private = priv;
@@ -212,6 +219,8 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	    rte_errno = ENOMEM;
 		goto error;
 	}
+	DRV_LOG(INFO, "RegEx GGA is %s.",
+		priv->has_umr ? "supported" : "unsupported");
 	return 0;
 
 error:
diff --git a/drivers/regex/mlx5/mlx5_regex.h b/drivers/regex/mlx5/mlx5_regex.h
index 484819c38c..d27376453c 100644
--- a/drivers/regex/mlx5/mlx5_regex.h
+++ b/drivers/regex/mlx5/mlx5_regex.h
@@ -15,6 +15,7 @@
 #include <mlx5_common_devx.h>
 
 #include "mlx5_rxp.h"
+#include "mlx5_regex_utils.h"
 
 struct mlx5_regex_sq {
 	uint16_t log_nb_desc; /* Log 2 number of desc for this object. */
@@ -40,6 +41,7 @@ struct mlx5_regex_qp {
 	struct mlx5_regex_job *jobs;
 	struct ibv_mr *metadata;
 	struct ibv_mr *outputs;
+	struct ibv_mr *imkey_addr; /* Indirect mkey array region. */
 	size_t ci, pi;
 	struct mlx5_mr_ctrl mr_ctrl;
 };
@@ -70,8 +72,29 @@ struct mlx5_regex_priv {
 	struct ibv_pd *pd;
 	struct mlx5_mr_share_cache mr_scache; /* Global shared MR cache. */
 	uint8_t is_bf2; /* The device is BF2 device. */
+	uint8_t has_umr; /* The device supports UMR. */
 };
 
+#ifdef HAVE_IBV_FLOW_DV_SUPPORT
+static inline int
+regex_get_pdn(void *pd, uint32_t *pdn)
+{
+	struct mlx5dv_obj obj;
+	struct mlx5dv_pd pd_info;
+	int ret = 0;
+
+	obj.pd.in = pd;
+	obj.pd.out = &pd_info;
+	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
+	if (ret) {
+		DRV_LOG(DEBUG, "Fail to get PD object info");
+		return ret;
+	}
+	*pdn = pd_info.pdn;
+	return 0;
+}
+#endif
+
 /* mlx5_regex.c */
 int mlx5_regex_start(struct rte_regexdev *dev);
 int mlx5_regex_stop(struct rte_regexdev *dev);
@@ -107,5 +130,6 @@ uint16_t mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 		       struct rte_regex_ops **ops, uint16_t nb_ops);
 uint16_t mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 		       struct rte_regex_ops **ops, uint16_t nb_ops);
-
+uint16_t mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
+		       struct rte_regex_ops **ops, uint16_t nb_ops);
 #endif /* MLX5_REGEX_H */
diff --git a/drivers/regex/mlx5/mlx5_regex_control.c b/drivers/regex/mlx5/mlx5_regex_control.c
index df57fada5d..422ea3d483 100644
--- a/drivers/regex/mlx5/mlx5_regex_control.c
+++ b/drivers/regex/mlx5/mlx5_regex_control.c
@@ -27,6 +27,9 @@
 
 #define MLX5_REGEX_NUM_WQE_PER_PAGE (4096/64)
 
+#define MLX5_REGEX_WQE_LOG_NUM(has_umr, log_desc) \
+		((has_umr) ? ((log_desc) + 2) : (log_desc))
+
 /**
  * Returns the number of qp obj to be created.
  *
@@ -91,26 +94,6 @@ regex_ctrl_create_cq(struct mlx5_regex_priv *priv, struct mlx5_regex_cq *cq)
 	return 0;
 }
 
-#ifdef HAVE_IBV_FLOW_DV_SUPPORT
-static int
-regex_get_pdn(void *pd, uint32_t *pdn)
-{
-	struct mlx5dv_obj obj;
-	struct mlx5dv_pd pd_info;
-	int ret = 0;
-
-	obj.pd.in = pd;
-	obj.pd.out = &pd_info;
-	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
-	if (ret) {
-		DRV_LOG(DEBUG, "Fail to get PD object info");
-		return ret;
-	}
-	*pdn = pd_info.pdn;
-	return 0;
-}
-#endif
-
 /**
  * Destroy the SQ object.
  *
@@ -167,14 +150,16 @@ regex_ctrl_create_sq(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 	int ret;
 
 	sq->log_nb_desc = log_nb_desc;
+	sq->sqn = q_ind;
 	sq->ci = 0;
 	sq->pi = 0;
 	ret = regex_get_pdn(priv->pd, &pd_num);
 	if (ret)
 		return ret;
 	attr.wq_attr.pd = pd_num;
-	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj, log_nb_desc, &attr,
-				  SOCKET_ID_ANY);
+	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj,
+			MLX5_REGEX_WQE_LOG_NUM(priv->has_umr, log_nb_desc),
+			&attr, SOCKET_ID_ANY);
 	if (ret) {
 		DRV_LOG(ERR, "Can't create SQ object.");
 		rte_errno = ENOMEM;
@@ -224,10 +209,18 @@ mlx5_regex_qp_setup(struct rte_regexdev *dev, uint16_t qp_ind,
 
 	qp = &priv->qps[qp_ind];
 	qp->flags = cfg->qp_conf_flags;
-	qp->cq.log_nb_desc = rte_log2_u32(cfg->nb_desc);
-	qp->nb_desc = 1 << qp->cq.log_nb_desc;
+	log_desc = rte_log2_u32(cfg->nb_desc);
+	/*
+	 * UMR mode requires two WQEs (UMR and RegEx WQE) for one descriptor.
+	 * For the CQ, multiply the number of CQEs by 2.
+	 * For the SQ, the UMR and RegEx WQEs of one descriptor consume
+	 * 4 WQEBBs, so multiply the number of WQEs by 4.
+	 */
+	qp->cq.log_nb_desc = log_desc + (!!priv->has_umr);
+	qp->nb_desc = 1 << log_desc;
 	if (qp->flags & RTE_REGEX_QUEUE_PAIR_CFG_OOS_F)
-		qp->nb_obj = regex_ctrl_get_nb_obj(qp->nb_desc);
+		qp->nb_obj = regex_ctrl_get_nb_obj
+			(1 << MLX5_REGEX_WQE_LOG_NUM(priv->has_umr, log_desc));
 	else
 		qp->nb_obj = 1;
 	qp->sqs = rte_malloc(NULL,
diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index beaea7b63f..4f9402c583 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -32,6 +32,15 @@
 #define MLX5_REGEX_WQE_GATHER_OFFSET 32
 #define MLX5_REGEX_WQE_SCATTER_OFFSET 48
 #define MLX5_REGEX_METADATA_OFF 32
+#define MLX5_REGEX_UMR_WQE_SIZE 192
+/* Maximum number of KLMs that can be added to one UMR indirect mkey. */
+#define MLX5_REGEX_MAX_KLM_NUM 128
+/* The KLM array size for one job. */
+#define MLX5_REGEX_KLMS_SIZE \
+	((MLX5_REGEX_MAX_KLM_NUM) * sizeof(struct mlx5_klm))
+/* In WQE set mode, the pi wraps at a quarter of MLX5_REGEX_MAX_WQE_INDEX. */
+#define MLX5_REGEX_UMR_SQ_PI_IDX(pi, ops) \
+	(((pi) + (ops)) & (MLX5_REGEX_MAX_WQE_INDEX >> 2))
 
 static inline uint32_t
 sq_size_get(struct mlx5_regex_sq *sq)
@@ -49,6 +58,8 @@ struct mlx5_regex_job {
 	uint64_t user_id;
 	volatile uint8_t *output;
 	volatile uint8_t *metadata;
+	struct mlx5_klm *imkey_array; /* Indirect mkey's KLM array. */
+	struct mlx5_devx_obj *imkey; /* UMR WQE's indirect mkey. */
 } __rte_cached_aligned;
 
 static inline void
@@ -99,12 +110,13 @@ set_wqe_ctrl_seg(struct mlx5_wqe_ctrl_seg *seg, uint16_t pi, uint8_t opcode,
 }
 
 static inline void
-prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
-	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
-	 struct mlx5_regex_job *job)
+__prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
+	   struct rte_regex_ops *op, struct mlx5_regex_job *job,
+	   size_t pi, struct mlx5_klm *klm)
 {
-	size_t wqe_offset = (sq->pi & (sq_size_get(sq) - 1)) * MLX5_SEND_WQE_BB;
-	uint32_t lkey;
+	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
+			    (MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
+			    (priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
 	uint16_t group0 = op->req_flags & RTE_REGEX_OPS_REQ_GROUP_ID0_VALID_F ?
 				op->group_id0 : 0;
 	uint16_t group1 = op->req_flags & RTE_REGEX_OPS_REQ_GROUP_ID1_VALID_F ?
@@ -122,14 +134,11 @@ prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 			       RTE_REGEX_OPS_REQ_GROUP_ID2_VALID_F |
 			       RTE_REGEX_OPS_REQ_GROUP_ID3_VALID_F)))
 		group0 = op->group_id0;
-	lkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
-				  &priv->mr_scache, &qp->mr_ctrl,
-				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
-				  !!(op->mbuf->ol_flags & EXT_ATTACHED_MBUF));
 	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
 	int ds = 4; /*  ctrl + meta + input + output */
 
-	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe, sq->pi,
+	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe,
+			 (priv->has_umr ? (pi * 4 + 3) : pi),
 			 MLX5_OPCODE_MMO, MLX5_OPC_MOD_MMO_REGEX,
 			 sq->sq_obj.sq->id, 0, ds, 0, 0);
 	set_regex_ctrl_seg(wqe + 12, 0, group0, group1, group2, group3,
@@ -137,36 +146,54 @@ prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 	struct mlx5_wqe_data_seg *input_seg =
 		(struct mlx5_wqe_data_seg *)(wqe +
 					     MLX5_REGEX_WQE_GATHER_OFFSET);
-	input_seg->byte_count =
-		rte_cpu_to_be_32(rte_pktmbuf_data_len(op->mbuf));
-	input_seg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(op->mbuf,
-							    uintptr_t));
-	input_seg->lkey = lkey;
+	input_seg->byte_count = rte_cpu_to_be_32(klm->byte_count);
+	input_seg->addr = rte_cpu_to_be_64(klm->address);
+	input_seg->lkey = klm->mkey;
 	job->user_id = op->user_id;
+}
+
+static inline void
+prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
+	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
+	 struct mlx5_regex_job *job)
+{
+	struct mlx5_klm klm;
+
+	klm.byte_count = rte_pktmbuf_data_len(op->mbuf);
+	klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
+				  &priv->mr_scache, &qp->mr_ctrl,
+				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
+				  !!(op->mbuf->ol_flags & EXT_ATTACHED_MBUF));
+	klm.address = rte_pktmbuf_mtod(op->mbuf, uintptr_t);
+	__prep_one(priv, sq, op, job, sq->pi, &klm);
 	sq->db_pi = sq->pi;
 	sq->pi = (sq->pi + 1) & MLX5_REGEX_MAX_WQE_INDEX;
 }
 
 static inline void
-send_doorbell(struct mlx5dv_devx_uar *uar, struct mlx5_regex_sq *sq)
+send_doorbell(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq)
 {
+	struct mlx5dv_devx_uar *uar = priv->uar;
 	size_t wqe_offset = (sq->db_pi & (sq_size_get(sq) - 1)) *
-		MLX5_SEND_WQE_BB;
+		(MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
+		(priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
 	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
-	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
+	/* OR into fm_ce_se instead of assigning, so the fence is kept. */
+	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se |= MLX5_WQE_CTRL_CQ_UPDATE;
 	uint64_t *doorbell_addr =
 		(uint64_t *)((uint8_t *)uar->base_addr + 0x800);
 	rte_io_wmb();
-	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((sq->db_pi + 1) &
-						 MLX5_REGEX_MAX_WQE_INDEX);
+	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((priv->has_umr ?
+					(sq->db_pi * 4 + 3) : sq->db_pi) &
+					MLX5_REGEX_MAX_WQE_INDEX);
 	rte_wmb();
 	*doorbell_addr = *(volatile uint64_t *)wqe;
 	rte_wmb();
 }
 
 static inline int
-can_send(struct mlx5_regex_sq *sq) {
-	return ((uint16_t)(sq->pi - sq->ci) < sq_size_get(sq));
+get_free(struct mlx5_regex_sq *sq) {
+	return (sq_size_get(sq) - (uint16_t)(sq->pi - sq->ci));
 }
 
 static inline uint32_t
@@ -174,6 +201,211 @@ job_id_get(uint32_t qid, size_t sq_size, size_t index) {
 	return qid * sq_size + (index & (sq_size - 1));
 }
 
+#ifdef HAVE_MLX5_UMR_IMKEY
+static inline int
+mkey_klm_available(struct mlx5_klm *klm, uint32_t pos, uint32_t new)
+{
+	return (klm && ((pos + new) <= MLX5_REGEX_MAX_KLM_NUM));
+}
+
+static inline void
+complete_umr_wqe(struct mlx5_regex_qp *qp, struct mlx5_regex_sq *sq,
+		 struct mlx5_regex_job *mkey_job,
+		 size_t umr_index, uint32_t klm_size, uint32_t total_len)
+{
+	size_t wqe_offset = (umr_index & (sq_size_get(sq) - 1)) *
+		(MLX5_SEND_WQE_BB * 4);
+	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg *)((uint8_t *)
+				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
+	struct mlx5_wqe_umr_ctrl_seg *ucseg =
+				(struct mlx5_wqe_umr_ctrl_seg *)(wqe + 1);
+	struct mlx5_wqe_mkey_context_seg *mkc =
+				(struct mlx5_wqe_mkey_context_seg *)(ucseg + 1);
+	struct mlx5_klm *iklm = (struct mlx5_klm *)(mkc + 1);
+	uint16_t klm_align = RTE_ALIGN(klm_size, 4);
+
+	memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
+	/* Set WQE control seg. Non-inline KLM UMR WQE size must be 9 WQE_DS. */
+	set_wqe_ctrl_seg(wqe, (umr_index * 4), MLX5_OPCODE_UMR,
+			 0, sq->sq_obj.sq->id, 0, 9, 0,
+			 rte_cpu_to_be_32(mkey_job->imkey->id));
+	/* Set UMR WQE control seg. */
+	ucseg->mkey_mask |= rte_cpu_to_be_64(MLX5_WQE_UMR_CTRL_MKEY_MASK_LEN |
+				MLX5_WQE_UMR_CTRL_FLAG_TRNSLATION_OFFSET |
+				MLX5_WQE_UMR_CTRL_MKEY_MASK_ACCESS_LOCAL_WRITE);
+	ucseg->klm_octowords = rte_cpu_to_be_16(klm_align);
+	/* Set mkey context seg. */
+	mkc->len = rte_cpu_to_be_64(total_len);
+	mkc->qpn_mkey = rte_cpu_to_be_32(0xffffff00 |
+					(mkey_job->imkey->id & 0xff));
+	/* Set UMR pointer to data seg. */
+	iklm->address = rte_cpu_to_be_64
+				((uintptr_t)((char *)mkey_job->imkey_array));
+	iklm->mkey = rte_cpu_to_be_32(qp->imkey_addr->lkey);
+	iklm->byte_count = rte_cpu_to_be_32(klm_align);
+	/* Clear the padding memory. */
+	memset((uint8_t *)&mkey_job->imkey_array[klm_size], 0,
+	       sizeof(struct mlx5_klm) * (klm_align - klm_size));
+
+	/* Add the following RegEx WQE with fence. */
+	wqe = (struct mlx5_wqe_ctrl_seg *)
+				(((uint8_t *)wqe) + MLX5_REGEX_UMR_WQE_SIZE);
+	wqe->fm_ce_se |= MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE;
+}
+
+static inline void
+prep_nop_regex_wqe_set(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
+		       struct rte_regex_ops *op, struct mlx5_regex_job *job,
+		       size_t pi, struct mlx5_klm *klm)
+{
+	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
+			    (MLX5_SEND_WQE_BB << 2);
+	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg *)((uint8_t *)
+				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
+
+	/* Clear the WQE memory used as UMR WQE previously. */
+	if ((rte_be_to_cpu_32(wqe->opmod_idx_opcode) & 0xff) != MLX5_OPCODE_NOP)
+		memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
+	/* UMR WQE size is 9 DS, align nop WQE to 3 WQEBBS(12 DS). */
+	set_wqe_ctrl_seg(wqe, pi * 4, MLX5_OPCODE_NOP, 0, sq->sq_obj.sq->id,
+			 0, 12, 0, 0);
+	__prep_one(priv, sq, op, job, pi, klm);
+}
+
+static inline void
+prep_regex_umr_wqe_set(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
+	 struct mlx5_regex_sq *sq, struct rte_regex_ops **op, size_t nb_ops)
+{
+	struct mlx5_regex_job *job = NULL;
+	size_t sqid = sq->sqn, mkey_job_id = 0;
+	size_t left_ops = nb_ops;
+	uint32_t klm_num = 0, len;
+	struct mlx5_klm *mkey_klm = NULL;
+	struct mlx5_klm klm;
+
+	sqid = sq->sqn;
+	while (left_ops--)
+		rte_prefetch0(op[left_ops]);
+	left_ops = nb_ops;
+	/*
+	 * Build the WQE sets in reverse. A burst may consume multiple
+	 * mkeys, and building the WQE sets in forward order would make
+	 * it hard to address the last mkey index, since the last RegEx
+	 * WQE's index is only known once building finishes.
+	 */
+	while (left_ops--) {
+		struct rte_mbuf *mbuf = op[left_ops]->mbuf;
+		size_t pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, left_ops);
+
+		if (mbuf->nb_segs > 1) {
+			size_t scatter_size = 0;
+
+			if (!mkey_klm_available(mkey_klm, klm_num,
+						mbuf->nb_segs)) {
+				/*
+				 * The mkey's KLM is full, create the UMR
+				 * WQE in the next WQE set.
+				 */
+				if (mkey_klm)
+					complete_umr_wqe(qp, sq,
+						&qp->jobs[mkey_job_id],
+						MLX5_REGEX_UMR_SQ_PI_IDX(pi, 1),
+						klm_num, len);
+				/*
+				 * Get the indirect mkey and KLM array index
+				 * from the last WQE set.
+				 */
+				mkey_job_id = job_id_get(sqid,
+							 sq_size_get(sq), pi);
+				mkey_klm = qp->jobs[mkey_job_id].imkey_array;
+				klm_num = 0;
+				len = 0;
+			}
+			/* Build RegEx WQE's data segment KLM. */
+			klm.address = len;
+			klm.mkey = rte_cpu_to_be_32
+					(qp->jobs[mkey_job_id].imkey->id);
+			while (mbuf) {
+				/* Build indirect mkey seg's KLM. */
+				mkey_klm->mkey = mlx5_mr_addr2mr_bh(priv->pd,
+					NULL, &priv->mr_scache, &qp->mr_ctrl,
+					rte_pktmbuf_mtod(mbuf, uintptr_t),
+					!!(mbuf->ol_flags & EXT_ATTACHED_MBUF));
+				mkey_klm->address = rte_cpu_to_be_64
+					(rte_pktmbuf_mtod(mbuf, uintptr_t));
+				mkey_klm->byte_count = rte_cpu_to_be_32
+						(rte_pktmbuf_data_len(mbuf));
+				/*
+				 * Save the mbuf's total size for RegEx data
+				 * segment.
+				 */
+				scatter_size += rte_pktmbuf_data_len(mbuf);
+				mkey_klm++;
+				klm_num++;
+				mbuf = mbuf->next;
+			}
+			len += scatter_size;
+			klm.byte_count = scatter_size;
+		} else {
+			/* The single mbuf case. Build the KLM directly. */
+			klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, NULL,
+					&priv->mr_scache, &qp->mr_ctrl,
+					rte_pktmbuf_mtod(mbuf, uintptr_t),
+					!!(mbuf->ol_flags & EXT_ATTACHED_MBUF));
+			klm.address = rte_pktmbuf_mtod(mbuf, uintptr_t);
+			klm.byte_count = rte_pktmbuf_data_len(mbuf);
+		}
+		job = &qp->jobs[job_id_get(sqid, sq_size_get(sq), pi)];
+		/*
+		 * Build the nop + RegEx WQE set by default. The first nop WQE
+		 * will be updated later as UMR WQE if a scattered mbuf exists.
+		 */
+		prep_nop_regex_wqe_set(priv, sq, op[left_ops], job, pi, &klm);
+	}
+	/*
+	 * Scattered mbuf have been added to the KLM array. Complete the build
+	 * of UMR WQE, update the first nop WQE as UMR WQE.
+	 */
+	if (mkey_klm)
+		complete_umr_wqe(qp, sq, &qp->jobs[mkey_job_id], sq->pi,
+				 klm_num, len);
+	sq->db_pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops - 1);
+	sq->pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops);
+}
+
+uint16_t
+mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
+			  struct rte_regex_ops **ops, uint16_t nb_ops)
+{
+	struct mlx5_regex_priv *priv = dev->data->dev_private;
+	struct mlx5_regex_qp *queue = &priv->qps[qp_id];
+	struct mlx5_regex_sq *sq;
+	size_t sqid, nb_left = nb_ops, nb_desc;
+
+	while ((sqid = ffs(queue->free_sqs))) {
+		sqid--; /* ffs returns 1 for bit 0 */
+		sq = &queue->sqs[sqid];
+		nb_desc = get_free(sq);
+		if (nb_desc) {
+			/* The number of ops handled can't exceed nb_ops. */
+			if (nb_desc > nb_left)
+				nb_desc = nb_left;
+			else
+				queue->free_sqs &= ~(1 << sqid);
+			prep_regex_umr_wqe_set(priv, queue, sq, ops, nb_desc);
+			send_doorbell(priv, sq);
+			nb_left -= nb_desc;
+		}
+		if (!nb_left)
+			break;
+		ops += nb_desc;
+	}
+	nb_ops -= nb_left;
+	queue->pi += nb_ops;
+	return nb_ops;
+}
+#endif
+
 uint16_t
 mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 		      struct rte_regex_ops **ops, uint16_t nb_ops)
@@ -186,17 +418,17 @@ mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 	while ((sqid = ffs(queue->free_sqs))) {
 		sqid--; /* ffs returns 1 for bit 0 */
 		sq = &queue->sqs[sqid];
-		while (can_send(sq)) {
+		while (get_free(sq)) {
 			job_id = job_id_get(sqid, sq_size_get(sq), sq->pi);
 			prep_one(priv, queue, sq, ops[i], &queue->jobs[job_id]);
 			i++;
 			if (unlikely(i == nb_ops)) {
-				send_doorbell(priv->uar, sq);
+				send_doorbell(priv, sq);
 				goto out;
 			}
 		}
 		queue->free_sqs &= ~(1 << sqid);
-		send_doorbell(priv->uar, sq);
+		send_doorbell(priv, sq);
 	}
 
 out:
@@ -308,6 +540,10 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 			  MLX5_REGEX_MAX_WQE_INDEX;
 		size_t sqid = cqe->rsvd3[2];
 		struct mlx5_regex_sq *sq = &queue->sqs[sqid];
+
+		/* In UMR mode the WQE counter moves per WQE set (4 WQEBBs). */
+		if (priv->has_umr)
+			wq_counter >>= 2;
 		while (sq->ci != wq_counter) {
 			if (unlikely(i == nb_ops)) {
 				/* Return without updating cq->ci */
@@ -316,7 +552,9 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 			uint32_t job_id = job_id_get(sqid, sq_size_get(sq),
 						     sq->ci);
 			extract_result(ops[i], &queue->jobs[job_id]);
-			sq->ci = (sq->ci + 1) & MLX5_REGEX_MAX_WQE_INDEX;
+			sq->ci = (sq->ci + 1) & (priv->has_umr ?
+				 (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
+				  MLX5_REGEX_MAX_WQE_INDEX);
 			i++;
 		}
 		cq->ci = (cq->ci + 1) & 0xffffff;
@@ -331,7 +569,7 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 }
 
 static void
-setup_sqs(struct mlx5_regex_qp *queue)
+setup_sqs(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *queue)
 {
 	size_t sqid, entry;
 	uint32_t job_id;
@@ -342,6 +580,14 @@ setup_sqs(struct mlx5_regex_qp *queue)
 			job_id = sqid * sq_size_get(sq) + entry;
 			struct mlx5_regex_job *job = &queue->jobs[job_id];
 
+			/* Fill UMR WQE with NOP in advance. */
+			if (priv->has_umr) {
+				set_wqe_ctrl_seg
+					((struct mlx5_wqe_ctrl_seg *)wqe,
+					 entry * 2, MLX5_OPCODE_NOP, 0,
+					 sq->sq_obj.sq->id, 0, 12, 0, 0);
+				wqe += MLX5_REGEX_UMR_WQE_SIZE;
+			}
 			set_metadata_seg((struct mlx5_wqe_metadata_seg *)
 					 (wqe + MLX5_REGEX_WQE_METADATA_OFFSET),
 					 0, queue->metadata->lkey,
@@ -358,8 +604,9 @@ setup_sqs(struct mlx5_regex_qp *queue)
 }
 
 static int
-setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
+setup_buffers(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp)
 {
+	struct ibv_pd *pd = priv->pd;
 	uint32_t i;
 	int err;
 
@@ -395,6 +642,24 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
 		goto err_output;
 	}
 
+	if (priv->has_umr) {
+		ptr = rte_calloc(__func__, qp->nb_desc, MLX5_REGEX_KLMS_SIZE,
+				 MLX5_REGEX_KLMS_SIZE);
+		if (!ptr) {
+			err = -ENOMEM;
+			goto err_imkey;
+		}
+		qp->imkey_addr = mlx5_glue->reg_mr(pd, ptr,
+					MLX5_REGEX_KLMS_SIZE * qp->nb_desc,
+					IBV_ACCESS_LOCAL_WRITE);
+		if (!qp->imkey_addr) {
+			rte_free(ptr);
+			DRV_LOG(ERR, "Failed to register imkey KLM array");
+			err = -EINVAL;
+			goto err_imkey;
+		}
+	}
+
 	/* distribute buffers to jobs */
 	for (i = 0; i < qp->nb_desc; i++) {
 		qp->jobs[i].output =
@@ -403,9 +668,18 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
 		qp->jobs[i].metadata =
 			(uint8_t *)qp->metadata->addr +
 			(i % qp->nb_desc) * MLX5_REGEX_METADATA_SIZE;
+		if (qp->imkey_addr)
+			qp->jobs[i].imkey_array = (struct mlx5_klm *)
+				qp->imkey_addr->addr +
+				(i % qp->nb_desc) * MLX5_REGEX_MAX_KLM_NUM;
 	}
+
 	return 0;
 
+err_imkey:
+	ptr = qp->outputs->addr;
+	rte_free(ptr);
+	mlx5_glue->dereg_mr(qp->outputs);
 err_output:
 	ptr = qp->metadata->addr;
 	rte_free(ptr);
@@ -417,23 +691,57 @@ int
 mlx5_regexdev_setup_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
 {
 	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
-	int err;
+	struct mlx5_klm klm = { 0 };
+	struct mlx5_devx_mkey_attr attr = {
+		.klm_array = &klm,
+		.klm_num = 1,
+		.umr_en = 1,
+	};
+	uint32_t i;
+	int err = 0;
 
 	qp->jobs = rte_calloc(__func__, qp->nb_desc, sizeof(*qp->jobs), 64);
 	if (!qp->jobs)
 		return -ENOMEM;
-	err = setup_buffers(qp, priv->pd);
+	err = setup_buffers(priv, qp);
 	if (err) {
 		rte_free(qp->jobs);
 		return err;
 	}
-	setup_sqs(qp);
-	return 0;
+
+	setup_sqs(priv, qp);
+
+	if (priv->has_umr) {
+#ifdef HAVE_IBV_FLOW_DV_SUPPORT
+		if (regex_get_pdn(priv->pd, &attr.pd)) {
+			err = -rte_errno;
+			DRV_LOG(ERR, "Failed to get pdn.");
+			mlx5_regexdev_teardown_fastpath(priv, qp_id);
+			return err;
+		}
+#endif
+		for (i = 0; i < qp->nb_desc; i++) {
+			attr.klm_num = MLX5_REGEX_MAX_KLM_NUM;
+			attr.klm_array = qp->jobs[i].imkey_array;
+			qp->jobs[i].imkey = mlx5_devx_cmd_mkey_create(priv->ctx,
+								      &attr);
+			if (!qp->jobs[i].imkey) {
+				err = -rte_errno;
+				DRV_LOG(ERR, "Failed to allocate imkey.");
+				mlx5_regexdev_teardown_fastpath(priv, qp_id);
+			}
+		}
+	}
+	return err;
 }
 
 static void
 free_buffers(struct mlx5_regex_qp *qp)
 {
+	if (qp->imkey_addr) {
+		mlx5_glue->dereg_mr(qp->imkey_addr);
+		rte_free(qp->imkey_addr->addr);
+	}
 	if (qp->metadata) {
 		mlx5_glue->dereg_mr(qp->metadata);
 		rte_free(qp->metadata->addr);
@@ -448,8 +756,14 @@ void
 mlx5_regexdev_teardown_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
 {
 	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
+	uint32_t i;
 
 	if (qp) {
+		for (i = 0; i < qp->nb_desc; i++) {
+			if (qp->jobs[i].imkey)
+				claim_zero(mlx5_devx_cmd_destroy
+							(qp->jobs[i].imkey));
+		}
 		free_buffers(qp);
 		if (qp->jobs)
 			rte_free(qp->jobs);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH 3/3] app/test-regex: support scattered mbuf input
  2021-03-09 23:57 [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Suanming Mou
  2021-03-09 23:57 ` [dpdk-dev] [PATCH 1/3] common/mlx5: add user memory registration bits Suanming Mou
  2021-03-09 23:57 ` [dpdk-dev] [PATCH 2/3] regex/mlx5: add data path scattered mbuf process Suanming Mou
@ 2021-03-09 23:57 ` Suanming Mou
  2021-03-24 21:14 ` [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Thomas Monjalon
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-09 23:57 UTC (permalink / raw)
  To: viacheslavo, matan, orika; +Cc: rasland, dev

This commit adds scattered mbuf input support.

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
---
 app/test-regex/main.c | 134 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 109 insertions(+), 25 deletions(-)

diff --git a/app/test-regex/main.c b/app/test-regex/main.c
index aea4fa6b88..82cffaacfa 100644
--- a/app/test-regex/main.c
+++ b/app/test-regex/main.c
@@ -35,6 +35,7 @@ enum app_args {
 	ARG_NUM_OF_ITERATIONS,
 	ARG_NUM_OF_QPS,
 	ARG_NUM_OF_LCORES,
+	ARG_NUM_OF_MBUF_SEGS,
 };
 
 struct job_ctx {
@@ -70,6 +71,7 @@ struct regex_conf {
 	char *data_buf;
 	long data_len;
 	long job_len;
+	uint32_t nb_segs;
 };
 
 static void
@@ -82,14 +84,15 @@ usage(const char *prog_name)
 		" --perf N: only outputs the performance data\n"
 		" --nb_iter N: number of iteration to run\n"
 		" --nb_qps N: number of queues to use\n"
-		" --nb_lcores N: number of lcores to use\n",
+		" --nb_lcores N: number of lcores to use\n"
+		" --nb_segs N: number of mbuf segments\n",
 		prog_name);
 }
 
 static void
 args_parse(int argc, char **argv, char *rules_file, char *data_file,
 	   uint32_t *nb_jobs, bool *perf_mode, uint32_t *nb_iterations,
-	   uint32_t *nb_qps, uint32_t *nb_lcores)
+	   uint32_t *nb_qps, uint32_t *nb_lcores, uint32_t *nb_segs)
 {
 	char **argvopt;
 	int opt;
@@ -111,6 +114,8 @@ args_parse(int argc, char **argv, char *rules_file, char *data_file,
 		{ "nb_qps", 1, 0, ARG_NUM_OF_QPS},
 		/* Number of lcores. */
 		{ "nb_lcores", 1, 0, ARG_NUM_OF_LCORES},
+		/* Number of mbuf segments. */
+		{ "nb_segs", 1, 0, ARG_NUM_OF_MBUF_SEGS},
 		/* End of options */
 		{ 0, 0, 0, 0 }
 	};
@@ -150,6 +155,9 @@ args_parse(int argc, char **argv, char *rules_file, char *data_file,
 		case ARG_NUM_OF_LCORES:
 			*nb_lcores = atoi(optarg);
 			break;
+		case ARG_NUM_OF_MBUF_SEGS:
+			*nb_segs = atoi(optarg);
+			break;
 		case ARG_HELP:
 			usage("RegEx test app");
 			break;
@@ -302,11 +310,75 @@ extbuf_free_cb(void *addr __rte_unused, void *fcb_opaque __rte_unused)
 {
 }
 
+static inline struct rte_mbuf *
+regex_create_segmented_mbuf(struct rte_mempool *mbuf_pool, int pkt_len,
+		int nb_segs, void *buf) {
+
+	struct rte_mbuf *m = NULL, *mbuf = NULL;
+	uint8_t *dst;
+	char *src = buf;
+	int data_len = 0;
+	int i, size;
+	int t_len;
+
+	if (pkt_len < 1) {
+		printf("Packet size must be 1 or more (is %d)\n", pkt_len);
+		return NULL;
+	}
+
+	if (nb_segs < 1) {
+		printf("Number of segments must be 1 or more (is %d)\n",
+				nb_segs);
+		return NULL;
+	}
+
+	t_len = pkt_len >= nb_segs ? (pkt_len / nb_segs +
+				     !!(pkt_len % nb_segs)) : 1;
+	size = pkt_len;
+
+	/* Create chained mbuf_src and fill it with buf data */
+	for (i = 0; size > 0; i++) {
+
+		m = rte_pktmbuf_alloc(mbuf_pool);
+		if (i == 0)
+			mbuf = m;
+
+		if (m == NULL) {
+			printf("Cannot create segment for source mbuf\n");
+			goto fail;
+		}
+
+		data_len = size > t_len ? t_len : size;
+		memset(rte_pktmbuf_mtod(m, uint8_t *), 0,
+				rte_pktmbuf_tailroom(m));
+		memcpy(rte_pktmbuf_mtod(m, uint8_t *), src, data_len);
+		dst = (uint8_t *)rte_pktmbuf_append(m, data_len);
+		if (dst == NULL) {
+			printf("Cannot append %d bytes to the mbuf\n",
+					data_len);
+			goto fail;
+		}
+
+		if (mbuf != m)
+			rte_pktmbuf_chain(mbuf, m);
+		src += data_len;
+		size -= data_len;
+
+	}
+	return mbuf;
+
+fail:
+	if (mbuf)
+		rte_pktmbuf_free(mbuf);
+	return NULL;
+}
+
 static int
 run_regex(void *args)
 {
 	struct regex_conf *rgxc = args;
 	uint32_t nb_jobs = rgxc->nb_jobs;
+	uint32_t nb_segs = rgxc->nb_segs;
 	uint32_t nb_iterations = rgxc->nb_iterations;
 	uint8_t nb_max_matches = rgxc->nb_max_matches;
 	uint32_t nb_qps = rgxc->nb_qps;
@@ -338,8 +410,12 @@ run_regex(void *args)
 	snprintf(mbuf_pool,
 		 sizeof(mbuf_pool),
 		 "mbuf_pool_%2u", qp_id_base);
-	mbuf_mp = rte_pktmbuf_pool_create(mbuf_pool, nb_jobs * nb_qps, 0,
-			0, MBUF_SIZE, rte_socket_id());
+	mbuf_mp = rte_pktmbuf_pool_create(mbuf_pool,
+			rte_align32pow2(nb_jobs * nb_qps * nb_segs),
+			0, 0, (nb_segs == 1) ? MBUF_SIZE :
+			(rte_align32pow2(job_len) / nb_segs +
+			RTE_PKTMBUF_HEADROOM),
+			rte_socket_id());
 	if (mbuf_mp == NULL) {
 		printf("Error, can't create memory pool\n");
 		return -ENOMEM;
@@ -375,8 +451,19 @@ run_regex(void *args)
 			goto end;
 		}
 
+		if (clone_buf(data_buf, &buf, data_len)) {
+			printf("Error, can't clone buf.\n");
+			res = -EXIT_FAILURE;
+			goto end;
+		}
+
+		/* Assign each mbuf with the data to handle. */
+		actual_jobs = 0;
+		pos = 0;
 		/* Allocate the jobs and assign each job with an mbuf. */
-		for (i = 0; i < nb_jobs; i++) {
+		for (i = 0; (pos < data_len) && (i < nb_jobs) ; i++) {
+			long act_job_len = RTE_MIN(job_len, data_len - pos);
+
 			ops[i] = rte_malloc(NULL, sizeof(*ops[0]) +
 					nb_max_matches *
 					sizeof(struct rte_regexdev_match), 0);
@@ -386,30 +473,26 @@ run_regex(void *args)
 				res = -ENOMEM;
 				goto end;
 			}
-			ops[i]->mbuf = rte_pktmbuf_alloc(mbuf_mp);
+			if (nb_segs > 1) {
+				ops[i]->mbuf = regex_create_segmented_mbuf
+							(mbuf_mp, act_job_len,
+							 nb_segs, &buf[pos]);
+			} else {
+				ops[i]->mbuf = rte_pktmbuf_alloc(mbuf_mp);
+				if (ops[i]->mbuf) {
+					rte_pktmbuf_attach_extbuf(ops[i]->mbuf,
+					&buf[pos], 0, act_job_len, &shinfo);
+					ops[i]->mbuf->data_len = job_len;
+					ops[i]->mbuf->pkt_len = act_job_len;
+				}
+			}
 			if (!ops[i]->mbuf) {
-				printf("Error, can't attach mbuf.\n");
+				printf("Error, can't add mbuf.\n");
 				res = -ENOMEM;
 				goto end;
 			}
-		}
 
-		if (clone_buf(data_buf, &buf, data_len)) {
-			printf("Error, can't clone buf.\n");
-			res = -EXIT_FAILURE;
-			goto end;
-		}
-
-		/* Assign each mbuf with the data to handle. */
-		actual_jobs = 0;
-		pos = 0;
-		for (i = 0; (pos < data_len) && (i < nb_jobs) ; i++) {
-			long act_job_len = RTE_MIN(job_len, data_len - pos);
-			rte_pktmbuf_attach_extbuf(ops[i]->mbuf, &buf[pos], 0,
-					act_job_len, &shinfo);
 			jobs_ctx[i].mbuf = ops[i]->mbuf;
-			ops[i]->mbuf->data_len = job_len;
-			ops[i]->mbuf->pkt_len = act_job_len;
 			ops[i]->user_id = i;
 			ops[i]->group_id0 = 1;
 			pos += act_job_len;
@@ -612,7 +695,7 @@ main(int argc, char **argv)
 	char *data_buf;
 	long data_len;
 	long job_len;
-	uint32_t nb_lcores = 1;
+	uint32_t nb_lcores = 1, nb_segs = 1;
 	struct regex_conf *rgxc;
 	uint32_t i;
 	struct qps_per_lcore *qps_per_lcore;
@@ -626,7 +709,7 @@ main(int argc, char **argv)
 	if (argc > 1)
 		args_parse(argc, argv, rules_file, data_file, &nb_jobs,
 				&perf_mode, &nb_iterations, &nb_qps,
-				&nb_lcores);
+				&nb_lcores, &nb_segs);
 
 	if (nb_qps == 0)
 		rte_exit(EXIT_FAILURE, "Number of QPs must be greater than 0\n");
@@ -656,6 +739,7 @@ main(int argc, char **argv)
 	for (i = 0; i < nb_lcores; i++) {
 		rgxc[i] = (struct regex_conf){
 			.nb_jobs = nb_jobs,
+			.nb_segs = nb_segs,
 			.perf_mode = perf_mode,
 			.nb_iterations = nb_iterations,
 			.nb_max_matches = nb_max_matches,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf
  2021-03-09 23:57 [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Suanming Mou
                   ` (2 preceding siblings ...)
  2021-03-09 23:57 ` [dpdk-dev] [PATCH 3/3] app/test-regex: support scattered mbuf input Suanming Mou
@ 2021-03-24 21:14 ` Thomas Monjalon
  2021-03-25  4:32 ` [dpdk-dev] [PATCH v2 0/4] " Suanming Mou
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Thomas Monjalon @ 2021-03-24 21:14 UTC (permalink / raw)
  To: orika; +Cc: viacheslavo, matan, rasland, dev, Suanming Mou

> Suanming Mou (3):
>   common/mlx5: add user memory registration bits
>   regex/mlx5: add data path scattered mbuf process
>   app/test-regex: support scattered mbuf input

Ori, could you review please?





^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v2 0/4] regex/mlx5: support scattered mbuf
  2021-03-09 23:57 [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Suanming Mou
                   ` (3 preceding siblings ...)
  2021-03-24 21:14 ` [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Thomas Monjalon
@ 2021-03-25  4:32 ` Suanming Mou
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 1/4] common/mlx5: add user memory registration bits Suanming Mou
                     ` (3 more replies)
  2021-03-30  1:39 ` [dpdk-dev] [PATCH v3 0/4] regex/mlx5: support scattered mbuf Suanming Mou
                   ` (2 subsequent siblings)
  7 siblings, 4 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-25  4:32 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

Scattered mbufs were not supported in the mlx5 RegEx driver. This patch
set adds scattered mbuf support by means of the UMR WQE.

A UMR WQE can convert the memory space of multiple mkeys into one
contiguous space. Taking advantage of the UMR WQE, the scattered mbuf in
one operation can be converted to an indirect mkey. The RegEx engine,
which only accepts one mkey, can now process the whole scattered mbuf.

The maximum number of scattered mbuf segments supported in one UMR WQE
is now defined as 64. The scattered mbufs of multiple operations can be
added to one UMR WQE if there is enough space in the KLM array, since
each operation can address its own mbuf's content by the mkey's address
and length. However, one operation's scattered mbuf cannot be split
across the KLM arrays of two different UMR WQEs. If the UMR WQE's KLM
array does not have enough free space for one operation, a new UMR WQE
is required.
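
The packing rule above can be sketched roughly as follows. This is only
an illustration, not the driver code: assign_umr_groups() and its
parameters are hypothetical names, and the KLM capacity is passed in as
max_klm rather than taken from any driver macro.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the driver: assign each operation to a
 * UMR WQE so that no operation's segments straddle two KLM arrays.
 * nb_segs[i] is the segment count of operation i. Single-segment ops use
 * the mbuf's own mkey and take no KLM entries (group[i] = -1).
 * Returns the number of UMR WQEs needed, or -1 if a single operation
 * has more segments than one KLM array can hold.
 */
static int
assign_umr_groups(const uint32_t *nb_segs, size_t nb_ops, uint32_t max_klm,
		  int *group)
{
	uint32_t used = 0;	/* KLM entries consumed in the current array. */
	int cur = -1;		/* Index of the current UMR WQE, -1 if none. */
	size_t i;

	for (i = 0; i < nb_ops; i++) {
		if (nb_segs[i] <= 1) {
			group[i] = -1;
			continue;
		}
		if (nb_segs[i] > max_klm)
			return -1;
		/* Open a new UMR WQE when the op does not fit the current one. */
		if (cur < 0 || used + nb_segs[i] > max_klm) {
			cur++;
			used = 0;
		}
		group[i] = cur;
		used += nb_segs[i];
	}
	return cur + 1;
}
```

With segment counts {3, 1, 60, 2, 64} and max_klm = 64, the first and
third operations share one UMR WQE, the second needs none, and the last
two each start a new one.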

To keep the UMR WQE's indirect mkey from being overwritten when the SQ's
WQEs wrap around, the mkey index used by the UMR WQE should be the index
of the last RegEx WQE in the operations. As one operation consumes one
WQE set, building the RegEx WQEs in reverse order helps address the mkey
more efficiently. When the operations in one burst consume multiple
mkeys, once an mkey's KLM array is full, the reverse WQE set index is
always the last one of the new mkey for the new UMR WQE.

In GGA mode, the SQ's memory layout interleaves UMR/NOP WQEs with RegEx
WQEs. One UMR/NOP WQE plus one RegEx WQE is called a WQE set. The SQ's
pi and ci are also incremented per WQE set rather than per WQE.
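
The WQE set addressing can be sketched as below, based on the offsets
used in patch 2/4: a set spans 4 WQEBBs of 64 bytes, the UMR/NOP WQE
occupies the first 192 bytes and the RegEx WQE follows it. The macro and
function names here are made up for illustration and are not the
driver's; sq_size is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

#define WQE_BB_SIZE 64u		/* One WQE basic block, in bytes. */
#define WQEBBS_PER_SET 4u	/* 3 WQEBBs of UMR/NOP + 1 WQEBB of RegEx. */
#define UMR_WQE_SIZE 192u	/* Bytes reserved for the UMR/NOP WQE. */

/* Byte offset of the UMR/NOP WQE of the set at producer index pi. */
static inline size_t
wqe_set_offset(size_t pi, size_t sq_size)
{
	return (pi & (sq_size - 1)) * (WQE_BB_SIZE * WQEBBS_PER_SET);
}

/* Byte offset of the RegEx WQE inside the same set. */
static inline size_t
regex_wqe_offset(size_t pi, size_t sq_size)
{
	return wqe_set_offset(pi, sq_size) + UMR_WQE_SIZE;
}

/* WQE index placed in the control segment of the UMR/NOP WQE. */
static inline uint16_t
umr_wqe_index(size_t pi)
{
	return (uint16_t)(pi * 4);
}

/* WQE index placed in the control segment of the RegEx WQE. */
static inline uint16_t
regex_wqe_index(size_t pi)
{
	return (uint16_t)(pi * 4 + 3);
}
```

Because pi counts WQE sets, the hardware WQE index advances by 4 for
every pi step, and the RegEx WQE always sits 3 WQEBBs into its set.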

For operations that do not have a scattered mbuf and use the mbuf's mkey
directly, the WQE set combination is NOP + RegEx.
For operations that have a scattered mbuf but share the UMR WQE with
others, the WQE set combination is NOP + RegEx.
For operations that complete the UMR WQE, the WQE set combination is
UMR + RegEx.

v2:
1. Check for multi-segment mbufs by nb_segs.
2. Add ops prefetch.
3. Allocate ops and mbuf memory together in the test application.
4. Fix the incorrect ci and pi issue.


John Hurley (1):
  regex/mlx5: prevent wrong calculation of free sqs in umr mode

Suanming Mou (3):
  common/mlx5: add user memory registration bits
  regex/mlx5: add data path scattered mbuf process
  app/test-regex: support scattered mbuf input

 app/test-regex/main.c                    | 134 ++++++--
 doc/guides/regexdevs/mlx5.rst            |   5 +
 doc/guides/rel_notes/release_21_05.rst   |   4 +
 doc/guides/tools/testregex.rst           |   3 +
 drivers/common/mlx5/linux/meson.build    |   2 +
 drivers/common/mlx5/mlx5_devx_cmds.c     |   5 +
 drivers/common/mlx5/mlx5_devx_cmds.h     |   3 +
 drivers/regex/mlx5/mlx5_regex.c          |   9 +
 drivers/regex/mlx5/mlx5_regex.h          |  26 +-
 drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 380 +++++++++++++++++++++--
 11 files changed, 531 insertions(+), 83 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v2 1/4] common/mlx5: add user memory registration bits
  2021-03-25  4:32 ` [dpdk-dev] [PATCH v2 0/4] " Suanming Mou
@ 2021-03-25  4:32   ` Suanming Mou
  2021-03-29  9:29     ` Ori Kam
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 2/4] regex/mlx5: add data path scattered mbuf process Suanming Mou
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Suanming Mou @ 2021-03-25  4:32 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

This commit adds the UMR capability bits.

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
---
 drivers/common/mlx5/linux/meson.build | 2 ++
 drivers/common/mlx5/mlx5_devx_cmds.c  | 5 +++++
 drivers/common/mlx5/mlx5_devx_cmds.h  | 3 +++
 3 files changed, 10 insertions(+)

diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 220de35420..5d6a861689 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -186,6 +186,8 @@ has_sym_args = [
 	'mlx5dv_dr_action_create_aso' ],
 	[ 'HAVE_INFINIBAND_VERBS_H', 'infiniband/verbs.h',
 	'INFINIBAND_VERBS_H' ],
+        [ 'HAVE_MLX5_UMR_IMKEY', 'infiniband/mlx5dv.h',
+        'MLX5_WQE_UMR_CTRL_FLAG_INLINE' ],
 ]
 config = configuration_data()
 foreach arg:has_sym_args
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.c b/drivers/common/mlx5/mlx5_devx_cmds.c
index c90e020643..268bcd0d99 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.c
+++ b/drivers/common/mlx5/mlx5_devx_cmds.c
@@ -266,6 +266,7 @@ mlx5_devx_cmd_mkey_create(void *ctx,
 	MLX5_SET(mkc, mkc, qpn, 0xffffff);
 	MLX5_SET(mkc, mkc, pd, attr->pd);
 	MLX5_SET(mkc, mkc, mkey_7_0, attr->umem_id & 0xFF);
+	MLX5_SET(mkc, mkc, umr_en, attr->umr_en);
 	MLX5_SET(mkc, mkc, translations_octword_size, translation_size);
 	MLX5_SET(mkc, mkc, relaxed_ordering_write,
 		 attr->relaxed_ordering_write);
@@ -752,6 +753,10 @@ mlx5_devx_cmd_query_hca_attr(void *ctx,
 						mini_cqe_resp_flow_tag);
 	attr->mini_cqe_resp_l3_l4_tag = MLX5_GET(cmd_hca_cap, hcattr,
 						 mini_cqe_resp_l3_l4_tag);
+	attr->umr_indirect_mkey_disabled =
+		MLX5_GET(cmd_hca_cap, hcattr, umr_indirect_mkey_disabled);
+	attr->umr_modify_entity_size_disabled =
+		MLX5_GET(cmd_hca_cap, hcattr, umr_modify_entity_size_disabled);
 	if (attr->qos.sup) {
 		MLX5_SET(query_hca_cap_in, in, op_mod,
 			 MLX5_GET_HCA_CAP_OP_MOD_QOS_CAP |
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h b/drivers/common/mlx5/mlx5_devx_cmds.h
index 2826c0b2c6..67b5f771c6 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.h
+++ b/drivers/common/mlx5/mlx5_devx_cmds.h
@@ -31,6 +31,7 @@ struct mlx5_devx_mkey_attr {
 	uint32_t pg_access:1;
 	uint32_t relaxed_ordering_write:1;
 	uint32_t relaxed_ordering_read:1;
+	uint32_t umr_en:1;
 	struct mlx5_klm *klm_array;
 	int klm_num;
 };
@@ -151,6 +152,8 @@ struct mlx5_hca_attr {
 	uint32_t log_max_mmo_dma:5;
 	uint32_t log_max_mmo_compress:5;
 	uint32_t log_max_mmo_decompress:5;
+	uint32_t umr_modify_entity_size_disabled:1;
+	uint32_t umr_indirect_mkey_disabled:1;
 };
 
 struct mlx5_devx_wq_attr {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v2 2/4] regex/mlx5: add data path scattered mbuf process
  2021-03-25  4:32 ` [dpdk-dev] [PATCH v2 0/4] " Suanming Mou
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 1/4] common/mlx5: add user memory registration bits Suanming Mou
@ 2021-03-25  4:32   ` Suanming Mou
  2021-03-29  9:34     ` Ori Kam
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 3/4] app/test-regex: support scattered mbuf input Suanming Mou
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode Suanming Mou
  3 siblings, 1 reply; 36+ messages in thread
From: Suanming Mou @ 2021-03-25  4:32 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

A UMR WQE can convert the memory space of multiple mkeys into one
contiguous space. Taking advantage of the UMR WQE, the scattered mbuf in
one operation can be converted to an indirect mkey. The RegEx engine,
which only accepts one mkey, can now process the whole scattered mbuf.

The maximum number of scattered mbuf segments supported in one UMR WQE
is now defined as 64. The scattered mbufs of multiple operations can be
added to one UMR WQE if there is enough space in the KLM array, since
each operation can address its own mbuf's content by the mkey's address
and length. However, one operation's scattered mbuf cannot be split
across the KLM arrays of two different UMR WQEs. If the UMR WQE's KLM
array does not have enough free space for one operation, a new UMR WQE
is required.

To keep the UMR WQE's indirect mkey from being overwritten when the SQ's
WQEs wrap around, the mkey index used by the UMR WQE should be the index
of the last RegEx WQE in the operations. As one operation consumes one
WQE set, building the RegEx WQEs in reverse order helps address the mkey
more efficiently. When the operations in one burst consume multiple
mkeys, once an mkey's KLM array is full, the reverse WQE set index is
always the last one of the new mkey for the new UMR WQE.

In GGA mode, the SQ's memory layout interleaves UMR/NOP WQEs with RegEx
WQEs. One UMR/NOP WQE plus one RegEx WQE is called a WQE set. The SQ's
pi and ci are also incremented per WQE set rather than per WQE.

For operations that do not have a scattered mbuf and use the mbuf's mkey
directly, the WQE set combination is NOP + RegEx.
For operations that have a scattered mbuf but share the UMR WQE with
others, the WQE set combination is NOP + RegEx.
For operations that complete the UMR WQE, the WQE set combination is
UMR + RegEx.

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
---
 doc/guides/regexdevs/mlx5.rst            |   5 +
 doc/guides/rel_notes/release_21_05.rst   |   4 +
 doc/guides/tools/testregex.rst           |   3 +
 drivers/regex/mlx5/mlx5_regex.c          |   9 +
 drivers/regex/mlx5/mlx5_regex.h          |  26 +-
 drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 378 +++++++++++++++++++++--
 7 files changed, 410 insertions(+), 58 deletions(-)

diff --git a/doc/guides/regexdevs/mlx5.rst b/doc/guides/regexdevs/mlx5.rst
index faaa6ac11d..45a0b96980 100644
--- a/doc/guides/regexdevs/mlx5.rst
+++ b/doc/guides/regexdevs/mlx5.rst
@@ -35,6 +35,11 @@ be specified as device parameter. The RegEx device can be probed and used with
 other Mellanox devices, by adding more options in the class.
 For example: ``class=net:regex`` will probe both the net PMD and the RegEx PMD.
 
+Features
+--------
+
+- Multi-segment mbuf support.
+
 Supported NICs
 --------------
 
diff --git a/doc/guides/rel_notes/release_21_05.rst b/doc/guides/rel_notes/release_21_05.rst
index 3d4b061686..281d4aaa64 100644
--- a/doc/guides/rel_notes/release_21_05.rst
+++ b/doc/guides/rel_notes/release_21_05.rst
@@ -113,6 +113,10 @@ New Features
   * Added command to display Rx queue used descriptor count.
     ``show port (port_id) rxq (queue_id) desc used count``
 
+* **Updated Mellanox RegEx PMD.**
+
+  * Added support for multi-segment mbufs.
+
 
 Removed Items
 -------------
diff --git a/doc/guides/tools/testregex.rst b/doc/guides/tools/testregex.rst
index a59acd919f..cdb1ffd6ee 100644
--- a/doc/guides/tools/testregex.rst
+++ b/doc/guides/tools/testregex.rst
@@ -68,6 +68,9 @@ Application Options
 ``--nb_iter N``
   number of iteration to run
 
+``--nb_segs N``
+  number of mbuf segments
+
 ``--help``
   print application options
 
diff --git a/drivers/regex/mlx5/mlx5_regex.c b/drivers/regex/mlx5/mlx5_regex.c
index ac5b205fa9..82c485e50c 100644
--- a/drivers/regex/mlx5/mlx5_regex.c
+++ b/drivers/regex/mlx5/mlx5_regex.c
@@ -199,6 +199,13 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	}
 	priv->regexdev->dev_ops = &mlx5_regexdev_ops;
 	priv->regexdev->enqueue = mlx5_regexdev_enqueue;
+#ifdef HAVE_MLX5_UMR_IMKEY
+	if (!attr.umr_indirect_mkey_disabled &&
+	    !attr.umr_modify_entity_size_disabled)
+		priv->has_umr = 1;
+	if (priv->has_umr)
+		priv->regexdev->enqueue = mlx5_regexdev_enqueue_gga;
+#endif
 	priv->regexdev->dequeue = mlx5_regexdev_dequeue;
 	priv->regexdev->device = (struct rte_device *)pci_dev;
 	priv->regexdev->data->dev_private = priv;
@@ -213,6 +220,8 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	    rte_errno = ENOMEM;
 		goto error;
 	}
+	DRV_LOG(INFO, "RegEx GGA is %s.",
+		priv->has_umr ? "supported" : "unsupported");
 	return 0;
 
 error:
diff --git a/drivers/regex/mlx5/mlx5_regex.h b/drivers/regex/mlx5/mlx5_regex.h
index a2b3f0d9f3..51a2101e53 100644
--- a/drivers/regex/mlx5/mlx5_regex.h
+++ b/drivers/regex/mlx5/mlx5_regex.h
@@ -15,6 +15,7 @@
 #include <mlx5_common_devx.h>
 
 #include "mlx5_rxp.h"
+#include "mlx5_regex_utils.h"
 
 struct mlx5_regex_sq {
 	uint16_t log_nb_desc; /* Log 2 number of desc for this object. */
@@ -40,6 +41,7 @@ struct mlx5_regex_qp {
 	struct mlx5_regex_job *jobs;
 	struct ibv_mr *metadata;
 	struct ibv_mr *outputs;
+	struct ibv_mr *imkey_addr; /* Indirect mkey array region. */
 	size_t ci, pi;
 	struct mlx5_mr_ctrl mr_ctrl;
 };
@@ -71,8 +73,29 @@ struct mlx5_regex_priv {
 	struct mlx5_mr_share_cache mr_scache; /* Global shared MR cache. */
 	uint8_t is_bf2; /* The device is BF2 device. */
 	uint8_t sq_ts_format; /* Whether SQ supports timestamp formats. */
+	uint8_t has_umr; /* The device supports UMR. */
 };
 
+#ifdef HAVE_IBV_FLOW_DV_SUPPORT
+static inline int
+regex_get_pdn(void *pd, uint32_t *pdn)
+{
+	struct mlx5dv_obj obj;
+	struct mlx5dv_pd pd_info;
+	int ret = 0;
+
+	obj.pd.in = pd;
+	obj.pd.out = &pd_info;
+	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
+	if (ret) {
+		DRV_LOG(DEBUG, "Fail to get PD object info");
+		return ret;
+	}
+	*pdn = pd_info.pdn;
+	return 0;
+}
+#endif
+
 /* mlx5_regex.c */
 int mlx5_regex_start(struct rte_regexdev *dev);
 int mlx5_regex_stop(struct rte_regexdev *dev);
@@ -108,5 +131,6 @@ uint16_t mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 		       struct rte_regex_ops **ops, uint16_t nb_ops);
 uint16_t mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 		       struct rte_regex_ops **ops, uint16_t nb_ops);
-
+uint16_t mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
+		       struct rte_regex_ops **ops, uint16_t nb_ops);
 #endif /* MLX5_REGEX_H */
diff --git a/drivers/regex/mlx5/mlx5_regex_control.c b/drivers/regex/mlx5/mlx5_regex_control.c
index 55fbb419ed..eef0fe579d 100644
--- a/drivers/regex/mlx5/mlx5_regex_control.c
+++ b/drivers/regex/mlx5/mlx5_regex_control.c
@@ -27,6 +27,9 @@
 
 #define MLX5_REGEX_NUM_WQE_PER_PAGE (4096/64)
 
+#define MLX5_REGEX_WQE_LOG_NUM(has_umr, log_desc) \
+		((has_umr) ? ((log_desc) + 2) : (log_desc))
+
 /**
  * Returns the number of qp obj to be created.
  *
@@ -91,26 +94,6 @@ regex_ctrl_create_cq(struct mlx5_regex_priv *priv, struct mlx5_regex_cq *cq)
 	return 0;
 }
 
-#ifdef HAVE_IBV_FLOW_DV_SUPPORT
-static int
-regex_get_pdn(void *pd, uint32_t *pdn)
-{
-	struct mlx5dv_obj obj;
-	struct mlx5dv_pd pd_info;
-	int ret = 0;
-
-	obj.pd.in = pd;
-	obj.pd.out = &pd_info;
-	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
-	if (ret) {
-		DRV_LOG(DEBUG, "Fail to get PD object info");
-		return ret;
-	}
-	*pdn = pd_info.pdn;
-	return 0;
-}
-#endif
-
 /**
  * Destroy the SQ object.
  *
@@ -168,14 +151,16 @@ regex_ctrl_create_sq(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 	int ret;
 
 	sq->log_nb_desc = log_nb_desc;
+	sq->sqn = q_ind;
 	sq->ci = 0;
 	sq->pi = 0;
 	ret = regex_get_pdn(priv->pd, &pd_num);
 	if (ret)
 		return ret;
 	attr.wq_attr.pd = pd_num;
-	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj, log_nb_desc, &attr,
-				  SOCKET_ID_ANY);
+	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj,
+			MLX5_REGEX_WQE_LOG_NUM(priv->has_umr, log_nb_desc),
+			&attr, SOCKET_ID_ANY);
 	if (ret) {
 		DRV_LOG(ERR, "Can't create SQ object.");
 		rte_errno = ENOMEM;
@@ -225,10 +210,18 @@ mlx5_regex_qp_setup(struct rte_regexdev *dev, uint16_t qp_ind,
 
 	qp = &priv->qps[qp_ind];
 	qp->flags = cfg->qp_conf_flags;
-	qp->cq.log_nb_desc = rte_log2_u32(cfg->nb_desc);
-	qp->nb_desc = 1 << qp->cq.log_nb_desc;
+	log_desc = rte_log2_u32(cfg->nb_desc);
+	/*
+	 * UMR mode requires two WQEs (UMR and RegEx WQE) for one descriptor.
+	 * For the CQ, multiply the number of CQEs by 2.
+	 * For the SQ, the UMR and RegEx WQEs for one descriptor consume
+	 * 4 WQEBBs, so multiply the number of WQEs by 4.
+	 */
+	qp->cq.log_nb_desc = log_desc + (!!priv->has_umr);
+	qp->nb_desc = 1 << log_desc;
 	if (qp->flags & RTE_REGEX_QUEUE_PAIR_CFG_OOS_F)
-		qp->nb_obj = regex_ctrl_get_nb_obj(qp->nb_desc);
+		qp->nb_obj = regex_ctrl_get_nb_obj
+			(1 << MLX5_REGEX_WQE_LOG_NUM(priv->has_umr, log_desc));
 	else
 		qp->nb_obj = 1;
 	qp->sqs = rte_malloc(NULL,
diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index beaea7b63f..4f9402c583 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -32,6 +32,15 @@
 #define MLX5_REGEX_WQE_GATHER_OFFSET 32
 #define MLX5_REGEX_WQE_SCATTER_OFFSET 48
 #define MLX5_REGEX_METADATA_OFF 32
+#define MLX5_REGEX_UMR_WQE_SIZE 192
+/* The maximum number of KLMs that can be added to one UMR indirect mkey. */
+#define MLX5_REGEX_MAX_KLM_NUM 128
+/* The KLM array size for one job. */
+#define MLX5_REGEX_KLMS_SIZE \
+	((MLX5_REGEX_MAX_KLM_NUM) * sizeof(struct mlx5_klm))
+/* In WQE set mode, the pi wraps at a quarter of MLX5_REGEX_MAX_WQE_INDEX. */
+#define MLX5_REGEX_UMR_SQ_PI_IDX(pi, ops) \
+	(((pi) + (ops)) & (MLX5_REGEX_MAX_WQE_INDEX >> 2))
 
 static inline uint32_t
 sq_size_get(struct mlx5_regex_sq *sq)
@@ -49,6 +58,8 @@ struct mlx5_regex_job {
 	uint64_t user_id;
 	volatile uint8_t *output;
 	volatile uint8_t *metadata;
+	struct mlx5_klm *imkey_array; /* Indirect mkey's KLM array. */
+	struct mlx5_devx_obj *imkey; /* UMR WQE's indirect mkey. */
 } __rte_cached_aligned;
 
 static inline void
@@ -99,12 +110,13 @@ set_wqe_ctrl_seg(struct mlx5_wqe_ctrl_seg *seg, uint16_t pi, uint8_t opcode,
 }
 
 static inline void
-prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
-	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
-	 struct mlx5_regex_job *job)
+__prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
+	   struct rte_regex_ops *op, struct mlx5_regex_job *job,
+	   size_t pi, struct mlx5_klm *klm)
 {
-	size_t wqe_offset = (sq->pi & (sq_size_get(sq) - 1)) * MLX5_SEND_WQE_BB;
-	uint32_t lkey;
+	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
+			    (MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
+			    (priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
 	uint16_t group0 = op->req_flags & RTE_REGEX_OPS_REQ_GROUP_ID0_VALID_F ?
 				op->group_id0 : 0;
 	uint16_t group1 = op->req_flags & RTE_REGEX_OPS_REQ_GROUP_ID1_VALID_F ?
@@ -122,14 +134,11 @@ prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 			       RTE_REGEX_OPS_REQ_GROUP_ID2_VALID_F |
 			       RTE_REGEX_OPS_REQ_GROUP_ID3_VALID_F)))
 		group0 = op->group_id0;
-	lkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
-				  &priv->mr_scache, &qp->mr_ctrl,
-				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
-				  !!(op->mbuf->ol_flags & EXT_ATTACHED_MBUF));
 	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
 	int ds = 4; /*  ctrl + meta + input + output */
 
-	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe, sq->pi,
+	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe,
+			 (priv->has_umr ? (pi * 4 + 3) : pi),
 			 MLX5_OPCODE_MMO, MLX5_OPC_MOD_MMO_REGEX,
 			 sq->sq_obj.sq->id, 0, ds, 0, 0);
 	set_regex_ctrl_seg(wqe + 12, 0, group0, group1, group2, group3,
@@ -137,36 +146,54 @@ prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 	struct mlx5_wqe_data_seg *input_seg =
 		(struct mlx5_wqe_data_seg *)(wqe +
 					     MLX5_REGEX_WQE_GATHER_OFFSET);
-	input_seg->byte_count =
-		rte_cpu_to_be_32(rte_pktmbuf_data_len(op->mbuf));
-	input_seg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(op->mbuf,
-							    uintptr_t));
-	input_seg->lkey = lkey;
+	input_seg->byte_count = rte_cpu_to_be_32(klm->byte_count);
+	input_seg->addr = rte_cpu_to_be_64(klm->address);
+	input_seg->lkey = klm->mkey;
 	job->user_id = op->user_id;
+}
+
+static inline void
+prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
+	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
+	 struct mlx5_regex_job *job)
+{
+	struct mlx5_klm klm;
+
+	klm.byte_count = rte_pktmbuf_data_len(op->mbuf);
+	klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
+				  &priv->mr_scache, &qp->mr_ctrl,
+				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
+				  !!(op->mbuf->ol_flags & EXT_ATTACHED_MBUF));
+	klm.address = rte_pktmbuf_mtod(op->mbuf, uintptr_t);
+	__prep_one(priv, sq, op, job, sq->pi, &klm);
 	sq->db_pi = sq->pi;
 	sq->pi = (sq->pi + 1) & MLX5_REGEX_MAX_WQE_INDEX;
 }
 
 static inline void
-send_doorbell(struct mlx5dv_devx_uar *uar, struct mlx5_regex_sq *sq)
+send_doorbell(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq)
 {
+	struct mlx5dv_devx_uar *uar = priv->uar;
 	size_t wqe_offset = (sq->db_pi & (sq_size_get(sq) - 1)) *
-		MLX5_SEND_WQE_BB;
+		(MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
+		(priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
 	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
-	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
+	/* OR into fm_ce_se instead of assigning, so the fence is not cleared. */
+	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se |= MLX5_WQE_CTRL_CQ_UPDATE;
 	uint64_t *doorbell_addr =
 		(uint64_t *)((uint8_t *)uar->base_addr + 0x800);
 	rte_io_wmb();
-	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((sq->db_pi + 1) &
-						 MLX5_REGEX_MAX_WQE_INDEX);
+	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((priv->has_umr ?
+					(sq->db_pi * 4 + 3) : sq->db_pi) &
+					MLX5_REGEX_MAX_WQE_INDEX);
 	rte_wmb();
 	*doorbell_addr = *(volatile uint64_t *)wqe;
 	rte_wmb();
 }
 
 static inline int
-can_send(struct mlx5_regex_sq *sq) {
-	return ((uint16_t)(sq->pi - sq->ci) < sq_size_get(sq));
+get_free(struct mlx5_regex_sq *sq) {
+	return (sq_size_get(sq) - (uint16_t)(sq->pi - sq->ci));
 }
 
 static inline uint32_t
@@ -174,6 +201,211 @@ job_id_get(uint32_t qid, size_t sq_size, size_t index) {
 	return qid * sq_size + (index & (sq_size - 1));
 }
 
+#ifdef HAVE_MLX5_UMR_IMKEY
+static inline int
+mkey_klm_available(struct mlx5_klm *klm, uint32_t pos, uint32_t new)
+{
+	return (klm && ((pos + new) <= MLX5_REGEX_MAX_KLM_NUM));
+}
+
+static inline void
+complete_umr_wqe(struct mlx5_regex_qp *qp, struct mlx5_regex_sq *sq,
+		 struct mlx5_regex_job *mkey_job,
+		 size_t umr_index, uint32_t klm_size, uint32_t total_len)
+{
+	size_t wqe_offset = (umr_index & (sq_size_get(sq) - 1)) *
+		(MLX5_SEND_WQE_BB * 4);
+	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg *)((uint8_t *)
+				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
+	struct mlx5_wqe_umr_ctrl_seg *ucseg =
+				(struct mlx5_wqe_umr_ctrl_seg *)(wqe + 1);
+	struct mlx5_wqe_mkey_context_seg *mkc =
+				(struct mlx5_wqe_mkey_context_seg *)(ucseg + 1);
+	struct mlx5_klm *iklm = (struct mlx5_klm *)(mkc + 1);
+	uint16_t klm_align = RTE_ALIGN(klm_size, 4);
+
+	memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
+	/* Set WQE control seg. Non-inline KLM UMR WQE size must be 9 WQE_DS. */
+	set_wqe_ctrl_seg(wqe, (umr_index * 4), MLX5_OPCODE_UMR,
+			 0, sq->sq_obj.sq->id, 0, 9, 0,
+			 rte_cpu_to_be_32(mkey_job->imkey->id));
+	/* Set UMR WQE control seg. */
+	ucseg->mkey_mask |= rte_cpu_to_be_64(MLX5_WQE_UMR_CTRL_MKEY_MASK_LEN |
+				MLX5_WQE_UMR_CTRL_FLAG_TRNSLATION_OFFSET |
+				MLX5_WQE_UMR_CTRL_MKEY_MASK_ACCESS_LOCAL_WRITE);
+	ucseg->klm_octowords = rte_cpu_to_be_16(klm_align);
+	/* Set mkey context seg. */
+	mkc->len = rte_cpu_to_be_64(total_len);
+	mkc->qpn_mkey = rte_cpu_to_be_32(0xffffff00 |
+					(mkey_job->imkey->id & 0xff));
+	/* Set UMR pointer to data seg. */
+	iklm->address = rte_cpu_to_be_64
+				((uintptr_t)((char *)mkey_job->imkey_array));
+	iklm->mkey = rte_cpu_to_be_32(qp->imkey_addr->lkey);
+	iklm->byte_count = rte_cpu_to_be_32(klm_align);
+	/* Clear the padding memory. */
+	memset((uint8_t *)&mkey_job->imkey_array[klm_size], 0,
+	       sizeof(struct mlx5_klm) * (klm_align - klm_size));
+
+	/* Add the following RegEx WQE with fence. */
+	wqe = (struct mlx5_wqe_ctrl_seg *)
+				(((uint8_t *)wqe) + MLX5_REGEX_UMR_WQE_SIZE);
+	wqe->fm_ce_se |= MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE;
+}
+
+static inline void
+prep_nop_regex_wqe_set(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
+		       struct rte_regex_ops *op, struct mlx5_regex_job *job,
+		       size_t pi, struct mlx5_klm *klm)
+{
+	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
+			    (MLX5_SEND_WQE_BB << 2);
+	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg *)((uint8_t *)
+				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
+
+	/* Clear the WQE memory used as UMR WQE previously. */
+	if ((rte_be_to_cpu_32(wqe->opmod_idx_opcode) & 0xff) != MLX5_OPCODE_NOP)
+		memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
+	/* UMR WQE size is 9 DS, align nop WQE to 3 WQEBBS(12 DS). */
+	set_wqe_ctrl_seg(wqe, pi * 4, MLX5_OPCODE_NOP, 0, sq->sq_obj.sq->id,
+			 0, 12, 0, 0);
+	__prep_one(priv, sq, op, job, pi, klm);
+}
+
+static inline void
+prep_regex_umr_wqe_set(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
+	 struct mlx5_regex_sq *sq, struct rte_regex_ops **op, size_t nb_ops)
+{
+	struct mlx5_regex_job *job = NULL;
+	size_t sqid = sq->sqn, mkey_job_id = 0;
+	size_t left_ops = nb_ops;
+	uint32_t klm_num = 0, len;
+	struct mlx5_klm *mkey_klm = NULL;
+	struct mlx5_klm klm;
+
+	sqid = sq->sqn;
+	while (left_ops--)
+		rte_prefetch0(op[left_ops]);
+	left_ops = nb_ops;
+	/*
+	 * Build the WQE sets in reverse. A burst may consume multiple
+	 * mkeys, and building the WQE sets in forward order would make
+	 * it hard to address the last mkey index, since the last RegEx
+	 * WQE's index is only known once building has finished.
+	 */
+	while (left_ops--) {
+		struct rte_mbuf *mbuf = op[left_ops]->mbuf;
+		size_t pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, left_ops);
+
+		if (mbuf->nb_segs > 1) {
+			size_t scatter_size = 0;
+
+			if (!mkey_klm_available(mkey_klm, klm_num,
+						mbuf->nb_segs)) {
+				/*
+				 * The mkey's KLM is full, create the UMR
+				 * WQE in the next WQE set.
+				 */
+				if (mkey_klm)
+					complete_umr_wqe(qp, sq,
+						&qp->jobs[mkey_job_id],
+						MLX5_REGEX_UMR_SQ_PI_IDX(pi, 1),
+						klm_num, len);
+				/*
+				 * Get the indirect mkey and KLM array index
+				 * from the last WQE set.
+				 */
+				mkey_job_id = job_id_get(sqid,
+							 sq_size_get(sq), pi);
+				mkey_klm = qp->jobs[mkey_job_id].imkey_array;
+				klm_num = 0;
+				len = 0;
+			}
+			/* Build RegEx WQE's data segment KLM. */
+			klm.address = len;
+			klm.mkey = rte_cpu_to_be_32
+					(qp->jobs[mkey_job_id].imkey->id);
+			while (mbuf) {
+				/* Build indirect mkey seg's KLM. */
+				mkey_klm->mkey = mlx5_mr_addr2mr_bh(priv->pd,
+					NULL, &priv->mr_scache, &qp->mr_ctrl,
+					rte_pktmbuf_mtod(mbuf, uintptr_t),
+					!!(mbuf->ol_flags & EXT_ATTACHED_MBUF));
+				mkey_klm->address = rte_cpu_to_be_64
+					(rte_pktmbuf_mtod(mbuf, uintptr_t));
+				mkey_klm->byte_count = rte_cpu_to_be_32
+						(rte_pktmbuf_data_len(mbuf));
+				/*
+				 * Save the mbuf's total size for RegEx data
+				 * segment.
+				 */
+				scatter_size += rte_pktmbuf_data_len(mbuf);
+				mkey_klm++;
+				klm_num++;
+				mbuf = mbuf->next;
+			}
+			len += scatter_size;
+			klm.byte_count = scatter_size;
+		} else {
+			/* The single mbuf case. Build the KLM directly. */
+			klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, NULL,
+					&priv->mr_scache, &qp->mr_ctrl,
+					rte_pktmbuf_mtod(mbuf, uintptr_t),
+					!!(mbuf->ol_flags & EXT_ATTACHED_MBUF));
+			klm.address = rte_pktmbuf_mtod(mbuf, uintptr_t);
+			klm.byte_count = rte_pktmbuf_data_len(mbuf);
+		}
+		job = &qp->jobs[job_id_get(sqid, sq_size_get(sq), pi)];
+		/*
+		 * Build the nop + RegEx WQE set by default. The first nop WQE
+		 * will be updated later as UMR WQE if a scattered mbuf exists.
+		 */
+		prep_nop_regex_wqe_set(priv, sq, op[left_ops], job, pi, &klm);
+	}
+	/*
+	 * Scattered mbuf have been added to the KLM array. Complete the build
+	 * of UMR WQE, update the first nop WQE as UMR WQE.
+	 */
+	if (mkey_klm)
+		complete_umr_wqe(qp, sq, &qp->jobs[mkey_job_id], sq->pi,
+				 klm_num, len);
+	sq->db_pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops - 1);
+	sq->pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops);
+}
+
+uint16_t
+mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
+			  struct rte_regex_ops **ops, uint16_t nb_ops)
+{
+	struct mlx5_regex_priv *priv = dev->data->dev_private;
+	struct mlx5_regex_qp *queue = &priv->qps[qp_id];
+	struct mlx5_regex_sq *sq;
+	size_t sqid, nb_left = nb_ops, nb_desc;
+
+	while ((sqid = ffs(queue->free_sqs))) {
+		sqid--; /* ffs returns 1 for bit 0 */
+		sq = &queue->sqs[sqid];
+		nb_desc = get_free(sq);
+		if (nb_desc) {
+			/* The ops be handled can't exceed nb_ops. */
+			if (nb_desc > nb_left)
+				nb_desc = nb_left;
+			else
+				queue->free_sqs &= ~(1 << sqid);
+			prep_regex_umr_wqe_set(priv, queue, sq, ops, nb_desc);
+			send_doorbell(priv, sq);
+			nb_left -= nb_desc;
+		}
+		if (!nb_left)
+			break;
+		ops += nb_desc;
+	}
+	nb_ops -= nb_left;
+	queue->pi += nb_ops;
+	return nb_ops;
+}
+#endif
+
 uint16_t
 mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 		      struct rte_regex_ops **ops, uint16_t nb_ops)
@@ -186,17 +418,17 @@ mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 	while ((sqid = ffs(queue->free_sqs))) {
 		sqid--; /* ffs returns 1 for bit 0 */
 		sq = &queue->sqs[sqid];
-		while (can_send(sq)) {
+		while (get_free(sq)) {
 			job_id = job_id_get(sqid, sq_size_get(sq), sq->pi);
 			prep_one(priv, queue, sq, ops[i], &queue->jobs[job_id]);
 			i++;
 			if (unlikely(i == nb_ops)) {
-				send_doorbell(priv->uar, sq);
+				send_doorbell(priv, sq);
 				goto out;
 			}
 		}
 		queue->free_sqs &= ~(1 << sqid);
-		send_doorbell(priv->uar, sq);
+		send_doorbell(priv, sq);
 	}
 
 out:
@@ -308,6 +540,10 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 			  MLX5_REGEX_MAX_WQE_INDEX;
 		size_t sqid = cqe->rsvd3[2];
 		struct mlx5_regex_sq *sq = &queue->sqs[sqid];
+
+		/* In UMR mode the WQE counter moves per WQE set (4 WQEBBs). */
+		if (priv->has_umr)
+			wq_counter >>= 2;
 		while (sq->ci != wq_counter) {
 			if (unlikely(i == nb_ops)) {
 				/* Return without updating cq->ci */
@@ -316,7 +552,9 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 			uint32_t job_id = job_id_get(sqid, sq_size_get(sq),
 						     sq->ci);
 			extract_result(ops[i], &queue->jobs[job_id]);
-			sq->ci = (sq->ci + 1) & MLX5_REGEX_MAX_WQE_INDEX;
+			sq->ci = (sq->ci + 1) & (priv->has_umr ?
+				 (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
+				  MLX5_REGEX_MAX_WQE_INDEX);
 			i++;
 		}
 		cq->ci = (cq->ci + 1) & 0xffffff;
@@ -331,7 +569,7 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 }
 
 static void
-setup_sqs(struct mlx5_regex_qp *queue)
+setup_sqs(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *queue)
 {
 	size_t sqid, entry;
 	uint32_t job_id;
@@ -342,6 +580,14 @@ setup_sqs(struct mlx5_regex_qp *queue)
 			job_id = sqid * sq_size_get(sq) + entry;
 			struct mlx5_regex_job *job = &queue->jobs[job_id];
 
+			/* Fill UMR WQE with NOP in advance. */
+			if (priv->has_umr) {
+				set_wqe_ctrl_seg
+					((struct mlx5_wqe_ctrl_seg *)wqe,
+					 entry * 2, MLX5_OPCODE_NOP, 0,
+					 sq->sq_obj.sq->id, 0, 12, 0, 0);
+				wqe += MLX5_REGEX_UMR_WQE_SIZE;
+			}
 			set_metadata_seg((struct mlx5_wqe_metadata_seg *)
 					 (wqe + MLX5_REGEX_WQE_METADATA_OFFSET),
 					 0, queue->metadata->lkey,
@@ -358,8 +604,9 @@ setup_sqs(struct mlx5_regex_qp *queue)
 }
 
 static int
-setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
+setup_buffers(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp)
 {
+	struct ibv_pd *pd = priv->pd;
 	uint32_t i;
 	int err;
 
@@ -395,6 +642,24 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
 		goto err_output;
 	}
 
+	if (priv->has_umr) {
+		ptr = rte_calloc(__func__, qp->nb_desc, MLX5_REGEX_KLMS_SIZE,
+				 MLX5_REGEX_KLMS_SIZE);
+		if (!ptr) {
+			err = -ENOMEM;
+			goto err_imkey;
+		}
+		qp->imkey_addr = mlx5_glue->reg_mr(pd, ptr,
+					MLX5_REGEX_KLMS_SIZE * qp->nb_desc,
+					IBV_ACCESS_LOCAL_WRITE);
+		if (!qp->imkey_addr) {
+			rte_free(ptr);
+			DRV_LOG(ERR, "Failed to register imkey array");
+			err = -EINVAL;
+			goto err_imkey;
+		}
+	}
+
 	/* distribute buffers to jobs */
 	for (i = 0; i < qp->nb_desc; i++) {
 		qp->jobs[i].output =
@@ -403,9 +668,18 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
 		qp->jobs[i].metadata =
 			(uint8_t *)qp->metadata->addr +
 			(i % qp->nb_desc) * MLX5_REGEX_METADATA_SIZE;
+		if (qp->imkey_addr)
+			qp->jobs[i].imkey_array = (struct mlx5_klm *)
+				qp->imkey_addr->addr +
+				(i % qp->nb_desc) * MLX5_REGEX_MAX_KLM_NUM;
 	}
+
 	return 0;
 
+err_imkey:
+	ptr = qp->outputs->addr;
+	rte_free(ptr);
+	mlx5_glue->dereg_mr(qp->outputs);
 err_output:
 	ptr = qp->metadata->addr;
 	rte_free(ptr);
@@ -417,23 +691,57 @@ int
 mlx5_regexdev_setup_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
 {
 	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
-	int err;
+	struct mlx5_klm klm = { 0 };
+	struct mlx5_devx_mkey_attr attr = {
+		.klm_array = &klm,
+		.klm_num = 1,
+		.umr_en = 1,
+	};
+	uint32_t i;
+	int err = 0;
 
 	qp->jobs = rte_calloc(__func__, qp->nb_desc, sizeof(*qp->jobs), 64);
 	if (!qp->jobs)
 		return -ENOMEM;
-	err = setup_buffers(qp, priv->pd);
+	err = setup_buffers(priv, qp);
 	if (err) {
 		rte_free(qp->jobs);
 		return err;
 	}
-	setup_sqs(qp);
-	return 0;
+
+	setup_sqs(priv, qp);
+
+	if (priv->has_umr) {
+#ifdef HAVE_IBV_FLOW_DV_SUPPORT
+		if (regex_get_pdn(priv->pd, &attr.pd)) {
+			err = -rte_errno;
+			DRV_LOG(ERR, "Failed to get pdn.");
+			mlx5_regexdev_teardown_fastpath(priv, qp_id);
+			return err;
+		}
+#endif
+		for (i = 0; i < qp->nb_desc; i++) {
+			attr.klm_num = MLX5_REGEX_MAX_KLM_NUM;
+			attr.klm_array = qp->jobs[i].imkey_array;
+			qp->jobs[i].imkey = mlx5_devx_cmd_mkey_create(priv->ctx,
+								      &attr);
+			if (!qp->jobs[i].imkey) {
+				err = -rte_errno;
+				DRV_LOG(ERR, "Failed to allocate imkey.");
+				mlx5_regexdev_teardown_fastpath(priv, qp_id);
+			}
+		}
+	}
+	return err;
 }
 
 static void
 free_buffers(struct mlx5_regex_qp *qp)
 {
+	if (qp->imkey_addr) {
+		mlx5_glue->dereg_mr(qp->imkey_addr);
+		rte_free(qp->imkey_addr->addr);
+	}
 	if (qp->metadata) {
 		mlx5_glue->dereg_mr(qp->metadata);
 		rte_free(qp->metadata->addr);
@@ -448,8 +756,14 @@ void
 mlx5_regexdev_teardown_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
 {
 	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
+	uint32_t i;
 
 	if (qp) {
+		for (i = 0; i < qp->nb_desc; i++) {
+			if (qp->jobs[i].imkey)
+				claim_zero(mlx5_devx_cmd_destroy
+							(qp->jobs[i].imkey));
+		}
 		free_buffers(qp);
 		if (qp->jobs)
 			rte_free(qp->jobs);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v2 3/4] app/test-regex: support scattered mbuf input
  2021-03-25  4:32 ` [dpdk-dev] [PATCH v2 0/4] " Suanming Mou
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 1/4] common/mlx5: add user memory registration bits Suanming Mou
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 2/4] regex/mlx5: add data path scattered mbuf process Suanming Mou
@ 2021-03-25  4:32   ` Suanming Mou
  2021-03-29  9:27     ` Ori Kam
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode Suanming Mou
  3 siblings, 1 reply; 36+ messages in thread
From: Suanming Mou @ 2021-03-25  4:32 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

This commit adds scattered mbuf input support to the test application
through a new --nb_segs option.
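
As a rough illustration of how the job data is split across segments, the
sketch below mirrors the t_len computation in regex_create_segmented_mbuf()
from the diff. The helper names seg_len_get() and seg_count_get() are
hypothetical and are not part of the patch; only the arithmetic is taken
from it.

```c
#include <assert.h>

/*
 * Hypothetical helper: per-segment payload length, i.e.
 * ceil(pkt_len / nb_segs), but never less than 1 byte.
 */
static int
seg_len_get(int pkt_len, int nb_segs)
{
	assert(pkt_len >= 1 && nb_segs >= 1);
	if (pkt_len < nb_segs)
		return 1;
	return pkt_len / nb_segs + !!(pkt_len % nb_segs);
}

/*
 * Hypothetical helper: number of segments the fill loop creates when it
 * keeps allocating segments of seg_len_get() bytes until pkt_len bytes
 * have been copied.
 */
static int
seg_count_get(int pkt_len, int nb_segs)
{
	int t_len = seg_len_get(pkt_len, nb_segs);
	int size = pkt_len;
	int n = 0;

	while (size > 0) {
		int data_len = size > t_len ? t_len : size;

		size -= data_len;
		n++;
	}
	return n;
}
```

Because the per-segment length is rounded up, the resulting chain can hold
fewer segments than requested: 9 bytes with --nb_segs 4 gives three
segments of 3 bytes each.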

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
---
 app/test-regex/main.c | 134 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 109 insertions(+), 25 deletions(-)

diff --git a/app/test-regex/main.c b/app/test-regex/main.c
index aea4fa6b88..82cffaacfa 100644
--- a/app/test-regex/main.c
+++ b/app/test-regex/main.c
@@ -35,6 +35,7 @@ enum app_args {
 	ARG_NUM_OF_ITERATIONS,
 	ARG_NUM_OF_QPS,
 	ARG_NUM_OF_LCORES,
+	ARG_NUM_OF_MBUF_SEGS,
 };
 
 struct job_ctx {
@@ -70,6 +71,7 @@ struct regex_conf {
 	char *data_buf;
 	long data_len;
 	long job_len;
+	uint32_t nb_segs;
 };
 
 static void
@@ -82,14 +84,15 @@ usage(const char *prog_name)
 		" --perf N: only outputs the performance data\n"
 		" --nb_iter N: number of iteration to run\n"
 		" --nb_qps N: number of queues to use\n"
-		" --nb_lcores N: number of lcores to use\n",
+		" --nb_lcores N: number of lcores to use\n"
+		" --nb_segs N: number of mbuf segments\n",
 		prog_name);
 }
 
 static void
 args_parse(int argc, char **argv, char *rules_file, char *data_file,
 	   uint32_t *nb_jobs, bool *perf_mode, uint32_t *nb_iterations,
-	   uint32_t *nb_qps, uint32_t *nb_lcores)
+	   uint32_t *nb_qps, uint32_t *nb_lcores, uint32_t *nb_segs)
 {
 	char **argvopt;
 	int opt;
@@ -111,6 +114,8 @@ args_parse(int argc, char **argv, char *rules_file, char *data_file,
 		{ "nb_qps", 1, 0, ARG_NUM_OF_QPS},
 		/* Number of lcores. */
 		{ "nb_lcores", 1, 0, ARG_NUM_OF_LCORES},
+		/* Number of mbuf segments. */
+		{ "nb_segs", 1, 0, ARG_NUM_OF_MBUF_SEGS},
 		/* End of options */
 		{ 0, 0, 0, 0 }
 	};
@@ -150,6 +155,9 @@ args_parse(int argc, char **argv, char *rules_file, char *data_file,
 		case ARG_NUM_OF_LCORES:
 			*nb_lcores = atoi(optarg);
 			break;
+		case ARG_NUM_OF_MBUF_SEGS:
+			*nb_segs = atoi(optarg);
+			break;
 		case ARG_HELP:
 			usage("RegEx test app");
 			break;
@@ -302,11 +310,75 @@ extbuf_free_cb(void *addr __rte_unused, void *fcb_opaque __rte_unused)
 {
 }
 
+static inline struct rte_mbuf *
+regex_create_segmented_mbuf(struct rte_mempool *mbuf_pool, int pkt_len,
+		int nb_segs, void *buf) {
+
+	struct rte_mbuf *m = NULL, *mbuf = NULL;
+	uint8_t *dst;
+	char *src = buf;
+	int data_len = 0;
+	int i, size;
+	int t_len;
+
+	if (pkt_len < 1) {
+		printf("Packet size must be 1 or more (is %d)\n", pkt_len);
+		return NULL;
+	}
+
+	if (nb_segs < 1) {
+		printf("Number of segments must be 1 or more (is %d)\n",
+				nb_segs);
+		return NULL;
+	}
+
+	t_len = pkt_len >= nb_segs ? (pkt_len / nb_segs +
+				     !!(pkt_len % nb_segs)) : 1;
+	size = pkt_len;
+
+	/* Create chained mbuf_src and fill it with buf data */
+	for (i = 0; size > 0; i++) {
+
+		m = rte_pktmbuf_alloc(mbuf_pool);
+		if (i == 0)
+			mbuf = m;
+
+		if (m == NULL) {
+			printf("Cannot create segment for source mbuf");
+			goto fail;
+		}
+
+		data_len = size > t_len ? t_len : size;
+		memset(rte_pktmbuf_mtod(m, uint8_t *), 0,
+				rte_pktmbuf_tailroom(m));
+		memcpy(rte_pktmbuf_mtod(m, uint8_t *), src, data_len);
+		dst = (uint8_t *)rte_pktmbuf_append(m, data_len);
+		if (dst == NULL) {
+			printf("Cannot append %d bytes to the mbuf\n",
+					data_len);
+			goto fail;
+		}
+
+		if (mbuf != m)
+			rte_pktmbuf_chain(mbuf, m);
+		src += data_len;
+		size -= data_len;
+
+	}
+	return mbuf;
+
+fail:
+	if (mbuf)
+		rte_pktmbuf_free(mbuf);
+	return NULL;
+}
+
 static int
 run_regex(void *args)
 {
 	struct regex_conf *rgxc = args;
 	uint32_t nb_jobs = rgxc->nb_jobs;
+	uint32_t nb_segs = rgxc->nb_segs;
 	uint32_t nb_iterations = rgxc->nb_iterations;
 	uint8_t nb_max_matches = rgxc->nb_max_matches;
 	uint32_t nb_qps = rgxc->nb_qps;
@@ -338,8 +410,12 @@ run_regex(void *args)
 	snprintf(mbuf_pool,
 		 sizeof(mbuf_pool),
 		 "mbuf_pool_%2u", qp_id_base);
-	mbuf_mp = rte_pktmbuf_pool_create(mbuf_pool, nb_jobs * nb_qps, 0,
-			0, MBUF_SIZE, rte_socket_id());
+	mbuf_mp = rte_pktmbuf_pool_create(mbuf_pool,
+			rte_align32pow2(nb_jobs * nb_qps * nb_segs),
+			0, 0, (nb_segs == 1) ? MBUF_SIZE :
+			(rte_align32pow2(job_len) / nb_segs +
+			RTE_PKTMBUF_HEADROOM),
+			rte_socket_id());
 	if (mbuf_mp == NULL) {
 		printf("Error, can't create memory pool\n");
 		return -ENOMEM;
@@ -375,8 +451,19 @@ run_regex(void *args)
 			goto end;
 		}
 
+		if (clone_buf(data_buf, &buf, data_len)) {
+			printf("Error, can't clone buf.\n");
+			res = -EXIT_FAILURE;
+			goto end;
+		}
+
+		/* Assign each mbuf with the data to handle. */
+		actual_jobs = 0;
+		pos = 0;
 		/* Allocate the jobs and assign each job with an mbuf. */
-		for (i = 0; i < nb_jobs; i++) {
+		for (i = 0; (pos < data_len) && (i < nb_jobs) ; i++) {
+			long act_job_len = RTE_MIN(job_len, data_len - pos);
+
 			ops[i] = rte_malloc(NULL, sizeof(*ops[0]) +
 					nb_max_matches *
 					sizeof(struct rte_regexdev_match), 0);
@@ -386,30 +473,26 @@ run_regex(void *args)
 				res = -ENOMEM;
 				goto end;
 			}
-			ops[i]->mbuf = rte_pktmbuf_alloc(mbuf_mp);
+			if (nb_segs > 1) {
+				ops[i]->mbuf = regex_create_segmented_mbuf
+							(mbuf_mp, act_job_len,
+							 nb_segs, &buf[pos]);
+			} else {
+				ops[i]->mbuf = rte_pktmbuf_alloc(mbuf_mp);
+				if (ops[i]->mbuf) {
+					rte_pktmbuf_attach_extbuf(ops[i]->mbuf,
+					&buf[pos], 0, act_job_len, &shinfo);
+					ops[i]->mbuf->data_len = job_len;
+					ops[i]->mbuf->pkt_len = act_job_len;
+				}
+			}
 			if (!ops[i]->mbuf) {
-				printf("Error, can't attach mbuf.\n");
+				printf("Error, can't add mbuf.\n");
 				res = -ENOMEM;
 				goto end;
 			}
-		}
 
-		if (clone_buf(data_buf, &buf, data_len)) {
-			printf("Error, can't clone buf.\n");
-			res = -EXIT_FAILURE;
-			goto end;
-		}
-
-		/* Assign each mbuf with the data to handle. */
-		actual_jobs = 0;
-		pos = 0;
-		for (i = 0; (pos < data_len) && (i < nb_jobs) ; i++) {
-			long act_job_len = RTE_MIN(job_len, data_len - pos);
-			rte_pktmbuf_attach_extbuf(ops[i]->mbuf, &buf[pos], 0,
-					act_job_len, &shinfo);
 			jobs_ctx[i].mbuf = ops[i]->mbuf;
-			ops[i]->mbuf->data_len = job_len;
-			ops[i]->mbuf->pkt_len = act_job_len;
 			ops[i]->user_id = i;
 			ops[i]->group_id0 = 1;
 			pos += act_job_len;
@@ -612,7 +695,7 @@ main(int argc, char **argv)
 	char *data_buf;
 	long data_len;
 	long job_len;
-	uint32_t nb_lcores = 1;
+	uint32_t nb_lcores = 1, nb_segs = 1;
 	struct regex_conf *rgxc;
 	uint32_t i;
 	struct qps_per_lcore *qps_per_lcore;
@@ -626,7 +709,7 @@ main(int argc, char **argv)
 	if (argc > 1)
 		args_parse(argc, argv, rules_file, data_file, &nb_jobs,
 				&perf_mode, &nb_iterations, &nb_qps,
-				&nb_lcores);
+				&nb_lcores, &nb_segs);
 
 	if (nb_qps == 0)
 		rte_exit(EXIT_FAILURE, "Number of QPs must be greater than 0\n");
@@ -656,6 +739,7 @@ main(int argc, char **argv)
 	for (i = 0; i < nb_lcores; i++) {
 		rgxc[i] = (struct regex_conf){
 			.nb_jobs = nb_jobs,
+			.nb_segs = nb_segs,
 			.perf_mode = perf_mode,
 			.nb_iterations = nb_iterations,
 			.nb_max_matches = nb_max_matches,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v2 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode
  2021-03-25  4:32 ` [dpdk-dev] [PATCH v2 0/4] " Suanming Mou
                     ` (2 preceding siblings ...)
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 3/4] app/test-regex: support scattered mbuf input Suanming Mou
@ 2021-03-25  4:32   ` Suanming Mou
  2021-03-29  9:35     ` Ori Kam
  3 siblings, 1 reply; 36+ messages in thread
From: Suanming Mou @ 2021-03-25  4:32 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland, John Hurley

From: John Hurley <jhurley@nvidia.com>

A recent change added scattered mbuf and UMR support to the regex driver.
In UMR mode, that change makes the pi and ci counters of the regex_sq a
quarter of their non-UMR range, effectively shrinking them from 16 bits
to 14. The new get_free function casts the difference between pi and ci
to a 16-bit value when calculating the free send queue entries, which
accounts for wrapping when pi has looped back to 0 but ci has not yet.
However, casting to 16 bits while the counters wrap at 14 bits can now
return corrupted, large values.

Modify the get_free function to take in the has_umr flag and, accordingly,
account for wrapping on either 14 or 16 bit pi/ci difference.
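
A minimal sketch of the arithmetic, assuming the non-UMR index mask is
0xffff (matching the 16 bits described above). The function and macro names
here are hypothetical stand-ins, not the driver's own:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed masks: 16-bit counters without UMR, 14-bit counters with UMR. */
#define WQE_INDEX_MASK_16 0xffffu
#define WQE_INDEX_MASK_14 (WQE_INDEX_MASK_16 >> 2)

/* Old behaviour: the pi/ci difference is always truncated to 16 bits. */
static uint32_t
free_slots_cast16(uint32_t sq_size, uint32_t pi, uint32_t ci)
{
	return sq_size - (uint16_t)(pi - ci);
}

/* Fixed behaviour: mask the difference to the width the counters wrap at. */
static uint32_t
free_slots_masked(uint32_t sq_size, uint32_t pi, uint32_t ci, int has_umr)
{
	uint32_t mask = has_umr ? WQE_INDEX_MASK_14 : WQE_INDEX_MASK_16;

	return sq_size - ((pi - ci) & mask);
}
```

With a 64-entry SQ in UMR mode, ci = 0x3fff and pi wrapped to 1 means two
entries are in flight. The masked version reports 62 free entries, while
the 16-bit cast sees a difference of 49154 and the subtraction underflows.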

Fixes: a20fe8e74dea ("regex/mlx5: add data path scattered mbuf process")
Signed-off-by: John Hurley <jhurley@nvidia.com>
---
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index 4f9402c583..b57e7d7794 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -192,8 +192,10 @@ send_doorbell(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq)
 }
 
 static inline int
-get_free(struct mlx5_regex_sq *sq) {
-	return (sq_size_get(sq) - (uint16_t)(sq->pi - sq->ci));
+get_free(struct mlx5_regex_sq *sq, uint8_t has_umr) {
+	return (sq_size_get(sq) - ((sq->pi - sq->ci) &
+			(has_umr ? (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
+			MLX5_REGEX_MAX_WQE_INDEX)));
 }
 
 static inline uint32_t
@@ -385,7 +387,7 @@ mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
 	while ((sqid = ffs(queue->free_sqs))) {
 		sqid--; /* ffs returns 1 for bit 0 */
 		sq = &queue->sqs[sqid];
-		nb_desc = get_free(sq);
+		nb_desc = get_free(sq, priv->has_umr);
 		if (nb_desc) {
 			/* The ops be handled can't exceed nb_ops. */
 			if (nb_desc > nb_left)
@@ -418,7 +420,7 @@ mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 	while ((sqid = ffs(queue->free_sqs))) {
 		sqid--; /* ffs returns 1 for bit 0 */
 		sq = &queue->sqs[sqid];
-		while (get_free(sq)) {
+		while (get_free(sq, priv->has_umr)) {
 			job_id = job_id_get(sqid, sq_size_get(sq), sq->pi);
 			prep_one(priv, queue, sq, ops[i], &queue->jobs[job_id]);
 			i++;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/4] app/test-regex: support scattered mbuf input
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 3/4] app/test-regex: support scattered mbuf input Suanming Mou
@ 2021-03-29  9:27     ` Ori Kam
  0 siblings, 0 replies; 36+ messages in thread
From: Ori Kam @ 2021-03-29  9:27 UTC (permalink / raw)
  To: Suanming Mou; +Cc: dev, Slava Ovsiienko, Matan Azrad, Raslan Darawsheh

Hi Mou,

> -----Original Message-----
> From: Suanming Mou <suanmingm@nvidia.com>
> 
> This commits adds the scattered mbuf input support.
> 
> Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
> ---
>  app/test-regex/main.c | 134 ++++++++++++++++++++++++++++++++++--------
>  1 file changed, 109 insertions(+), 25 deletions(-)
> 

Acked-by: Ori Kam <orika@nvidia.com>
Best,
Ori

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/4] common/mlx5: add user memory registration bits
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 1/4] common/mlx5: add user memory registration bits Suanming Mou
@ 2021-03-29  9:29     ` Ori Kam
  0 siblings, 0 replies; 36+ messages in thread
From: Ori Kam @ 2021-03-29  9:29 UTC (permalink / raw)
  To: Suanming Mou; +Cc: dev, Slava Ovsiienko, Matan Azrad, Raslan Darawsheh

Hi,

> -----Original Message-----
> From: Suanming Mou <suanmingm@nvidia.com>
> This commit adds the UMR capability bits.
> 
> Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
> ---
>  drivers/common/mlx5/linux/meson.build | 2 ++
>  drivers/common/mlx5/mlx5_devx_cmds.c  | 5 +++++
>  drivers/common/mlx5/mlx5_devx_cmds.h  | 3 +++
>  3 files changed, 10 insertions(+)
> 
> diff --git a/drivers/common/mlx5/linux/meson.build
> b/drivers/common/mlx5/linux/meson.build
> index 220de35420..5d6a861689 100644
> --- a/drivers/common/mlx5/linux/meson.build
> +++ b/drivers/common/mlx5/linux/meson.build
> @@ -186,6 +186,8 @@ has_sym_args = [
>  	'mlx5dv_dr_action_create_aso' ],
>  	[ 'HAVE_INFINIBAND_VERBS_H', 'infiniband/verbs.h',
>  	'INFINIBAND_VERBS_H' ],
> +        [ 'HAVE_MLX5_UMR_IMKEY', 'infiniband/mlx5dv.h',
> +        'MLX5_WQE_UMR_CTRL_FLAG_INLINE' ],
>  ]
>  config = configuration_data()
>  foreach arg:has_sym_args
> diff --git a/drivers/common/mlx5/mlx5_devx_cmds.c
> b/drivers/common/mlx5/mlx5_devx_cmds.c
> index c90e020643..268bcd0d99 100644
> --- a/drivers/common/mlx5/mlx5_devx_cmds.c
> +++ b/drivers/common/mlx5/mlx5_devx_cmds.c
> @@ -266,6 +266,7 @@ mlx5_devx_cmd_mkey_create(void *ctx,
>  	MLX5_SET(mkc, mkc, qpn, 0xffffff);
>  	MLX5_SET(mkc, mkc, pd, attr->pd);
>  	MLX5_SET(mkc, mkc, mkey_7_0, attr->umem_id & 0xFF);
> +	MLX5_SET(mkc, mkc, umr_en, attr->umr_en);
>  	MLX5_SET(mkc, mkc, translations_octword_size, translation_size);
>  	MLX5_SET(mkc, mkc, relaxed_ordering_write,
>  		 attr->relaxed_ordering_write);
> @@ -752,6 +753,10 @@ mlx5_devx_cmd_query_hca_attr(void *ctx,
>  						mini_cqe_resp_flow_tag);
>  	attr->mini_cqe_resp_l3_l4_tag = MLX5_GET(cmd_hca_cap, hcattr,
>  						 mini_cqe_resp_l3_l4_tag);
> +	attr->umr_indirect_mkey_disabled =
> +		MLX5_GET(cmd_hca_cap, hcattr,
> umr_indirect_mkey_disabled);
> +	attr->umr_modify_entity_size_disabled =
> +		MLX5_GET(cmd_hca_cap, hcattr,
> umr_modify_entity_size_disabled);
>  	if (attr->qos.sup) {
>  		MLX5_SET(query_hca_cap_in, in, op_mod,
>  			 MLX5_GET_HCA_CAP_OP_MOD_QOS_CAP |
> diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h
> b/drivers/common/mlx5/mlx5_devx_cmds.h
> index 2826c0b2c6..67b5f771c6 100644
> --- a/drivers/common/mlx5/mlx5_devx_cmds.h
> +++ b/drivers/common/mlx5/mlx5_devx_cmds.h
> @@ -31,6 +31,7 @@ struct mlx5_devx_mkey_attr {
>  	uint32_t pg_access:1;
>  	uint32_t relaxed_ordering_write:1;
>  	uint32_t relaxed_ordering_read:1;
> +	uint32_t umr_en:1;
>  	struct mlx5_klm *klm_array;
>  	int klm_num;
>  };
> @@ -151,6 +152,8 @@ struct mlx5_hca_attr {
>  	uint32_t log_max_mmo_dma:5;
>  	uint32_t log_max_mmo_compress:5;
>  	uint32_t log_max_mmo_decompress:5;
> +	uint32_t umr_modify_entity_size_disabled:1;
> +	uint32_t umr_indirect_mkey_disabled:1;
>  };
> 
>  struct mlx5_devx_wq_attr {
> --
> 2.25.1

Acked-by: Ori Kam <orika@nvidia.com>
Best,
Ori

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/4] regex/mlx5: add data path scattered mbuf process
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 2/4] regex/mlx5: add data path scattered mbuf process Suanming Mou
@ 2021-03-29  9:34     ` Ori Kam
  2021-03-29  9:52       ` Suanming Mou
  0 siblings, 1 reply; 36+ messages in thread
From: Ori Kam @ 2021-03-29  9:34 UTC (permalink / raw)
  To: Suanming Mou; +Cc: dev, Slava Ovsiienko, Matan Azrad, Raslan Darawsheh

Hi Mou,

Please see below for one small comment.
Please update and feel free to add my ack.

> -----Original Message-----
> From: Suanming Mou <suanmingm@nvidia.com>
> 
> UMR WQE can convert multiple mkey's memory space to contiguous space.
> Take advantage of the UMR WQE, scattered mbuf in one operation can be
> converted to an indirect mkey. The RegEx which only accepts one mkey
> can now process the whole scattered mbuf.
> 
> The maximum number of scattered mbufs supported in one UMR WQE is now
> defined as 64. Scattered mbufs from multiple operations can be added to
> one UMR WQE if there is enough space in the KLM array, since the
> operations can address their own mbuf's content by the mkey's address
> and length. However, one operation's scattered mbufs can't be placed in
> two different UMR WQEs' KLM arrays; if the UMR WQE's KLM array does not
> have enough free space for one operation, a new UMR WQE is required.
> 
> In case the UMR WQE's indirect mkey would be overwritten as the SQ's
> WQEs wrap around, the mkey index used by the UMR WQE should be the index
> of the last RegEx WQE in the operations. As one operation consumes one
> WQE set, building the RegEx WQEs in reverse helps address the mkey more
> efficiently. When the operations in one burst consume multiple mkeys and
> the mkey KLM array becomes full, the reverse WQE set index will always
> be the last of the new mkey's for the new UMR WQE.
> 
> In GGA mode, the SQ WQE memory layout becomes interleaved UMR/NOP and
> RegEx WQEs. A UMR/NOP WQE plus a RegEx WQE is called a WQE set. The SQ's
> pi and ci are also advanced per WQE set rather than per WQE.
> 
> For operations that don't have a scattered mbuf, the mbuf's mkey is used
> directly and the WQE set combination is NOP + RegEx.
> For operations that have a scattered mbuf but share the UMR WQE with
> others, the WQE set combination is NOP + RegEx.
> For operations that complete the UMR WQE, the WQE set combination is
> UMR + RegEx.
> 
> Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
> ---
>  doc/guides/regexdevs/mlx5.rst            |   5 +
>  doc/guides/rel_notes/release_21_05.rst   |   4 +
>  doc/guides/tools/testregex.rst           |   3 +
>  drivers/regex/mlx5/mlx5_regex.c          |   9 +
>  drivers/regex/mlx5/mlx5_regex.h          |  26 +-
>  drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
>  drivers/regex/mlx5/mlx5_regex_fastpath.c | 378 +++++++++++++++++++++--
>  7 files changed, 410 insertions(+), 58 deletions(-)
> 
> diff --git a/doc/guides/regexdevs/mlx5.rst b/doc/guides/regexdevs/mlx5.rst
> index faaa6ac11d..45a0b96980 100644
> --- a/doc/guides/regexdevs/mlx5.rst
> +++ b/doc/guides/regexdevs/mlx5.rst
> @@ -35,6 +35,11 @@ be specified as device parameter. The RegEx device can
> be probed and used with
>  other Mellanox devices, by adding more options in the class.
>  For example: ``class=net:regex`` will probe both the net PMD and the RegEx
> PMD.
> 
> +Features
> +--------
> +
> +- Multi segments mbuf support.
> +
>  Supported NICs
>  --------------
> 
> diff --git a/doc/guides/rel_notes/release_21_05.rst
> b/doc/guides/rel_notes/release_21_05.rst
> index 3d4b061686..281d4aaa64 100644
> --- a/doc/guides/rel_notes/release_21_05.rst
> +++ b/doc/guides/rel_notes/release_21_05.rst
> @@ -113,6 +113,10 @@ New Features
>    * Added command to display Rx queue used descriptor count.
>      ``show port (port_id) rxq (queue_id) desc used count``
> 
> +* **Updated Mellanox RegEx PMD.**
> +
> +  * Added support for multi segments mbuf.
> +
> 
>  Removed Items
>  -------------
> diff --git a/doc/guides/tools/testregex.rst b/doc/guides/tools/testregex.rst
> index a59acd919f..cdb1ffd6ee 100644
> --- a/doc/guides/tools/testregex.rst
> +++ b/doc/guides/tools/testregex.rst
> @@ -68,6 +68,9 @@ Application Options
>  ``--nb_iter N``
>    number of iteration to run
> 
> +``--nb_segs N``
> +  number of mbuf segment
> +
>  ``--help``
>    print application options

I don't think this documentation change is part of this patch.
It should belong to the app/test-regex patch.

> diff --git a/drivers/regex/mlx5/mlx5_regex.c
> b/drivers/regex/mlx5/mlx5_regex.c
> index ac5b205fa9..82c485e50c 100644
> --- a/drivers/regex/mlx5/mlx5_regex.c
> +++ b/drivers/regex/mlx5/mlx5_regex.c
> @@ -199,6 +199,13 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv
> __rte_unused,
>  	}
>  	priv->regexdev->dev_ops = &mlx5_regexdev_ops;
>  	priv->regexdev->enqueue = mlx5_regexdev_enqueue;
> +#ifdef HAVE_MLX5_UMR_IMKEY
> +	if (!attr.umr_indirect_mkey_disabled &&
> +	    !attr.umr_modify_entity_size_disabled)
> +		priv->has_umr = 1;
> +	if (priv->has_umr)
> +		priv->regexdev->enqueue = mlx5_regexdev_enqueue_gga;
> +#endif
>  	priv->regexdev->dequeue = mlx5_regexdev_dequeue;
>  	priv->regexdev->device = (struct rte_device *)pci_dev;
>  	priv->regexdev->data->dev_private = priv;
> @@ -213,6 +220,8 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv
> __rte_unused,
>  	    rte_errno = ENOMEM;
>  		goto error;
>  	}
> +	DRV_LOG(INFO, "RegEx GGA is %s.",
> +		priv->has_umr ? "supported" : "unsupported");
>  	return 0;
> 
>  error:
> diff --git a/drivers/regex/mlx5/mlx5_regex.h
> b/drivers/regex/mlx5/mlx5_regex.h
> index a2b3f0d9f3..51a2101e53 100644
> --- a/drivers/regex/mlx5/mlx5_regex.h
> +++ b/drivers/regex/mlx5/mlx5_regex.h
> @@ -15,6 +15,7 @@
>  #include <mlx5_common_devx.h>
> 
>  #include "mlx5_rxp.h"
> +#include "mlx5_regex_utils.h"
> 
>  struct mlx5_regex_sq {
>  	uint16_t log_nb_desc; /* Log 2 number of desc for this object. */
> @@ -40,6 +41,7 @@ struct mlx5_regex_qp {
>  	struct mlx5_regex_job *jobs;
>  	struct ibv_mr *metadata;
>  	struct ibv_mr *outputs;
> +	struct ibv_mr *imkey_addr; /* Indirect mkey array region. */
>  	size_t ci, pi;
>  	struct mlx5_mr_ctrl mr_ctrl;
>  };
> @@ -71,8 +73,29 @@ struct mlx5_regex_priv {
>  	struct mlx5_mr_share_cache mr_scache; /* Global shared MR cache.
> */
>  	uint8_t is_bf2; /* The device is BF2 device. */
>  	uint8_t sq_ts_format; /* Whether SQ supports timestamp formats. */
> +	uint8_t has_umr; /* The device supports UMR. */
>  };
> 
> +#ifdef HAVE_IBV_FLOW_DV_SUPPORT
> +static inline int
> +regex_get_pdn(void *pd, uint32_t *pdn)
> +{
> +	struct mlx5dv_obj obj;
> +	struct mlx5dv_pd pd_info;
> +	int ret = 0;
> +
> +	obj.pd.in = pd;
> +	obj.pd.out = &pd_info;
> +	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
> +	if (ret) {
> +		DRV_LOG(DEBUG, "Fail to get PD object info");
> +		return ret;
> +	}
> +	*pdn = pd_info.pdn;
> +	return 0;
> +}
> +#endif
> +
>  /* mlx5_regex.c */
>  int mlx5_regex_start(struct rte_regexdev *dev);
>  int mlx5_regex_stop(struct rte_regexdev *dev);
> @@ -108,5 +131,6 @@ uint16_t mlx5_regexdev_enqueue(struct rte_regexdev
> *dev, uint16_t qp_id,
>  		       struct rte_regex_ops **ops, uint16_t nb_ops);
>  uint16_t mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
>  		       struct rte_regex_ops **ops, uint16_t nb_ops);
> -
> +uint16_t mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t
> qp_id,
> +		       struct rte_regex_ops **ops, uint16_t nb_ops);
>  #endif /* MLX5_REGEX_H */
> diff --git a/drivers/regex/mlx5/mlx5_regex_control.c
> b/drivers/regex/mlx5/mlx5_regex_control.c
> index 55fbb419ed..eef0fe579d 100644
> --- a/drivers/regex/mlx5/mlx5_regex_control.c
> +++ b/drivers/regex/mlx5/mlx5_regex_control.c
> @@ -27,6 +27,9 @@
> 
>  #define MLX5_REGEX_NUM_WQE_PER_PAGE (4096/64)
> 
> +#define MLX5_REGEX_WQE_LOG_NUM(has_umr, log_desc) \
> +		((has_umr) ? ((log_desc) + 2) : (log_desc))
> +
>  /**
>   * Returns the number of qp obj to be created.
>   *
> @@ -91,26 +94,6 @@ regex_ctrl_create_cq(struct mlx5_regex_priv *priv,
> struct mlx5_regex_cq *cq)
>  	return 0;
>  }
> 
> -#ifdef HAVE_IBV_FLOW_DV_SUPPORT
> -static int
> -regex_get_pdn(void *pd, uint32_t *pdn)
> -{
> -	struct mlx5dv_obj obj;
> -	struct mlx5dv_pd pd_info;
> -	int ret = 0;
> -
> -	obj.pd.in = pd;
> -	obj.pd.out = &pd_info;
> -	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
> -	if (ret) {
> -		DRV_LOG(DEBUG, "Fail to get PD object info");
> -		return ret;
> -	}
> -	*pdn = pd_info.pdn;
> -	return 0;
> -}
> -#endif
> -
>  /**
>   * Destroy the SQ object.
>   *
> @@ -168,14 +151,16 @@ regex_ctrl_create_sq(struct mlx5_regex_priv *priv,
> struct mlx5_regex_qp *qp,
>  	int ret;
> 
>  	sq->log_nb_desc = log_nb_desc;
> +	sq->sqn = q_ind;
>  	sq->ci = 0;
>  	sq->pi = 0;
>  	ret = regex_get_pdn(priv->pd, &pd_num);
>  	if (ret)
>  		return ret;
>  	attr.wq_attr.pd = pd_num;
> -	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj, log_nb_desc, &attr,
> -				  SOCKET_ID_ANY);
> +	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj,
> +			MLX5_REGEX_WQE_LOG_NUM(priv->has_umr,
> log_nb_desc),
> +			&attr, SOCKET_ID_ANY);
>  	if (ret) {
>  		DRV_LOG(ERR, "Can't create SQ object.");
>  		rte_errno = ENOMEM;
> @@ -225,10 +210,18 @@ mlx5_regex_qp_setup(struct rte_regexdev *dev,
> uint16_t qp_ind,
> 
>  	qp = &priv->qps[qp_ind];
>  	qp->flags = cfg->qp_conf_flags;
> -	qp->cq.log_nb_desc = rte_log2_u32(cfg->nb_desc);
> -	qp->nb_desc = 1 << qp->cq.log_nb_desc;
> +	log_desc = rte_log2_u32(cfg->nb_desc);
> +	/*
> +	 * UMR mode requires two WQEs(UMR and RegEx WQE) for one
> descriptor.
> +	 * For CQ, expand the CQE number multiple with 2.
> +	 * For SQ, the UMR and RegEx WQE for one descriptor consumes 4
> WQEBBS,
> +	 * expand the WQE number multiple with 4.
> +	 */
> +	qp->cq.log_nb_desc = log_desc + (!!priv->has_umr);
> +	qp->nb_desc = 1 << log_desc;
>  	if (qp->flags & RTE_REGEX_QUEUE_PAIR_CFG_OOS_F)
> -		qp->nb_obj = regex_ctrl_get_nb_obj(qp->nb_desc);
> +		qp->nb_obj = regex_ctrl_get_nb_obj
> +			(1 << MLX5_REGEX_WQE_LOG_NUM(priv->has_umr,
> log_desc));
>  	else
>  		qp->nb_obj = 1;
>  	qp->sqs = rte_malloc(NULL,
> diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c
> b/drivers/regex/mlx5/mlx5_regex_fastpath.c
> index beaea7b63f..4f9402c583 100644
> --- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
> +++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
> @@ -32,6 +32,15 @@
>  #define MLX5_REGEX_WQE_GATHER_OFFSET 32
>  #define MLX5_REGEX_WQE_SCATTER_OFFSET 48
>  #define MLX5_REGEX_METADATA_OFF 32
> +#define MLX5_REGEX_UMR_WQE_SIZE 192
> +/* The maximum KLMs can be added to one UMR indirect mkey. */
> +#define MLX5_REGEX_MAX_KLM_NUM 128
> +/* The KLM array size for one job. */
> +#define MLX5_REGEX_KLMS_SIZE \
> +	((MLX5_REGEX_MAX_KLM_NUM) * sizeof(struct mlx5_klm))
> +/* In WQE set mode, the pi should be quarter of the
> MLX5_REGEX_MAX_WQE_INDEX. */
> +#define MLX5_REGEX_UMR_SQ_PI_IDX(pi, ops) \
> +	(((pi) + (ops)) & (MLX5_REGEX_MAX_WQE_INDEX >> 2))
> 
>  static inline uint32_t
>  sq_size_get(struct mlx5_regex_sq *sq)
> @@ -49,6 +58,8 @@ struct mlx5_regex_job {
>  	uint64_t user_id;
>  	volatile uint8_t *output;
>  	volatile uint8_t *metadata;
> +	struct mlx5_klm *imkey_array; /* Indirect mkey's KLM array. */
> +	struct mlx5_devx_obj *imkey; /* UMR WQE's indirect meky. */
>  } __rte_cached_aligned;
> 
>  static inline void
> @@ -99,12 +110,13 @@ set_wqe_ctrl_seg(struct mlx5_wqe_ctrl_seg *seg,
> uint16_t pi, uint8_t opcode,
>  }
> 
>  static inline void
> -prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
> -	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
> -	 struct mlx5_regex_job *job)
> +__prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
> +	   struct rte_regex_ops *op, struct mlx5_regex_job *job,
> +	   size_t pi, struct mlx5_klm *klm)
>  {
> -	size_t wqe_offset = (sq->pi & (sq_size_get(sq) - 1)) *
> MLX5_SEND_WQE_BB;
> -	uint32_t lkey;
> +	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
> +			    (MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
> +			    (priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE :
> 0);
>  	uint16_t group0 = op->req_flags &
> RTE_REGEX_OPS_REQ_GROUP_ID0_VALID_F ?
>  				op->group_id0 : 0;
>  	uint16_t group1 = op->req_flags &
> RTE_REGEX_OPS_REQ_GROUP_ID1_VALID_F ?
> @@ -122,14 +134,11 @@ prep_one(struct mlx5_regex_priv *priv, struct
> mlx5_regex_qp *qp,
>  			       RTE_REGEX_OPS_REQ_GROUP_ID2_VALID_F |
>  			       RTE_REGEX_OPS_REQ_GROUP_ID3_VALID_F)))
>  		group0 = op->group_id0;
> -	lkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
> -				  &priv->mr_scache, &qp->mr_ctrl,
> -				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
> -				  !!(op->mbuf->ol_flags &
> EXT_ATTACHED_MBUF));
>  	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
>  	int ds = 4; /*  ctrl + meta + input + output */
> 
> -	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe, sq->pi,
> +	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe,
> +			 (priv->has_umr ? (pi * 4 + 3) : pi),
>  			 MLX5_OPCODE_MMO,
> MLX5_OPC_MOD_MMO_REGEX,
>  			 sq->sq_obj.sq->id, 0, ds, 0, 0);
>  	set_regex_ctrl_seg(wqe + 12, 0, group0, group1, group2, group3,
> @@ -137,36 +146,54 @@ prep_one(struct mlx5_regex_priv *priv, struct
> mlx5_regex_qp *qp,
>  	struct mlx5_wqe_data_seg *input_seg =
>  		(struct mlx5_wqe_data_seg *)(wqe +
> 
> MLX5_REGEX_WQE_GATHER_OFFSET);
> -	input_seg->byte_count =
> -		rte_cpu_to_be_32(rte_pktmbuf_data_len(op->mbuf));
> -	input_seg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(op->mbuf,
> -							    uintptr_t));
> -	input_seg->lkey = lkey;
> +	input_seg->byte_count = rte_cpu_to_be_32(klm->byte_count);
> +	input_seg->addr = rte_cpu_to_be_64(klm->address);
> +	input_seg->lkey = klm->mkey;
>  	job->user_id = op->user_id;
> +}
> +
> +static inline void
> +prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
> +	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
> +	 struct mlx5_regex_job *job)
> +{
> +	struct mlx5_klm klm;
> +
> +	klm.byte_count = rte_pktmbuf_data_len(op->mbuf);
> +	klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
> +				  &priv->mr_scache, &qp->mr_ctrl,
> +				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
> +				  !!(op->mbuf->ol_flags &
> EXT_ATTACHED_MBUF));
> +	klm.address = rte_pktmbuf_mtod(op->mbuf, uintptr_t);
> +	__prep_one(priv, sq, op, job, sq->pi, &klm);
>  	sq->db_pi = sq->pi;
>  	sq->pi = (sq->pi + 1) & MLX5_REGEX_MAX_WQE_INDEX;
>  }
> 
>  static inline void
> -send_doorbell(struct mlx5dv_devx_uar *uar, struct mlx5_regex_sq *sq)
> +send_doorbell(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq)
>  {
> +	struct mlx5dv_devx_uar *uar = priv->uar;
>  	size_t wqe_offset = (sq->db_pi & (sq_size_get(sq) - 1)) *
> -		MLX5_SEND_WQE_BB;
> +		(MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
> +		(priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
>  	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
> -	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se =
> MLX5_WQE_CTRL_CQ_UPDATE;
> +	/* Or the fm_ce_se instead of set, avoid the fence be cleared. */
> +	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se |=
> MLX5_WQE_CTRL_CQ_UPDATE;
>  	uint64_t *doorbell_addr =
>  		(uint64_t *)((uint8_t *)uar->base_addr + 0x800);
>  	rte_io_wmb();
> -	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((sq->db_pi +
> 1) &
> -
> MLX5_REGEX_MAX_WQE_INDEX);
> +	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((priv-
> >has_umr ?
> +					(sq->db_pi * 4 + 3) : sq->db_pi) &
> +					MLX5_REGEX_MAX_WQE_INDEX);
>  	rte_wmb();
>  	*doorbell_addr = *(volatile uint64_t *)wqe;
>  	rte_wmb();
>  }
> 
>  static inline int
> -can_send(struct mlx5_regex_sq *sq) {
> -	return ((uint16_t)(sq->pi - sq->ci) < sq_size_get(sq));
> +get_free(struct mlx5_regex_sq *sq) {
> +	return (sq_size_get(sq) - (uint16_t)(sq->pi - sq->ci));
>  }
> 
>  static inline uint32_t
> @@ -174,6 +201,211 @@ job_id_get(uint32_t qid, size_t sq_size, size_t index) {
>  	return qid * sq_size + (index & (sq_size - 1));
>  }
> 
> +#ifdef HAVE_MLX5_UMR_IMKEY
> +static inline int
> +mkey_klm_available(struct mlx5_klm *klm, uint32_t pos, uint32_t new)
> +{
> +	return (klm && ((pos + new) <= MLX5_REGEX_MAX_KLM_NUM));
> +}
> +
> +static inline void
> +complete_umr_wqe(struct mlx5_regex_qp *qp, struct mlx5_regex_sq *sq,
> +		 struct mlx5_regex_job *mkey_job,
> +		 size_t umr_index, uint32_t klm_size, uint32_t total_len)
> +{
> +	size_t wqe_offset = (umr_index & (sq_size_get(sq) - 1)) *
> +		(MLX5_SEND_WQE_BB * 4);
> +	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg
> *)((uint8_t *)
> +				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
> +	struct mlx5_wqe_umr_ctrl_seg *ucseg =
> +				(struct mlx5_wqe_umr_ctrl_seg *)(wqe + 1);
> +	struct mlx5_wqe_mkey_context_seg *mkc =
> +				(struct mlx5_wqe_mkey_context_seg *)(ucseg
> + 1);
> +	struct mlx5_klm *iklm = (struct mlx5_klm *)(mkc + 1);
> +	uint16_t klm_align = RTE_ALIGN(klm_size, 4);
> +
> +	memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
> +	/* Set WQE control seg. Non-inline KLM UMR WQE size must be 9
> WQE_DS. */
> +	set_wqe_ctrl_seg(wqe, (umr_index * 4), MLX5_OPCODE_UMR,
> +			 0, sq->sq_obj.sq->id, 0, 9, 0,
> +			 rte_cpu_to_be_32(mkey_job->imkey->id));
> +	/* Set UMR WQE control seg. */
> +	ucseg->mkey_mask |=
> rte_cpu_to_be_64(MLX5_WQE_UMR_CTRL_MKEY_MASK_LEN |
> +
> 	MLX5_WQE_UMR_CTRL_FLAG_TRNSLATION_OFFSET |
> +
> 	MLX5_WQE_UMR_CTRL_MKEY_MASK_ACCESS_LOCAL_WRITE);
> +	ucseg->klm_octowords = rte_cpu_to_be_16(klm_align);
> +	/* Set mkey context seg. */
> +	mkc->len = rte_cpu_to_be_64(total_len);
> +	mkc->qpn_mkey = rte_cpu_to_be_32(0xffffff00 |
> +					(mkey_job->imkey->id & 0xff));
> +	/* Set UMR pointer to data seg. */
> +	iklm->address = rte_cpu_to_be_64
> +				((uintptr_t)((char *)mkey_job->imkey_array));
> +	iklm->mkey = rte_cpu_to_be_32(qp->imkey_addr->lkey);
> +	iklm->byte_count = rte_cpu_to_be_32(klm_align);
> +	/* Clear the padding memory. */
> +	memset((uint8_t *)&mkey_job->imkey_array[klm_size], 0,
> +	       sizeof(struct mlx5_klm) * (klm_align - klm_size));
> +
> +	/* Add the following RegEx WQE with fence. */
> +	wqe = (struct mlx5_wqe_ctrl_seg *)
> +				(((uint8_t *)wqe) +
> MLX5_REGEX_UMR_WQE_SIZE);
> +	wqe->fm_ce_se |= MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE;
> +}
> +
> +static inline void
> +prep_nop_regex_wqe_set(struct mlx5_regex_priv *priv, struct mlx5_regex_sq
> *sq,
> +		       struct rte_regex_ops *op, struct mlx5_regex_job *job,
> +		       size_t pi, struct mlx5_klm *klm)
> +{
> +	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
> +			    (MLX5_SEND_WQE_BB << 2);
> +	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg
> *)((uint8_t *)
> +				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
> +
> +	/* Clear the WQE memory used as UMR WQE previously. */
> +	if ((rte_be_to_cpu_32(wqe->opmod_idx_opcode) & 0xff) !=
> MLX5_OPCODE_NOP)
> +		memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
> +	/* UMR WQE size is 9 DS, align nop WQE to 3 WQEBBS(12 DS). */
> +	set_wqe_ctrl_seg(wqe, pi * 4, MLX5_OPCODE_NOP, 0, sq->sq_obj.sq-
> >id,
> +			 0, 12, 0, 0);
> +	__prep_one(priv, sq, op, job, pi, klm);
> +}
> +
> +static inline void
> +prep_regex_umr_wqe_set(struct mlx5_regex_priv *priv, struct
> mlx5_regex_qp *qp,
> +	 struct mlx5_regex_sq *sq, struct rte_regex_ops **op, size_t nb_ops)
> +{
> +	struct mlx5_regex_job *job = NULL;
> +	size_t sqid = sq->sqn, mkey_job_id = 0;
> +	size_t left_ops = nb_ops;
> +	uint32_t klm_num = 0, len;
> +	struct mlx5_klm *mkey_klm = NULL;
> +	struct mlx5_klm klm;
> +
> +	sqid = sq->sqn;
> +	while (left_ops--)
> +		rte_prefetch0(op[left_ops]);
> +	left_ops = nb_ops;
> +	/*
> +	 * Build the WQE set by reverse. In case the burst may consume
> +	 * multiple mkeys, build the WQE set as normal will hard to
> +	 * address the last mkey index, since we will only know the last
> +	 * RegEx WQE's index when finishes building.
> +	 */
> +	while (left_ops--) {
> +		struct rte_mbuf *mbuf = op[left_ops]->mbuf;
> +		size_t pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, left_ops);
> +
> +		if (mbuf->nb_segs > 1) {
> +			size_t scatter_size = 0;
> +
> +			if (!mkey_klm_available(mkey_klm, klm_num,
> +						mbuf->nb_segs)) {
> +				/*
> +				 * The mkey's KLM is full, create the UMR
> +				 * WQE in the next WQE set.
> +				 */
> +				if (mkey_klm)
> +					complete_umr_wqe(qp, sq,
> +						&qp->jobs[mkey_job_id],
> +
> 	MLX5_REGEX_UMR_SQ_PI_IDX(pi, 1),
> +						klm_num, len);
> +				/*
> +				 * Get the indircet mkey and KLM array index
> +				 * from the last WQE set.
> +				 */
> +				mkey_job_id = job_id_get(sqid,
> +							 sq_size_get(sq), pi);
> +				mkey_klm = qp-
> >jobs[mkey_job_id].imkey_array;
> +				klm_num = 0;
> +				len = 0;
> +			}
> +			/* Build RegEx WQE's data segment KLM. */
> +			klm.address = len;
> +			klm.mkey = rte_cpu_to_be_32
> +					(qp->jobs[mkey_job_id].imkey->id);
> +			while (mbuf) {
> +				/* Build indirect mkey seg's KLM. */
> +				mkey_klm->mkey =
> mlx5_mr_addr2mr_bh(priv->pd,
> +					NULL, &priv->mr_scache, &qp-
> >mr_ctrl,
> +					rte_pktmbuf_mtod(mbuf, uintptr_t),
> +					!!(mbuf->ol_flags &
> EXT_ATTACHED_MBUF));
> +				mkey_klm->address = rte_cpu_to_be_64
> +					(rte_pktmbuf_mtod(mbuf, uintptr_t));
> +				mkey_klm->byte_count = rte_cpu_to_be_32
> +
> 	(rte_pktmbuf_data_len(mbuf));
> +				/*
> +				 * Save the mbuf's total size for RegEx data
> +				 * segment.
> +				 */
> +				scatter_size += rte_pktmbuf_data_len(mbuf);
> +				mkey_klm++;
> +				klm_num++;
> +				mbuf = mbuf->next;
> +			}
> +			len += scatter_size;
> +			klm.byte_count = scatter_size;
> +		} else {
> +			/* The single mubf case. Build the KLM directly. */
> +			klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, NULL,
> +					&priv->mr_scache, &qp->mr_ctrl,
> +					rte_pktmbuf_mtod(mbuf, uintptr_t),
> +					!!(mbuf->ol_flags &
> EXT_ATTACHED_MBUF));
> +			klm.address = rte_pktmbuf_mtod(mbuf, uintptr_t);
> +			klm.byte_count = rte_pktmbuf_data_len(mbuf);
> +		}
> +		job = &qp->jobs[job_id_get(sqid, sq_size_get(sq), pi)];
> +		/*
> +		 * Build the nop + RegEx WQE set by default. The fist nop WQE
> +		 * will be updated later as UMR WQE if scattered mubf exist.
> +		 */
> +		prep_nop_regex_wqe_set(priv, sq, op[left_ops], job, pi, &klm);
> +	}
> +	/*
> +	 * Scattered mbuf have been added to the KLM array. Complete the
> build
> +	 * of UMR WQE, update the first nop WQE as UMR WQE.
> +	 */
> +	if (mkey_klm)
> +		complete_umr_wqe(qp, sq, &qp->jobs[mkey_job_id], sq->pi,
> +				 klm_num, len);
> +	sq->db_pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops - 1);
> +	sq->pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops);
> +}
> +
> +uint16_t
> +mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
> +			  struct rte_regex_ops **ops, uint16_t nb_ops)
> +{
> +	struct mlx5_regex_priv *priv = dev->data->dev_private;
> +	struct mlx5_regex_qp *queue = &priv->qps[qp_id];
> +	struct mlx5_regex_sq *sq;
> +	size_t sqid, nb_left = nb_ops, nb_desc;
> +
> +	while ((sqid = ffs(queue->free_sqs))) {
> +		sqid--; /* ffs returns 1 for bit 0 */
> +		sq = &queue->sqs[sqid];
> +		nb_desc = get_free(sq);
> +		if (nb_desc) {
> +			/* The ops be handled can't exceed nb_ops. */
> +			if (nb_desc > nb_left)
> +				nb_desc = nb_left;
> +			else
> +				queue->free_sqs &= ~(1 << sqid);
> +			prep_regex_umr_wqe_set(priv, queue, sq, ops,
> nb_desc);
> +			send_doorbell(priv, sq);
> +			nb_left -= nb_desc;
> +		}
> +		if (!nb_left)
> +			break;
> +		ops += nb_desc;
> +	}
> +	nb_ops -= nb_left;
> +	queue->pi += nb_ops;
> +	return nb_ops;
> +}
> +#endif
> +
>  uint16_t
>  mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
>  		      struct rte_regex_ops **ops, uint16_t nb_ops)
> @@ -186,17 +418,17 @@ mlx5_regexdev_enqueue(struct rte_regexdev *dev,
> uint16_t qp_id,
>  	while ((sqid = ffs(queue->free_sqs))) {
>  		sqid--; /* ffs returns 1 for bit 0 */
>  		sq = &queue->sqs[sqid];
> -		while (can_send(sq)) {
> +		while (get_free(sq)) {
>  			job_id = job_id_get(sqid, sq_size_get(sq), sq->pi);
>  			prep_one(priv, queue, sq, ops[i], &queue-
> >jobs[job_id]);
>  			i++;
>  			if (unlikely(i == nb_ops)) {
> -				send_doorbell(priv->uar, sq);
> +				send_doorbell(priv, sq);
>  				goto out;
>  			}
>  		}
>  		queue->free_sqs &= ~(1 << sqid);
> -		send_doorbell(priv->uar, sq);
> +		send_doorbell(priv, sq);
>  	}
> 
>  out:
> @@ -308,6 +540,10 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev,
> uint16_t qp_id,
>  			  MLX5_REGEX_MAX_WQE_INDEX;
>  		size_t sqid = cqe->rsvd3[2];
>  		struct mlx5_regex_sq *sq = &queue->sqs[sqid];
> +
> +		/* UMR mode WQE counter move as WQE set(4 WQEBBS).*/
> +		if (priv->has_umr)
> +			wq_counter >>= 2;
>  		while (sq->ci != wq_counter) {
>  			if (unlikely(i == nb_ops)) {
>  				/* Return without updating cq->ci */
> @@ -316,7 +552,9 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev,
> uint16_t qp_id,
>  			uint32_t job_id = job_id_get(sqid, sq_size_get(sq),
>  						     sq->ci);
>  			extract_result(ops[i], &queue->jobs[job_id]);
> -			sq->ci = (sq->ci + 1) &
> MLX5_REGEX_MAX_WQE_INDEX;
> +			sq->ci = (sq->ci + 1) & (priv->has_umr ?
> +				 (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
> +				  MLX5_REGEX_MAX_WQE_INDEX);
>  			i++;
>  		}
>  		cq->ci = (cq->ci + 1) & 0xffffff;
> @@ -331,7 +569,7 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev,
> uint16_t qp_id,
>  }
> 
>  static void
> -setup_sqs(struct mlx5_regex_qp *queue)
> +setup_sqs(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *queue)
>  {
>  	size_t sqid, entry;
>  	uint32_t job_id;
> @@ -342,6 +580,14 @@ setup_sqs(struct mlx5_regex_qp *queue)
>  			job_id = sqid * sq_size_get(sq) + entry;
>  			struct mlx5_regex_job *job = &queue->jobs[job_id];
> 
> +			/* Fill UMR WQE with NOP in advanced. */
> +			if (priv->has_umr) {
> +				set_wqe_ctrl_seg
> +					((struct mlx5_wqe_ctrl_seg *)wqe,
> +					 entry * 2, MLX5_OPCODE_NOP, 0,
> +					 sq->sq_obj.sq->id, 0, 12, 0, 0);
> +				wqe += MLX5_REGEX_UMR_WQE_SIZE;
> +			}
>  			set_metadata_seg((struct mlx5_wqe_metadata_seg *)
>  					 (wqe +
> MLX5_REGEX_WQE_METADATA_OFFSET),
>  					 0, queue->metadata->lkey,
> @@ -358,8 +604,9 @@ setup_sqs(struct mlx5_regex_qp *queue)
>  }
> 
>  static int
> -setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
> +setup_buffers(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp)
>  {
> +	struct ibv_pd *pd = priv->pd;
>  	uint32_t i;
>  	int err;
> 
> @@ -395,6 +642,24 @@ setup_buffers(struct mlx5_regex_qp *qp, struct
> ibv_pd *pd)
>  		goto err_output;
>  	}
> 
> +	if (priv->has_umr) {
> +		ptr = rte_calloc(__func__, qp->nb_desc,
> MLX5_REGEX_KLMS_SIZE,
> +				 MLX5_REGEX_KLMS_SIZE);
> +		if (!ptr) {
> +			err = -ENOMEM;
> +			goto err_imkey;
> +		}
> +		qp->imkey_addr = mlx5_glue->reg_mr(pd, ptr,
> +					MLX5_REGEX_KLMS_SIZE * qp-
> >nb_desc,
> +					IBV_ACCESS_LOCAL_WRITE);
> +		if (!qp->imkey_addr) {
> +			rte_free(ptr);
> +			DRV_LOG(ERR, "Failed to register output");
> +			err = -EINVAL;
> +			goto err_imkey;
> +		}
> +	}
> +
>  	/* distribute buffers to jobs */
>  	for (i = 0; i < qp->nb_desc; i++) {
>  		qp->jobs[i].output =
> @@ -403,9 +668,18 @@ setup_buffers(struct mlx5_regex_qp *qp, struct
> ibv_pd *pd)
>  		qp->jobs[i].metadata =
>  			(uint8_t *)qp->metadata->addr +
>  			(i % qp->nb_desc) * MLX5_REGEX_METADATA_SIZE;
> +		if (qp->imkey_addr)
> +			qp->jobs[i].imkey_array = (struct mlx5_klm *)
> +				qp->imkey_addr->addr +
> +				(i % qp->nb_desc) *
> MLX5_REGEX_MAX_KLM_NUM;
>  	}
> +
>  	return 0;
> 
> +err_imkey:
> +	ptr = qp->outputs->addr;
> +	rte_free(ptr);
> +	mlx5_glue->dereg_mr(qp->outputs);
>  err_output:
>  	ptr = qp->metadata->addr;
>  	rte_free(ptr);
> @@ -417,23 +691,57 @@ int
>  mlx5_regexdev_setup_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
>  {
>  	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
> -	int err;
> +	struct mlx5_klm klm = { 0 };
> +	struct mlx5_devx_mkey_attr attr = {
> +		.klm_array = &klm,
> +		.klm_num = 1,
> +		.umr_en = 1,
> +	};
> +	uint32_t i;
> +	int err = 0;
> 
>  	qp->jobs = rte_calloc(__func__, qp->nb_desc, sizeof(*qp->jobs), 64);
>  	if (!qp->jobs)
>  		return -ENOMEM;
> -	err = setup_buffers(qp, priv->pd);
> +	err = setup_buffers(priv, qp);
>  	if (err) {
>  		rte_free(qp->jobs);
>  		return err;
>  	}
> -	setup_sqs(qp);
> -	return 0;
> +
> +	setup_sqs(priv, qp);
> +
> +	if (priv->has_umr) {
> +#ifdef HAVE_IBV_FLOW_DV_SUPPORT
> +		if (regex_get_pdn(priv->pd, &attr.pd)) {
> +			err = -rte_errno;
> +			DRV_LOG(ERR, "Failed to get pdn.");
> +			mlx5_regexdev_teardown_fastpath(priv, qp_id);
> +			return err;
> +		}
> +#endif
> +		for (i = 0; i < qp->nb_desc; i++) {
> +			attr.klm_num = MLX5_REGEX_MAX_KLM_NUM;
> +			attr.klm_array = qp->jobs[i].imkey_array;
> +			qp->jobs[i].imkey =
> mlx5_devx_cmd_mkey_create(priv->ctx,
> +								      &attr);
> +			if (!qp->jobs[i].imkey) {
> +				err = -rte_errno;
> +				DRV_LOG(ERR, "Failed to allocate imkey.");
> +				mlx5_regexdev_teardown_fastpath(priv,
> qp_id);
> +			}
> +		}
> +	}
> +	return err;
>  }
> 
>  static void
>  free_buffers(struct mlx5_regex_qp *qp)
>  {
> +	if (qp->imkey_addr) {
> +		mlx5_glue->dereg_mr(qp->imkey_addr);
> +		rte_free(qp->imkey_addr->addr);
> +	}
>  	if (qp->metadata) {
>  		mlx5_glue->dereg_mr(qp->metadata);
>  		rte_free(qp->metadata->addr);
> @@ -448,8 +756,14 @@ void
>  mlx5_regexdev_teardown_fastpath(struct mlx5_regex_priv *priv, uint32_t
> qp_id)
>  {
>  	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
> +	uint32_t i;
> 
>  	if (qp) {
> +		for (i = 0; i < qp->nb_desc; i++) {
> +			if (qp->jobs[i].imkey)
> +				claim_zero(mlx5_devx_cmd_destroy
> +							(qp->jobs[i].imkey));
> +		}
>  		free_buffers(qp);
>  		if (qp->jobs)
>  			rte_free(qp->jobs);
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode
  2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode Suanming Mou
@ 2021-03-29  9:35     ` Ori Kam
  0 siblings, 0 replies; 36+ messages in thread
From: Ori Kam @ 2021-03-29  9:35 UTC (permalink / raw)
  To: Suanming Mou
  Cc: dev, Slava Ovsiienko, Matan Azrad, Raslan Darawsheh, John Hurley

Hi

> -----Original Message-----
> From: Suanming Mou <suanmingm@nvidia.com>
> 
> From: John Hurley <jhurley@nvidia.com>
> 
> A recent change adds support for scattered mbuf and UMR support for regex.
> Part of this commit makes the pi and ci counters of the regex_sq a quarter
> of the length in non umr mode, effectively moving them from 16 bits to
> 14. The new get_free method casts the difference in pi and ci to a 16 bit
> value when calculating the free send queues, accounting for any wrapping
> when pi has looped back to 0 but ci has not yet. However, the move to 14
> bits while still casting to 16 can now lead to corrupted, large values
> returned.
> 
> Modify the get_free function to take in the has_umr flag and, accordingly,
> account for wrapping on either 14 or 16 bit pi/ci difference.
> 
> Fixes: a20fe8e74dea ("regex/mlx5: add data path scattered mbuf process")
> Signed-off-by: John Hurley <jhurley@nvidia.com>
> ---
>  drivers/regex/mlx5/mlx5_regex_fastpath.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c
> b/drivers/regex/mlx5/mlx5_regex_fastpath.c
> index 4f9402c583..b57e7d7794 100644
> --- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
> +++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
> @@ -192,8 +192,10 @@ send_doorbell(struct mlx5_regex_priv *priv, struct
> mlx5_regex_sq *sq)
>  }
> 
>  static inline int
> -get_free(struct mlx5_regex_sq *sq) {
> -	return (sq_size_get(sq) - (uint16_t)(sq->pi - sq->ci));
> +get_free(struct mlx5_regex_sq *sq, uint8_t has_umr) {
> +	return (sq_size_get(sq) - ((sq->pi - sq->ci) &
> +			(has_umr ? (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
> +			MLX5_REGEX_MAX_WQE_INDEX)));
>  }
> 
>  static inline uint32_t
> @@ -385,7 +387,7 @@ mlx5_regexdev_enqueue_gga(struct rte_regexdev
> *dev, uint16_t qp_id,
>  	while ((sqid = ffs(queue->free_sqs))) {
>  		sqid--; /* ffs returns 1 for bit 0 */
>  		sq = &queue->sqs[sqid];
> -		nb_desc = get_free(sq);
> +		nb_desc = get_free(sq, priv->has_umr);
>  		if (nb_desc) {
>  			/* The ops be handled can't exceed nb_ops. */
>  			if (nb_desc > nb_left)
> @@ -418,7 +420,7 @@ mlx5_regexdev_enqueue(struct rte_regexdev *dev,
> uint16_t qp_id,
>  	while ((sqid = ffs(queue->free_sqs))) {
>  		sqid--; /* ffs returns 1 for bit 0 */
>  		sq = &queue->sqs[sqid];
> -		while (get_free(sq)) {
> +		while (get_free(sq, priv->has_umr)) {
>  			job_id = job_id_get(sqid, sq_size_get(sq), sq->pi);
>  			prep_one(priv, queue, sq, ops[i], &queue-
> >jobs[job_id]);
>  			i++;
> --
> 2.25.1

Acked-by: Ori Kam <orika@nvidia.com>
Best,
Ori


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/4] regex/mlx5: add data path scattered mbuf process
  2021-03-29  9:34     ` Ori Kam
@ 2021-03-29  9:52       ` Suanming Mou
  0 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-29  9:52 UTC (permalink / raw)
  To: Ori Kam; +Cc: dev, Slava Ovsiienko, Matan Azrad, Raslan Darawsheh

Hi Ori,

> -----Original Message-----
> From: Ori Kam <orika@nvidia.com>
> Sent: Monday, March 29, 2021 5:35 PM
> To: Suanming Mou <suanmingm@nvidia.com>
> Cc: dev@dpdk.org; Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Raslan Darawsheh <rasland@nvidia.com>
> Subject: RE: [PATCH v2 2/4] regex/mlx5: add data path scattered mbuf process
> 
> Hi Mou,
> 
> PSB one small comment,
> Please update and feel free to add my ack.
> 
> > diff --git a/doc/guides/tools/testregex.rst
> > b/doc/guides/tools/testregex.rst index a59acd919f..cdb1ffd6ee 100644
> > --- a/doc/guides/tools/testregex.rst
> > +++ b/doc/guides/tools/testregex.rst
> > @@ -68,6 +68,9 @@ Application Options
> >  ``--nb_iter N``
> >    number of iteration to run
> >
> > +``--nb_segs N``
> > +  number of mbuf segment
> > +
> >  ``--help``
> >    print application options
> 
> I don't think this is part of this patch.
> It should belong to the app patch.
> 
Yes, will fix that in v3. Thanks.

BR,
SuanmingMou

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v3 0/4] regex/mlx5: support scattered mbuf
  2021-03-09 23:57 [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Suanming Mou
                   ` (4 preceding siblings ...)
  2021-03-25  4:32 ` [dpdk-dev] [PATCH v2 0/4] " Suanming Mou
@ 2021-03-30  1:39 ` Suanming Mou
  2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 1/4] common/mlx5: add user memory registration bits Suanming Mou
                     ` (3 more replies)
  2021-03-31  7:37 ` [dpdk-dev] [PATCH v4 0/4] regex/mlx5: support scattered mbuf Suanming Mou
  2021-04-07  7:21 ` [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf Suanming Mou
  7 siblings, 4 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-30  1:39 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

Scattered mbufs were not supported in the mlx5 RegEx driver. This patch
set adds support for scattered mbufs by means of UMR WQEs.

A UMR WQE can map the memory space of multiple mkeys into one contiguous
space. Using a UMR WQE, the scattered mbuf of one operation can be
converted to an indirect mkey, so the RegEx engine, which accepts only
one mkey, can now process the whole scattered mbuf.

The maximum number of scattered mbuf segments supported in one UMR WQE
is currently defined as 64. The scattered mbufs of multiple operations
can be added to one UMR WQE if there is enough space in the KLM array,
since each operation can address its own mbuf content by the mkey's
address and length. However, one operation's scattered mbuf cannot be
split across the KLM arrays of two different UMR WQEs; if the UMR WQE's
KLM array does not have enough free space for one operation, a new UMR
WQE is required.
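
The packing rule above can be sketched as follows. This is only an
illustration, not the driver code: the sketch_* names are hypothetical,
and the limit of 64 is taken from the text above (the constant in the
driver itself may differ).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed limit, taken from the cover letter text; hypothetical name. */
#define SKETCH_MAX_KLM 64u

/* Return non-zero if nb_segs more KLM entries fit after 'used' entries. */
static inline int
sketch_klm_fits(uint32_t used, uint32_t nb_segs)
{
	return used + nb_segs <= SKETCH_MAX_KLM;
}

/*
 * Count the UMR WQEs (indirect mkeys) a burst needs. nb_segs[i] is the
 * segment count of operation i. Single-segment operations use the mbuf's
 * own mkey and consume no KLM entries. A scattered operation is never
 * split: if it does not fit in the current KLM array, a new one is opened.
 */
static size_t
sketch_count_umr_wqes(const uint16_t *nb_segs, size_t nb_ops)
{
	size_t umr_wqes = 0;
	uint32_t used = 0;
	int open = 0; /* Non-zero once a KLM array is in use. */
	size_t i;

	for (i = 0; i < nb_ops; i++) {
		if (nb_segs[i] <= 1)
			continue;
		/* One operation must fit in a single KLM array. */
		assert(nb_segs[i] <= SKETCH_MAX_KLM);
		if (!open || !sketch_klm_fits(used, nb_segs[i])) {
			umr_wqes++;
			used = 0;
			open = 1;
		}
		used += nb_segs[i];
	}
	return umr_wqes;
}
```

For example, two operations with 40 segments each cannot share one KLM
array under this limit, so the sketch counts two UMR WQEs for them.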

To keep the UMR WQE's indirect mkey from being overwritten when the
SQ's WQEs wrap around, the index of the mkey used by the UMR WQE should
be the index of the last RegEx WQE among the operations. As one
operation consumes one WQE set, building the RegEx WQEs in reverse
order helps address the mkey more efficiently. When the operations in
one burst consume multiple mkeys, each time the mkey KLM array becomes
full, the current reverse WQE set index is always the last one for the
new mkey of the new UMR WQE.

In GGA mode, the SQ WQE memory layout becomes interleaved UMR/NOP and
RegEx WQEs. A UMR (or NOP) WQE together with its RegEx WQE is called a
WQE set. The SQ's pi and ci are also incremented per WQE set, not per
WQE.
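
A rough sketch of the resulting index arithmetic, assuming a WQE set
occupies 4 WQEBBs (3 for the UMR/NOP WQE, 1 for the RegEx WQE) and that
the hardware WQE index is 16 bits wide. The sketch_* names and the
0xffff mask are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 16-bit hardware WQE index; hypothetical names throughout. */
#define SKETCH_MAX_WQE_INDEX 0xffffu
/* pi/ci count WQE sets of 4 WQEBBs, so they wrap at a quarter of that. */
#define SKETCH_SET_INDEX_MASK (SKETCH_MAX_WQE_INDEX >> 2)

/* WQEBB index of the UMR/NOP WQE that opens WQE set 'set_pi'. */
static inline uint32_t
sketch_umr_wqe_index(uint32_t set_pi)
{
	return (set_pi * 4) & SKETCH_MAX_WQE_INDEX;
}

/* WQEBB index of the RegEx WQE, placed after the 3-WQEBB UMR/NOP WQE. */
static inline uint32_t
sketch_regex_wqe_index(uint32_t set_pi)
{
	return (set_pi * 4 + 3) & SKETCH_MAX_WQE_INDEX;
}

/* Advance a WQE set counter by n sets, wrapping at 14 bits. */
static inline uint32_t
sketch_next_set_pi(uint32_t set_pi, uint32_t n)
{
	return (set_pi + n) & SKETCH_SET_INDEX_MASK;
}

/* Free WQE sets in an SQ holding sq_size sets, tolerating pi wrap. */
static inline uint32_t
sketch_free_sets(uint32_t sq_size, uint32_t pi, uint32_t ci)
{
	assert(sq_size <= SKETCH_SET_INDEX_MASK + 1);
	return sq_size - ((pi - ci) & SKETCH_SET_INDEX_MASK);
}
```

The free-slot helper masks the pi/ci difference with the 14-bit set
mask. Truncating the difference to 16 bits instead would give a wrong
result once pi has wrapped to a small value while ci has not, which is
the issue addressed by the get_free fix in this series.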

For operations that do not have a scattered mbuf, the mbuf's mkey is
used directly and the WQE set combination is NOP + RegEx.
For operations that have a scattered mbuf but share the UMR WQE with
others, the WQE set combination is NOP + RegEx.
For operations that complete the UMR WQE, the WQE set combination is
UMR + RegEx.
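
As an illustration of the three cases, the sketch below walks a burst in
reverse and reports which WQE sets would carry a UMR WQE instead of a
NOP. It is a simplified model of the scheme described above, not the
driver code: the sketch_* names are hypothetical, the KLM limit of 64 is
taken from the text, and the burst is limited to 64 operations so the
result fits in a bitmask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed limit, taken from the cover letter text; hypothetical name. */
#define SKETCH_MAX_KLM 64u

/*
 * Return a bitmask where bit i is set if WQE set i of the burst is
 * UMR + RegEx, and clear if it is NOP + RegEx. nb_segs[i] is the segment
 * count of operation i. The burst is walked in reverse: when a scattered
 * operation does not fit in the current KLM array, the UMR WQE for the
 * operations gathered so far lands in the set just after it, and a new
 * KLM array is started. The last open KLM array is completed in set 0.
 */
static uint64_t
sketch_umr_set_mask(const uint16_t *nb_segs, size_t nb_ops)
{
	uint64_t mask = 0;
	uint32_t used = 0;
	int open = 0; /* Non-zero once a KLM array is in use. */
	size_t i = nb_ops;

	assert(nb_ops <= 64);
	while (i--) {
		if (nb_segs[i] <= 1)
			continue; /* Uses the mbuf's mkey directly. */
		assert(nb_segs[i] <= SKETCH_MAX_KLM);
		if (!open || used + nb_segs[i] > SKETCH_MAX_KLM) {
			if (open)
				mask |= UINT64_C(1) << (i + 1);
			open = 1;
			used = 0;
		}
		used += nb_segs[i];
	}
	if (open)
		mask |= UINT64_C(1); /* First set of the burst. */
	return mask;
}
```

In this model a burst with segment counts {1, 40, 40, 1} needs two
indirect mkeys, so sets 0 and 2 become UMR + RegEx while sets 1 and 3
stay NOP + RegEx. Note that set 0 carries a UMR WQE even though its own
operation is not scattered; the UMR only borrows that set's NOP slot.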

v3:
1. Move testregex.rst change to the correct commit.
2. Code rebase to the latest version.

v2:
1. Check for multi-segment mbufs by nb_segs.
2. Add ops prefetch.
3. Allocate ops and mbuf memory together in test application.
4. Fix the incorrect ci and pi issue.


John Hurley (1):
  regex/mlx5: prevent wrong calculation of free sqs in umr mode

Suanming Mou (3):
  common/mlx5: add user memory registration bits
  regex/mlx5: add data path scattered mbuf process
  app/test-regex: support scattered mbuf input

 app/test-regex/main.c                    | 134 ++++++--
 doc/guides/regexdevs/mlx5.rst            |   5 +
 doc/guides/rel_notes/release_21_05.rst   |   4 +
 doc/guides/tools/testregex.rst           |   3 +
 drivers/common/mlx5/linux/meson.build    |   2 +
 drivers/common/mlx5/mlx5_devx_cmds.c     |   5 +
 drivers/common/mlx5/mlx5_devx_cmds.h     |   3 +
 drivers/regex/mlx5/mlx5_regex.c          |   9 +
 drivers/regex/mlx5/mlx5_regex.h          |  26 +-
 drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 380 +++++++++++++++++++++--
 11 files changed, 531 insertions(+), 83 deletions(-)

-- 
2.25.1



* [dpdk-dev] [PATCH v3 1/4] common/mlx5: add user memory registration bits
  2021-03-30  1:39 ` [dpdk-dev] [PATCH v3 0/4] regex/mlx5: support scattered mbuf Suanming Mou
@ 2021-03-30  1:39   ` Suanming Mou
  2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 2/4] regex/mlx5: add data path scattered mbuf process Suanming Mou
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-30  1:39 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

This commit adds the UMR capability bits.

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
---
 drivers/common/mlx5/linux/meson.build | 2 ++
 drivers/common/mlx5/mlx5_devx_cmds.c  | 5 +++++
 drivers/common/mlx5/mlx5_devx_cmds.h  | 3 +++
 3 files changed, 10 insertions(+)

diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 220de35420..5d6a861689 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -186,6 +186,8 @@ has_sym_args = [
 	'mlx5dv_dr_action_create_aso' ],
 	[ 'HAVE_INFINIBAND_VERBS_H', 'infiniband/verbs.h',
 	'INFINIBAND_VERBS_H' ],
+        [ 'HAVE_MLX5_UMR_IMKEY', 'infiniband/mlx5dv.h',
+        'MLX5_WQE_UMR_CTRL_FLAG_INLINE' ],
 ]
 config = configuration_data()
 foreach arg:has_sym_args
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.c b/drivers/common/mlx5/mlx5_devx_cmds.c
index c90e020643..268bcd0d99 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.c
+++ b/drivers/common/mlx5/mlx5_devx_cmds.c
@@ -266,6 +266,7 @@ mlx5_devx_cmd_mkey_create(void *ctx,
 	MLX5_SET(mkc, mkc, qpn, 0xffffff);
 	MLX5_SET(mkc, mkc, pd, attr->pd);
 	MLX5_SET(mkc, mkc, mkey_7_0, attr->umem_id & 0xFF);
+	MLX5_SET(mkc, mkc, umr_en, attr->umr_en);
 	MLX5_SET(mkc, mkc, translations_octword_size, translation_size);
 	MLX5_SET(mkc, mkc, relaxed_ordering_write,
 		 attr->relaxed_ordering_write);
@@ -752,6 +753,10 @@ mlx5_devx_cmd_query_hca_attr(void *ctx,
 						mini_cqe_resp_flow_tag);
 	attr->mini_cqe_resp_l3_l4_tag = MLX5_GET(cmd_hca_cap, hcattr,
 						 mini_cqe_resp_l3_l4_tag);
+	attr->umr_indirect_mkey_disabled =
+		MLX5_GET(cmd_hca_cap, hcattr, umr_indirect_mkey_disabled);
+	attr->umr_modify_entity_size_disabled =
+		MLX5_GET(cmd_hca_cap, hcattr, umr_modify_entity_size_disabled);
 	if (attr->qos.sup) {
 		MLX5_SET(query_hca_cap_in, in, op_mod,
 			 MLX5_GET_HCA_CAP_OP_MOD_QOS_CAP |
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h b/drivers/common/mlx5/mlx5_devx_cmds.h
index 2826c0b2c6..67b5f771c6 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.h
+++ b/drivers/common/mlx5/mlx5_devx_cmds.h
@@ -31,6 +31,7 @@ struct mlx5_devx_mkey_attr {
 	uint32_t pg_access:1;
 	uint32_t relaxed_ordering_write:1;
 	uint32_t relaxed_ordering_read:1;
+	uint32_t umr_en:1;
 	struct mlx5_klm *klm_array;
 	int klm_num;
 };
@@ -151,6 +152,8 @@ struct mlx5_hca_attr {
 	uint32_t log_max_mmo_dma:5;
 	uint32_t log_max_mmo_compress:5;
 	uint32_t log_max_mmo_decompress:5;
+	uint32_t umr_modify_entity_size_disabled:1;
+	uint32_t umr_indirect_mkey_disabled:1;
 };
 
 struct mlx5_devx_wq_attr {
-- 
2.25.1



* [dpdk-dev] [PATCH v3 2/4] regex/mlx5: add data path scattered mbuf process
  2021-03-30  1:39 ` [dpdk-dev] [PATCH v3 0/4] regex/mlx5: support scattered mbuf Suanming Mou
  2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 1/4] common/mlx5: add user memory registration bits Suanming Mou
@ 2021-03-30  1:39   ` Suanming Mou
  2021-03-30  8:05     ` Slava Ovsiienko
  2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 3/4] app/test-regex: support scattered mbuf input Suanming Mou
  2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode Suanming Mou
  3 siblings, 1 reply; 36+ messages in thread
From: Suanming Mou @ 2021-03-30  1:39 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

A UMR WQE can convert the memory space of multiple mkeys into one
contiguous space. Taking advantage of the UMR WQE, the scattered mbuf in
one operation can be converted to an indirect mkey. The RegEx engine,
which only accepts one mkey, can now process the whole scattered mbuf.

The maximum number of scattered mbufs that can be supported in one UMR
WQE is now defined as 64. The scattered mbufs of multiple operations can
be added to one UMR WQE if there is enough space in the KLM array, since
the operations can address their own mbuf's content by the mkey's
address and length. However, one operation's scattered mbufs can't be
split across the KLM arrays of two different UMR WQEs. If the current
UMR WQE's KLM array does not have enough free space for one operation,
a new UMR WQE is required.

To keep the UMR WQE's indirect mkey from being overwritten when the SQ's
WQEs wrap around, the mkey index used by the UMR WQE should be the index
of the last RegEx WQE in the operations. As one operation consumes one
WQE set, building the RegEx WQEs in reverse order helps address the mkey
more efficiently. When the operations in one burst consume multiple
mkeys, each time the mkey KLM array is full, the reverse WQE set index
is always the last one of the new mkey for the new UMR WQE.

In GGA mode, the SQ WQE memory layout becomes interleaved UMR/NOP and
RegEx WQEs. One UMR/NOP WQE together with one RegEx WQE is called a WQE
set. The SQ's pi and ci are also advanced per WQE set, not per WQE.

For operations that don't have a scattered mbuf and use the mbuf's mkey
directly, the WQE set combination is NOP + RegEx.
For operations that have a scattered mbuf but share the UMR WQE with
others, the WQE set combination is NOP + RegEx.
For operations that complete the UMR WQE, the WQE set combination is
UMR + RegEx.

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
---
 doc/guides/regexdevs/mlx5.rst            |   5 +
 doc/guides/rel_notes/release_21_05.rst   |   4 +
 drivers/regex/mlx5/mlx5_regex.c          |   9 +
 drivers/regex/mlx5/mlx5_regex.h          |  26 +-
 drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 378 +++++++++++++++++++++--
 6 files changed, 407 insertions(+), 58 deletions(-)

diff --git a/doc/guides/regexdevs/mlx5.rst b/doc/guides/regexdevs/mlx5.rst
index faaa6ac11d..45a0b96980 100644
--- a/doc/guides/regexdevs/mlx5.rst
+++ b/doc/guides/regexdevs/mlx5.rst
@@ -35,6 +35,11 @@ be specified as device parameter. The RegEx device can be probed and used with
 other Mellanox devices, by adding more options in the class.
 For example: ``class=net:regex`` will probe both the net PMD and the RegEx PMD.
 
+Features
+--------
+
+- Multi segments mbuf support.
+
 Supported NICs
 --------------
 
diff --git a/doc/guides/rel_notes/release_21_05.rst b/doc/guides/rel_notes/release_21_05.rst
index 3c76148b11..c3d6b8e8ae 100644
--- a/doc/guides/rel_notes/release_21_05.rst
+++ b/doc/guides/rel_notes/release_21_05.rst
@@ -119,6 +119,10 @@ New Features
   * Added command to display Rx queue used descriptor count.
     ``show port (port_id) rxq (queue_id) desc used count``
 
+* **Updated Mellanox RegEx PMD.**
+
+  * Added support for multi segments mbuf.
+
 
 Removed Items
 -------------
diff --git a/drivers/regex/mlx5/mlx5_regex.c b/drivers/regex/mlx5/mlx5_regex.c
index ac5b205fa9..82c485e50c 100644
--- a/drivers/regex/mlx5/mlx5_regex.c
+++ b/drivers/regex/mlx5/mlx5_regex.c
@@ -199,6 +199,13 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	}
 	priv->regexdev->dev_ops = &mlx5_regexdev_ops;
 	priv->regexdev->enqueue = mlx5_regexdev_enqueue;
+#ifdef HAVE_MLX5_UMR_IMKEY
+	if (!attr.umr_indirect_mkey_disabled &&
+	    !attr.umr_modify_entity_size_disabled)
+		priv->has_umr = 1;
+	if (priv->has_umr)
+		priv->regexdev->enqueue = mlx5_regexdev_enqueue_gga;
+#endif
 	priv->regexdev->dequeue = mlx5_regexdev_dequeue;
 	priv->regexdev->device = (struct rte_device *)pci_dev;
 	priv->regexdev->data->dev_private = priv;
@@ -213,6 +220,8 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	    rte_errno = ENOMEM;
 		goto error;
 	}
+	DRV_LOG(INFO, "RegEx GGA is %s.",
+		priv->has_umr ? "supported" : "unsupported");
 	return 0;
 
 error:
diff --git a/drivers/regex/mlx5/mlx5_regex.h b/drivers/regex/mlx5/mlx5_regex.h
index a2b3f0d9f3..51a2101e53 100644
--- a/drivers/regex/mlx5/mlx5_regex.h
+++ b/drivers/regex/mlx5/mlx5_regex.h
@@ -15,6 +15,7 @@
 #include <mlx5_common_devx.h>
 
 #include "mlx5_rxp.h"
+#include "mlx5_regex_utils.h"
 
 struct mlx5_regex_sq {
 	uint16_t log_nb_desc; /* Log 2 number of desc for this object. */
@@ -40,6 +41,7 @@ struct mlx5_regex_qp {
 	struct mlx5_regex_job *jobs;
 	struct ibv_mr *metadata;
 	struct ibv_mr *outputs;
+	struct ibv_mr *imkey_addr; /* Indirect mkey array region. */
 	size_t ci, pi;
 	struct mlx5_mr_ctrl mr_ctrl;
 };
@@ -71,8 +73,29 @@ struct mlx5_regex_priv {
 	struct mlx5_mr_share_cache mr_scache; /* Global shared MR cache. */
 	uint8_t is_bf2; /* The device is BF2 device. */
 	uint8_t sq_ts_format; /* Whether SQ supports timestamp formats. */
+	uint8_t has_umr; /* The device supports UMR. */
 };
 
+#ifdef HAVE_IBV_FLOW_DV_SUPPORT
+static inline int
+regex_get_pdn(void *pd, uint32_t *pdn)
+{
+	struct mlx5dv_obj obj;
+	struct mlx5dv_pd pd_info;
+	int ret = 0;
+
+	obj.pd.in = pd;
+	obj.pd.out = &pd_info;
+	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
+	if (ret) {
+		DRV_LOG(DEBUG, "Fail to get PD object info");
+		return ret;
+	}
+	*pdn = pd_info.pdn;
+	return 0;
+}
+#endif
+
 /* mlx5_regex.c */
 int mlx5_regex_start(struct rte_regexdev *dev);
 int mlx5_regex_stop(struct rte_regexdev *dev);
@@ -108,5 +131,6 @@ uint16_t mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 		       struct rte_regex_ops **ops, uint16_t nb_ops);
 uint16_t mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 		       struct rte_regex_ops **ops, uint16_t nb_ops);
-
+uint16_t mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
+		       struct rte_regex_ops **ops, uint16_t nb_ops);
 #endif /* MLX5_REGEX_H */
diff --git a/drivers/regex/mlx5/mlx5_regex_control.c b/drivers/regex/mlx5/mlx5_regex_control.c
index 55fbb419ed..eef0fe579d 100644
--- a/drivers/regex/mlx5/mlx5_regex_control.c
+++ b/drivers/regex/mlx5/mlx5_regex_control.c
@@ -27,6 +27,9 @@
 
 #define MLX5_REGEX_NUM_WQE_PER_PAGE (4096/64)
 
+#define MLX5_REGEX_WQE_LOG_NUM(has_umr, log_desc) \
+		((has_umr) ? ((log_desc) + 2) : (log_desc))
+
 /**
  * Returns the number of qp obj to be created.
  *
@@ -91,26 +94,6 @@ regex_ctrl_create_cq(struct mlx5_regex_priv *priv, struct mlx5_regex_cq *cq)
 	return 0;
 }
 
-#ifdef HAVE_IBV_FLOW_DV_SUPPORT
-static int
-regex_get_pdn(void *pd, uint32_t *pdn)
-{
-	struct mlx5dv_obj obj;
-	struct mlx5dv_pd pd_info;
-	int ret = 0;
-
-	obj.pd.in = pd;
-	obj.pd.out = &pd_info;
-	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
-	if (ret) {
-		DRV_LOG(DEBUG, "Fail to get PD object info");
-		return ret;
-	}
-	*pdn = pd_info.pdn;
-	return 0;
-}
-#endif
-
 /**
  * Destroy the SQ object.
  *
@@ -168,14 +151,16 @@ regex_ctrl_create_sq(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 	int ret;
 
 	sq->log_nb_desc = log_nb_desc;
+	sq->sqn = q_ind;
 	sq->ci = 0;
 	sq->pi = 0;
 	ret = regex_get_pdn(priv->pd, &pd_num);
 	if (ret)
 		return ret;
 	attr.wq_attr.pd = pd_num;
-	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj, log_nb_desc, &attr,
-				  SOCKET_ID_ANY);
+	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj,
+			MLX5_REGEX_WQE_LOG_NUM(priv->has_umr, log_nb_desc),
+			&attr, SOCKET_ID_ANY);
 	if (ret) {
 		DRV_LOG(ERR, "Can't create SQ object.");
 		rte_errno = ENOMEM;
@@ -225,10 +210,18 @@ mlx5_regex_qp_setup(struct rte_regexdev *dev, uint16_t qp_ind,
 
 	qp = &priv->qps[qp_ind];
 	qp->flags = cfg->qp_conf_flags;
-	qp->cq.log_nb_desc = rte_log2_u32(cfg->nb_desc);
-	qp->nb_desc = 1 << qp->cq.log_nb_desc;
+	log_desc = rte_log2_u32(cfg->nb_desc);
+	/*
+	 * UMR mode requires two WQEs (UMR and RegEx WQE) for one descriptor.
+	 * For the CQ, multiply the CQE number by 2.
+	 * For the SQ, the UMR and RegEx WQEs for one descriptor consume
+	 * 4 WQEBBs, so multiply the WQE number by 4.
+	 */
+	qp->cq.log_nb_desc = log_desc + (!!priv->has_umr);
+	qp->nb_desc = 1 << log_desc;
 	if (qp->flags & RTE_REGEX_QUEUE_PAIR_CFG_OOS_F)
-		qp->nb_obj = regex_ctrl_get_nb_obj(qp->nb_desc);
+		qp->nb_obj = regex_ctrl_get_nb_obj
+			(1 << MLX5_REGEX_WQE_LOG_NUM(priv->has_umr, log_desc));
 	else
 		qp->nb_obj = 1;
 	qp->sqs = rte_malloc(NULL,
diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index beaea7b63f..4f9402c583 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -32,6 +32,15 @@
 #define MLX5_REGEX_WQE_GATHER_OFFSET 32
 #define MLX5_REGEX_WQE_SCATTER_OFFSET 48
 #define MLX5_REGEX_METADATA_OFF 32
+#define MLX5_REGEX_UMR_WQE_SIZE 192
+/* The maximum KLMs can be added to one UMR indirect mkey. */
+#define MLX5_REGEX_MAX_KLM_NUM 128
+/* The KLM array size for one job. */
+#define MLX5_REGEX_KLMS_SIZE \
+	((MLX5_REGEX_MAX_KLM_NUM) * sizeof(struct mlx5_klm))
+/* In WQE set mode, the pi should be quarter of the MLX5_REGEX_MAX_WQE_INDEX. */
+#define MLX5_REGEX_UMR_SQ_PI_IDX(pi, ops) \
+	(((pi) + (ops)) & (MLX5_REGEX_MAX_WQE_INDEX >> 2))
 
 static inline uint32_t
 sq_size_get(struct mlx5_regex_sq *sq)
@@ -49,6 +58,8 @@ struct mlx5_regex_job {
 	uint64_t user_id;
 	volatile uint8_t *output;
 	volatile uint8_t *metadata;
+	struct mlx5_klm *imkey_array; /* Indirect mkey's KLM array. */
+	struct mlx5_devx_obj *imkey; /* UMR WQE's indirect mkey. */
 } __rte_cached_aligned;
 
 static inline void
@@ -99,12 +110,13 @@ set_wqe_ctrl_seg(struct mlx5_wqe_ctrl_seg *seg, uint16_t pi, uint8_t opcode,
 }
 
 static inline void
-prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
-	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
-	 struct mlx5_regex_job *job)
+__prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
+	   struct rte_regex_ops *op, struct mlx5_regex_job *job,
+	   size_t pi, struct mlx5_klm *klm)
 {
-	size_t wqe_offset = (sq->pi & (sq_size_get(sq) - 1)) * MLX5_SEND_WQE_BB;
-	uint32_t lkey;
+	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
+			    (MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
+			    (priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
 	uint16_t group0 = op->req_flags & RTE_REGEX_OPS_REQ_GROUP_ID0_VALID_F ?
 				op->group_id0 : 0;
 	uint16_t group1 = op->req_flags & RTE_REGEX_OPS_REQ_GROUP_ID1_VALID_F ?
@@ -122,14 +134,11 @@ prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 			       RTE_REGEX_OPS_REQ_GROUP_ID2_VALID_F |
 			       RTE_REGEX_OPS_REQ_GROUP_ID3_VALID_F)))
 		group0 = op->group_id0;
-	lkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
-				  &priv->mr_scache, &qp->mr_ctrl,
-				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
-				  !!(op->mbuf->ol_flags & EXT_ATTACHED_MBUF));
 	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
 	int ds = 4; /*  ctrl + meta + input + output */
 
-	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe, sq->pi,
+	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe,
+			 (priv->has_umr ? (pi * 4 + 3) : pi),
 			 MLX5_OPCODE_MMO, MLX5_OPC_MOD_MMO_REGEX,
 			 sq->sq_obj.sq->id, 0, ds, 0, 0);
 	set_regex_ctrl_seg(wqe + 12, 0, group0, group1, group2, group3,
@@ -137,36 +146,54 @@ prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 	struct mlx5_wqe_data_seg *input_seg =
 		(struct mlx5_wqe_data_seg *)(wqe +
 					     MLX5_REGEX_WQE_GATHER_OFFSET);
-	input_seg->byte_count =
-		rte_cpu_to_be_32(rte_pktmbuf_data_len(op->mbuf));
-	input_seg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(op->mbuf,
-							    uintptr_t));
-	input_seg->lkey = lkey;
+	input_seg->byte_count = rte_cpu_to_be_32(klm->byte_count);
+	input_seg->addr = rte_cpu_to_be_64(klm->address);
+	input_seg->lkey = klm->mkey;
 	job->user_id = op->user_id;
+}
+
+static inline void
+prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
+	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
+	 struct mlx5_regex_job *job)
+{
+	struct mlx5_klm klm;
+
+	klm.byte_count = rte_pktmbuf_data_len(op->mbuf);
+	klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
+				  &priv->mr_scache, &qp->mr_ctrl,
+				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
+				  !!(op->mbuf->ol_flags & EXT_ATTACHED_MBUF));
+	klm.address = rte_pktmbuf_mtod(op->mbuf, uintptr_t);
+	__prep_one(priv, sq, op, job, sq->pi, &klm);
 	sq->db_pi = sq->pi;
 	sq->pi = (sq->pi + 1) & MLX5_REGEX_MAX_WQE_INDEX;
 }
 
 static inline void
-send_doorbell(struct mlx5dv_devx_uar *uar, struct mlx5_regex_sq *sq)
+send_doorbell(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq)
 {
+	struct mlx5dv_devx_uar *uar = priv->uar;
 	size_t wqe_offset = (sq->db_pi & (sq_size_get(sq) - 1)) *
-		MLX5_SEND_WQE_BB;
+		(MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
+		(priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
 	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
-	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
+	/* OR into fm_ce_se rather than assign, so the fence is kept. */
+	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se |= MLX5_WQE_CTRL_CQ_UPDATE;
 	uint64_t *doorbell_addr =
 		(uint64_t *)((uint8_t *)uar->base_addr + 0x800);
 	rte_io_wmb();
-	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((sq->db_pi + 1) &
-						 MLX5_REGEX_MAX_WQE_INDEX);
+	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((priv->has_umr ?
+					(sq->db_pi * 4 + 3) : sq->db_pi) &
+					MLX5_REGEX_MAX_WQE_INDEX);
 	rte_wmb();
 	*doorbell_addr = *(volatile uint64_t *)wqe;
 	rte_wmb();
 }
 
 static inline int
-can_send(struct mlx5_regex_sq *sq) {
-	return ((uint16_t)(sq->pi - sq->ci) < sq_size_get(sq));
+get_free(struct mlx5_regex_sq *sq) {
+	return (sq_size_get(sq) - (uint16_t)(sq->pi - sq->ci));
 }
 
 static inline uint32_t
@@ -174,6 +201,211 @@ job_id_get(uint32_t qid, size_t sq_size, size_t index) {
 	return qid * sq_size + (index & (sq_size - 1));
 }
 
+#ifdef HAVE_MLX5_UMR_IMKEY
+static inline int
+mkey_klm_available(struct mlx5_klm *klm, uint32_t pos, uint32_t new)
+{
+	return (klm && ((pos + new) <= MLX5_REGEX_MAX_KLM_NUM));
+}
+
+static inline void
+complete_umr_wqe(struct mlx5_regex_qp *qp, struct mlx5_regex_sq *sq,
+		 struct mlx5_regex_job *mkey_job,
+		 size_t umr_index, uint32_t klm_size, uint32_t total_len)
+{
+	size_t wqe_offset = (umr_index & (sq_size_get(sq) - 1)) *
+		(MLX5_SEND_WQE_BB * 4);
+	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg *)((uint8_t *)
+				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
+	struct mlx5_wqe_umr_ctrl_seg *ucseg =
+				(struct mlx5_wqe_umr_ctrl_seg *)(wqe + 1);
+	struct mlx5_wqe_mkey_context_seg *mkc =
+				(struct mlx5_wqe_mkey_context_seg *)(ucseg + 1);
+	struct mlx5_klm *iklm = (struct mlx5_klm *)(mkc + 1);
+	uint16_t klm_align = RTE_ALIGN(klm_size, 4);
+
+	memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
+	/* Set WQE control seg. Non-inline KLM UMR WQE size must be 9 WQE_DS. */
+	set_wqe_ctrl_seg(wqe, (umr_index * 4), MLX5_OPCODE_UMR,
+			 0, sq->sq_obj.sq->id, 0, 9, 0,
+			 rte_cpu_to_be_32(mkey_job->imkey->id));
+	/* Set UMR WQE control seg. */
+	ucseg->mkey_mask |= rte_cpu_to_be_64(MLX5_WQE_UMR_CTRL_MKEY_MASK_LEN |
+				MLX5_WQE_UMR_CTRL_FLAG_TRNSLATION_OFFSET |
+				MLX5_WQE_UMR_CTRL_MKEY_MASK_ACCESS_LOCAL_WRITE);
+	ucseg->klm_octowords = rte_cpu_to_be_16(klm_align);
+	/* Set mkey context seg. */
+	mkc->len = rte_cpu_to_be_64(total_len);
+	mkc->qpn_mkey = rte_cpu_to_be_32(0xffffff00 |
+					(mkey_job->imkey->id & 0xff));
+	/* Set UMR pointer to data seg. */
+	iklm->address = rte_cpu_to_be_64
+				((uintptr_t)((char *)mkey_job->imkey_array));
+	iklm->mkey = rte_cpu_to_be_32(qp->imkey_addr->lkey);
+	iklm->byte_count = rte_cpu_to_be_32(klm_align);
+	/* Clear the padding memory. */
+	memset((uint8_t *)&mkey_job->imkey_array[klm_size], 0,
+	       sizeof(struct mlx5_klm) * (klm_align - klm_size));
+
+	/* Add the following RegEx WQE with fence. */
+	wqe = (struct mlx5_wqe_ctrl_seg *)
+				(((uint8_t *)wqe) + MLX5_REGEX_UMR_WQE_SIZE);
+	wqe->fm_ce_se |= MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE;
+}
+
+static inline void
+prep_nop_regex_wqe_set(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
+		       struct rte_regex_ops *op, struct mlx5_regex_job *job,
+		       size_t pi, struct mlx5_klm *klm)
+{
+	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
+			    (MLX5_SEND_WQE_BB << 2);
+	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg *)((uint8_t *)
+				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
+
+	/* Clear the WQE memory used as UMR WQE previously. */
+	if ((rte_be_to_cpu_32(wqe->opmod_idx_opcode) & 0xff) != MLX5_OPCODE_NOP)
+		memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
+	/* UMR WQE size is 9 DS, align nop WQE to 3 WQEBBS(12 DS). */
+	set_wqe_ctrl_seg(wqe, pi * 4, MLX5_OPCODE_NOP, 0, sq->sq_obj.sq->id,
+			 0, 12, 0, 0);
+	__prep_one(priv, sq, op, job, pi, klm);
+}
+
+static inline void
+prep_regex_umr_wqe_set(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
+	 struct mlx5_regex_sq *sq, struct rte_regex_ops **op, size_t nb_ops)
+{
+	struct mlx5_regex_job *job = NULL;
+	size_t sqid = sq->sqn, mkey_job_id = 0;
+	size_t left_ops = nb_ops;
+	uint32_t klm_num = 0, len;
+	struct mlx5_klm *mkey_klm = NULL;
+	struct mlx5_klm klm;
+
+	sqid = sq->sqn;
+	while (left_ops--)
+		rte_prefetch0(op[left_ops]);
+	left_ops = nb_ops;
+	/*
+	 * Build the WQE sets in reverse order. The burst may consume
+	 * multiple mkeys, and building the WQE sets in forward order
+	 * makes it hard to address the last mkey index, since the last
+	 * RegEx WQE's index is only known once building finishes.
+	 */
+	while (left_ops--) {
+		struct rte_mbuf *mbuf = op[left_ops]->mbuf;
+		size_t pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, left_ops);
+
+		if (mbuf->nb_segs > 1) {
+			size_t scatter_size = 0;
+
+			if (!mkey_klm_available(mkey_klm, klm_num,
+						mbuf->nb_segs)) {
+				/*
+				 * The mkey's KLM is full, create the UMR
+				 * WQE in the next WQE set.
+				 */
+				if (mkey_klm)
+					complete_umr_wqe(qp, sq,
+						&qp->jobs[mkey_job_id],
+						MLX5_REGEX_UMR_SQ_PI_IDX(pi, 1),
+						klm_num, len);
+				/*
+				 * Get the indirect mkey and KLM array index
+				 * from the last WQE set.
+				 */
+				mkey_job_id = job_id_get(sqid,
+							 sq_size_get(sq), pi);
+				mkey_klm = qp->jobs[mkey_job_id].imkey_array;
+				klm_num = 0;
+				len = 0;
+			}
+			/* Build RegEx WQE's data segment KLM. */
+			klm.address = len;
+			klm.mkey = rte_cpu_to_be_32
+					(qp->jobs[mkey_job_id].imkey->id);
+			while (mbuf) {
+				/* Build indirect mkey seg's KLM. */
+				mkey_klm->mkey = mlx5_mr_addr2mr_bh(priv->pd,
+					NULL, &priv->mr_scache, &qp->mr_ctrl,
+					rte_pktmbuf_mtod(mbuf, uintptr_t),
+					!!(mbuf->ol_flags & EXT_ATTACHED_MBUF));
+				mkey_klm->address = rte_cpu_to_be_64
+					(rte_pktmbuf_mtod(mbuf, uintptr_t));
+				mkey_klm->byte_count = rte_cpu_to_be_32
+						(rte_pktmbuf_data_len(mbuf));
+				/*
+				 * Save the mbuf's total size for RegEx data
+				 * segment.
+				 */
+				scatter_size += rte_pktmbuf_data_len(mbuf);
+				mkey_klm++;
+				klm_num++;
+				mbuf = mbuf->next;
+			}
+			len += scatter_size;
+			klm.byte_count = scatter_size;
+		} else {
+			/* The single mbuf case. Build the KLM directly. */
+			klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, NULL,
+					&priv->mr_scache, &qp->mr_ctrl,
+					rte_pktmbuf_mtod(mbuf, uintptr_t),
+					!!(mbuf->ol_flags & EXT_ATTACHED_MBUF));
+			klm.address = rte_pktmbuf_mtod(mbuf, uintptr_t);
+			klm.byte_count = rte_pktmbuf_data_len(mbuf);
+		}
+		job = &qp->jobs[job_id_get(sqid, sq_size_get(sq), pi)];
+		/*
+		 * Build the nop + RegEx WQE set by default. The first nop WQE
+		 * will be updated later as UMR WQE if scattered mbufs exist.
+		 */
+		prep_nop_regex_wqe_set(priv, sq, op[left_ops], job, pi, &klm);
+	}
+	/*
+	 * Scattered mbuf have been added to the KLM array. Complete the build
+	 * of UMR WQE, update the first nop WQE as UMR WQE.
+	 */
+	if (mkey_klm)
+		complete_umr_wqe(qp, sq, &qp->jobs[mkey_job_id], sq->pi,
+				 klm_num, len);
+	sq->db_pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops - 1);
+	sq->pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops);
+}
+
+uint16_t
+mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
+			  struct rte_regex_ops **ops, uint16_t nb_ops)
+{
+	struct mlx5_regex_priv *priv = dev->data->dev_private;
+	struct mlx5_regex_qp *queue = &priv->qps[qp_id];
+	struct mlx5_regex_sq *sq;
+	size_t sqid, nb_left = nb_ops, nb_desc;
+
+	while ((sqid = ffs(queue->free_sqs))) {
+		sqid--; /* ffs returns 1 for bit 0 */
+		sq = &queue->sqs[sqid];
+		nb_desc = get_free(sq);
+		if (nb_desc) {
+			/* The ops handled can't exceed nb_ops. */
+			if (nb_desc > nb_left)
+				nb_desc = nb_left;
+			else
+				queue->free_sqs &= ~(1 << sqid);
+			prep_regex_umr_wqe_set(priv, queue, sq, ops, nb_desc);
+			send_doorbell(priv, sq);
+			nb_left -= nb_desc;
+		}
+		if (!nb_left)
+			break;
+		ops += nb_desc;
+	}
+	nb_ops -= nb_left;
+	queue->pi += nb_ops;
+	return nb_ops;
+}
+#endif
+
 uint16_t
 mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 		      struct rte_regex_ops **ops, uint16_t nb_ops)
@@ -186,17 +418,17 @@ mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 	while ((sqid = ffs(queue->free_sqs))) {
 		sqid--; /* ffs returns 1 for bit 0 */
 		sq = &queue->sqs[sqid];
-		while (can_send(sq)) {
+		while (get_free(sq)) {
 			job_id = job_id_get(sqid, sq_size_get(sq), sq->pi);
 			prep_one(priv, queue, sq, ops[i], &queue->jobs[job_id]);
 			i++;
 			if (unlikely(i == nb_ops)) {
-				send_doorbell(priv->uar, sq);
+				send_doorbell(priv, sq);
 				goto out;
 			}
 		}
 		queue->free_sqs &= ~(1 << sqid);
-		send_doorbell(priv->uar, sq);
+		send_doorbell(priv, sq);
 	}
 
 out:
@@ -308,6 +540,10 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 			  MLX5_REGEX_MAX_WQE_INDEX;
 		size_t sqid = cqe->rsvd3[2];
 		struct mlx5_regex_sq *sq = &queue->sqs[sqid];
+
+		/* In UMR mode the WQE counter moves per WQE set (4 WQEBBs). */
+		if (priv->has_umr)
+			wq_counter >>= 2;
 		while (sq->ci != wq_counter) {
 			if (unlikely(i == nb_ops)) {
 				/* Return without updating cq->ci */
@@ -316,7 +552,9 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 			uint32_t job_id = job_id_get(sqid, sq_size_get(sq),
 						     sq->ci);
 			extract_result(ops[i], &queue->jobs[job_id]);
-			sq->ci = (sq->ci + 1) & MLX5_REGEX_MAX_WQE_INDEX;
+			sq->ci = (sq->ci + 1) & (priv->has_umr ?
+				 (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
+				  MLX5_REGEX_MAX_WQE_INDEX);
 			i++;
 		}
 		cq->ci = (cq->ci + 1) & 0xffffff;
@@ -331,7 +569,7 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 }
 
 static void
-setup_sqs(struct mlx5_regex_qp *queue)
+setup_sqs(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *queue)
 {
 	size_t sqid, entry;
 	uint32_t job_id;
@@ -342,6 +580,14 @@ setup_sqs(struct mlx5_regex_qp *queue)
 			job_id = sqid * sq_size_get(sq) + entry;
 			struct mlx5_regex_job *job = &queue->jobs[job_id];
 
+			/* Fill UMR WQE with NOP in advance. */
+			if (priv->has_umr) {
+				set_wqe_ctrl_seg
+					((struct mlx5_wqe_ctrl_seg *)wqe,
+					 entry * 2, MLX5_OPCODE_NOP, 0,
+					 sq->sq_obj.sq->id, 0, 12, 0, 0);
+				wqe += MLX5_REGEX_UMR_WQE_SIZE;
+			}
 			set_metadata_seg((struct mlx5_wqe_metadata_seg *)
 					 (wqe + MLX5_REGEX_WQE_METADATA_OFFSET),
 					 0, queue->metadata->lkey,
@@ -358,8 +604,9 @@ setup_sqs(struct mlx5_regex_qp *queue)
 }
 
 static int
-setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
+setup_buffers(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp)
 {
+	struct ibv_pd *pd = priv->pd;
 	uint32_t i;
 	int err;
 
@@ -395,6 +642,24 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
 		goto err_output;
 	}
 
+	if (priv->has_umr) {
+		ptr = rte_calloc(__func__, qp->nb_desc, MLX5_REGEX_KLMS_SIZE,
+				 MLX5_REGEX_KLMS_SIZE);
+		if (!ptr) {
+			err = -ENOMEM;
+			goto err_imkey;
+		}
+		qp->imkey_addr = mlx5_glue->reg_mr(pd, ptr,
+					MLX5_REGEX_KLMS_SIZE * qp->nb_desc,
+					IBV_ACCESS_LOCAL_WRITE);
+		if (!qp->imkey_addr) {
+			rte_free(ptr);
+			DRV_LOG(ERR, "Failed to register output");
+			err = -EINVAL;
+			goto err_imkey;
+		}
+	}
+
 	/* distribute buffers to jobs */
 	for (i = 0; i < qp->nb_desc; i++) {
 		qp->jobs[i].output =
@@ -403,9 +668,18 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
 		qp->jobs[i].metadata =
 			(uint8_t *)qp->metadata->addr +
 			(i % qp->nb_desc) * MLX5_REGEX_METADATA_SIZE;
+		if (qp->imkey_addr)
+			qp->jobs[i].imkey_array = (struct mlx5_klm *)
+				qp->imkey_addr->addr +
+				(i % qp->nb_desc) * MLX5_REGEX_MAX_KLM_NUM;
 	}
+
 	return 0;
 
+err_imkey:
+	ptr = qp->outputs->addr;
+	rte_free(ptr);
+	mlx5_glue->dereg_mr(qp->outputs);
 err_output:
 	ptr = qp->metadata->addr;
 	rte_free(ptr);
@@ -417,23 +691,57 @@ int
 mlx5_regexdev_setup_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
 {
 	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
-	int err;
+	struct mlx5_klm klm = { 0 };
+	struct mlx5_devx_mkey_attr attr = {
+		.klm_array = &klm,
+		.klm_num = 1,
+		.umr_en = 1,
+	};
+	uint32_t i;
+	int err = 0;
 
 	qp->jobs = rte_calloc(__func__, qp->nb_desc, sizeof(*qp->jobs), 64);
 	if (!qp->jobs)
 		return -ENOMEM;
-	err = setup_buffers(qp, priv->pd);
+	err = setup_buffers(priv, qp);
 	if (err) {
 		rte_free(qp->jobs);
 		return err;
 	}
-	setup_sqs(qp);
-	return 0;
+
+	setup_sqs(priv, qp);
+
+	if (priv->has_umr) {
+#ifdef HAVE_IBV_FLOW_DV_SUPPORT
+		if (regex_get_pdn(priv->pd, &attr.pd)) {
+			err = -rte_errno;
+			DRV_LOG(ERR, "Failed to get pdn.");
+			mlx5_regexdev_teardown_fastpath(priv, qp_id);
+			return err;
+		}
+#endif
+		for (i = 0; i < qp->nb_desc; i++) {
+			attr.klm_num = MLX5_REGEX_MAX_KLM_NUM;
+			attr.klm_array = qp->jobs[i].imkey_array;
+			qp->jobs[i].imkey = mlx5_devx_cmd_mkey_create(priv->ctx,
+								      &attr);
+			if (!qp->jobs[i].imkey) {
+				err = -rte_errno;
+				DRV_LOG(ERR, "Failed to allocate imkey.");
+				mlx5_regexdev_teardown_fastpath(priv, qp_id);
+			}
+		}
+	}
+	return err;
 }
 
 static void
 free_buffers(struct mlx5_regex_qp *qp)
 {
+	if (qp->imkey_addr) {
+		mlx5_glue->dereg_mr(qp->imkey_addr);
+		rte_free(qp->imkey_addr->addr);
+	}
 	if (qp->metadata) {
 		mlx5_glue->dereg_mr(qp->metadata);
 		rte_free(qp->metadata->addr);
@@ -448,8 +756,14 @@ void
 mlx5_regexdev_teardown_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
 {
 	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
+	uint32_t i;
 
 	if (qp) {
+		for (i = 0; i < qp->nb_desc; i++) {
+			if (qp->jobs[i].imkey)
+				claim_zero(mlx5_devx_cmd_destroy
+							(qp->jobs[i].imkey));
+		}
 		free_buffers(qp);
 		if (qp->jobs)
 			rte_free(qp->jobs);
-- 
2.25.1



* [dpdk-dev] [PATCH v3 3/4] app/test-regex: support scattered mbuf input
  2021-03-30  1:39 ` [dpdk-dev] [PATCH v3 0/4] regex/mlx5: support scattered mbuf Suanming Mou
  2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 1/4] common/mlx5: add user memory registration bits Suanming Mou
  2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 2/4] regex/mlx5: add data path scattered mbuf process Suanming Mou
@ 2021-03-30  1:39   ` Suanming Mou
  2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode Suanming Mou
  3 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-30  1:39 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

This commit adds scattered mbuf input support.

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
---
 app/test-regex/main.c          | 134 +++++++++++++++++++++++++++------
 doc/guides/tools/testregex.rst |   3 +
 2 files changed, 112 insertions(+), 25 deletions(-)

diff --git a/app/test-regex/main.c b/app/test-regex/main.c
index aea4fa6b88..82cffaacfa 100644
--- a/app/test-regex/main.c
+++ b/app/test-regex/main.c
@@ -35,6 +35,7 @@ enum app_args {
 	ARG_NUM_OF_ITERATIONS,
 	ARG_NUM_OF_QPS,
 	ARG_NUM_OF_LCORES,
+	ARG_NUM_OF_MBUF_SEGS,
 };
 
 struct job_ctx {
@@ -70,6 +71,7 @@ struct regex_conf {
 	char *data_buf;
 	long data_len;
 	long job_len;
+	uint32_t nb_segs;
 };
 
 static void
@@ -82,14 +84,15 @@ usage(const char *prog_name)
 		" --perf N: only outputs the performance data\n"
 		" --nb_iter N: number of iteration to run\n"
 		" --nb_qps N: number of queues to use\n"
-		" --nb_lcores N: number of lcores to use\n",
+		" --nb_lcores N: number of lcores to use\n"
+		" --nb_segs N: number of mbuf segments\n",
 		prog_name);
 }
 
 static void
 args_parse(int argc, char **argv, char *rules_file, char *data_file,
 	   uint32_t *nb_jobs, bool *perf_mode, uint32_t *nb_iterations,
-	   uint32_t *nb_qps, uint32_t *nb_lcores)
+	   uint32_t *nb_qps, uint32_t *nb_lcores, uint32_t *nb_segs)
 {
 	char **argvopt;
 	int opt;
@@ -111,6 +114,8 @@ args_parse(int argc, char **argv, char *rules_file, char *data_file,
 		{ "nb_qps", 1, 0, ARG_NUM_OF_QPS},
 		/* Number of lcores. */
 		{ "nb_lcores", 1, 0, ARG_NUM_OF_LCORES},
+		/* Number of mbuf segments. */
+		{ "nb_segs", 1, 0, ARG_NUM_OF_MBUF_SEGS},
 		/* End of options */
 		{ 0, 0, 0, 0 }
 	};
@@ -150,6 +155,9 @@ args_parse(int argc, char **argv, char *rules_file, char *data_file,
 		case ARG_NUM_OF_LCORES:
 			*nb_lcores = atoi(optarg);
 			break;
+		case ARG_NUM_OF_MBUF_SEGS:
+			*nb_segs = atoi(optarg);
+			break;
 		case ARG_HELP:
 			usage("RegEx test app");
 			break;
@@ -302,11 +310,75 @@ extbuf_free_cb(void *addr __rte_unused, void *fcb_opaque __rte_unused)
 {
 }
 
+static inline struct rte_mbuf *
+regex_create_segmented_mbuf(struct rte_mempool *mbuf_pool, int pkt_len,
+		int nb_segs, void *buf) {
+
+	struct rte_mbuf *m = NULL, *mbuf = NULL;
+	uint8_t *dst;
+	char *src = buf;
+	int data_len = 0;
+	int i, size;
+	int t_len;
+
+	if (pkt_len < 1) {
+		printf("Packet size must be 1 or more (is %d)\n", pkt_len);
+		return NULL;
+	}
+
+	if (nb_segs < 1) {
+		printf("Number of segments must be 1 or more (is %d)\n",
+				nb_segs);
+		return NULL;
+	}
+
+	t_len = pkt_len >= nb_segs ? (pkt_len / nb_segs +
+				     !!(pkt_len % nb_segs)) : 1;
+	size = pkt_len;
+
+	/* Create chained mbuf_src and fill it with buf data */
+	for (i = 0; size > 0; i++) {
+
+		m = rte_pktmbuf_alloc(mbuf_pool);
+		if (i == 0)
+			mbuf = m;
+
+		if (m == NULL) {
+			printf("Cannot create segment for source mbuf\n");
+			goto fail;
+		}
+
+		data_len = size > t_len ? t_len : size;
+		memset(rte_pktmbuf_mtod(m, uint8_t *), 0,
+				rte_pktmbuf_tailroom(m));
+		memcpy(rte_pktmbuf_mtod(m, uint8_t *), src, data_len);
+		dst = (uint8_t *)rte_pktmbuf_append(m, data_len);
+		if (dst == NULL) {
+			printf("Cannot append %d bytes to the mbuf\n",
+					data_len);
+			goto fail;
+		}
+
+		if (mbuf != m)
+			rte_pktmbuf_chain(mbuf, m);
+		src += data_len;
+		size -= data_len;
+
+	}
+	return mbuf;
+
+fail:
+	if (mbuf)
+		rte_pktmbuf_free(mbuf);
+	return NULL;
+}
+
 static int
 run_regex(void *args)
 {
 	struct regex_conf *rgxc = args;
 	uint32_t nb_jobs = rgxc->nb_jobs;
+	uint32_t nb_segs = rgxc->nb_segs;
 	uint32_t nb_iterations = rgxc->nb_iterations;
 	uint8_t nb_max_matches = rgxc->nb_max_matches;
 	uint32_t nb_qps = rgxc->nb_qps;
@@ -338,8 +410,12 @@ run_regex(void *args)
 	snprintf(mbuf_pool,
 		 sizeof(mbuf_pool),
 		 "mbuf_pool_%2u", qp_id_base);
-	mbuf_mp = rte_pktmbuf_pool_create(mbuf_pool, nb_jobs * nb_qps, 0,
-			0, MBUF_SIZE, rte_socket_id());
+	mbuf_mp = rte_pktmbuf_pool_create(mbuf_pool,
+			rte_align32pow2(nb_jobs * nb_qps * nb_segs),
+			0, 0, (nb_segs == 1) ? MBUF_SIZE :
+			(rte_align32pow2(job_len) / nb_segs +
+			RTE_PKTMBUF_HEADROOM),
+			rte_socket_id());
 	if (mbuf_mp == NULL) {
 		printf("Error, can't create memory pool\n");
 		return -ENOMEM;
@@ -375,8 +451,19 @@ run_regex(void *args)
 			goto end;
 		}
 
+		if (clone_buf(data_buf, &buf, data_len)) {
+			printf("Error, can't clone buf.\n");
+			res = -EXIT_FAILURE;
+			goto end;
+		}
+
+		/* Assign each mbuf with the data to handle. */
+		actual_jobs = 0;
+		pos = 0;
 		/* Allocate the jobs and assign each job with an mbuf. */
-		for (i = 0; i < nb_jobs; i++) {
+		for (i = 0; (pos < data_len) && (i < nb_jobs) ; i++) {
+			long act_job_len = RTE_MIN(job_len, data_len - pos);
+
 			ops[i] = rte_malloc(NULL, sizeof(*ops[0]) +
 					nb_max_matches *
 					sizeof(struct rte_regexdev_match), 0);
@@ -386,30 +473,26 @@ run_regex(void *args)
 				res = -ENOMEM;
 				goto end;
 			}
-			ops[i]->mbuf = rte_pktmbuf_alloc(mbuf_mp);
+			if (nb_segs > 1) {
+				ops[i]->mbuf = regex_create_segmented_mbuf
+							(mbuf_mp, act_job_len,
+							 nb_segs, &buf[pos]);
+			} else {
+				ops[i]->mbuf = rte_pktmbuf_alloc(mbuf_mp);
+				if (ops[i]->mbuf) {
+					rte_pktmbuf_attach_extbuf(ops[i]->mbuf,
+					&buf[pos], 0, act_job_len, &shinfo);
+					ops[i]->mbuf->data_len = job_len;
+					ops[i]->mbuf->pkt_len = act_job_len;
+				}
+			}
 			if (!ops[i]->mbuf) {
-				printf("Error, can't attach mbuf.\n");
+				printf("Error, can't add mbuf.\n");
 				res = -ENOMEM;
 				goto end;
 			}
-		}
 
-		if (clone_buf(data_buf, &buf, data_len)) {
-			printf("Error, can't clone buf.\n");
-			res = -EXIT_FAILURE;
-			goto end;
-		}
-
-		/* Assign each mbuf with the data to handle. */
-		actual_jobs = 0;
-		pos = 0;
-		for (i = 0; (pos < data_len) && (i < nb_jobs) ; i++) {
-			long act_job_len = RTE_MIN(job_len, data_len - pos);
-			rte_pktmbuf_attach_extbuf(ops[i]->mbuf, &buf[pos], 0,
-					act_job_len, &shinfo);
 			jobs_ctx[i].mbuf = ops[i]->mbuf;
-			ops[i]->mbuf->data_len = job_len;
-			ops[i]->mbuf->pkt_len = act_job_len;
 			ops[i]->user_id = i;
 			ops[i]->group_id0 = 1;
 			pos += act_job_len;
@@ -612,7 +695,7 @@ main(int argc, char **argv)
 	char *data_buf;
 	long data_len;
 	long job_len;
-	uint32_t nb_lcores = 1;
+	uint32_t nb_lcores = 1, nb_segs = 1;
 	struct regex_conf *rgxc;
 	uint32_t i;
 	struct qps_per_lcore *qps_per_lcore;
@@ -626,7 +709,7 @@ main(int argc, char **argv)
 	if (argc > 1)
 		args_parse(argc, argv, rules_file, data_file, &nb_jobs,
 				&perf_mode, &nb_iterations, &nb_qps,
-				&nb_lcores);
+				&nb_lcores, &nb_segs);
 
 	if (nb_qps == 0)
 		rte_exit(EXIT_FAILURE, "Number of QPs must be greater than 0\n");
@@ -656,6 +739,7 @@ main(int argc, char **argv)
 	for (i = 0; i < nb_lcores; i++) {
 		rgxc[i] = (struct regex_conf){
 			.nb_jobs = nb_jobs,
+			.nb_segs = nb_segs,
 			.perf_mode = perf_mode,
 			.nb_iterations = nb_iterations,
 			.nb_max_matches = nb_max_matches,
diff --git a/doc/guides/tools/testregex.rst b/doc/guides/tools/testregex.rst
index a59acd919f..cdb1ffd6ee 100644
--- a/doc/guides/tools/testregex.rst
+++ b/doc/guides/tools/testregex.rst
@@ -68,6 +68,9 @@ Application Options
 ``--nb_iter N``
   number of iteration to run
 
+``--nb_segs N``
+  number of mbuf segments
+
 ``--help``
   print application options
 
-- 
2.25.1



* [dpdk-dev] [PATCH v3 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode
  2021-03-30  1:39 ` [dpdk-dev] [PATCH v3 0/4] regex/mlx5: support scattered mbuf Suanming Mou
                     ` (2 preceding siblings ...)
  2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 3/4] app/test-regex: support scattered mbuf input Suanming Mou
@ 2021-03-30  1:39   ` Suanming Mou
  2021-04-06 16:22     ` Thomas Monjalon
  3 siblings, 1 reply; 36+ messages in thread
From: Suanming Mou @ 2021-03-30  1:39 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland, John Hurley

From: John Hurley <jhurley@nvidia.com>

A recent change added scattered mbuf and UMR support to the regex driver.
Part of that change makes the pi and ci counters of the regex_sq, in UMR
mode, a quarter of their non-UMR length, effectively moving them from 16
bits to 14. The new get_free method casts the difference between pi and ci
to a 16 bit value when calculating the free send queues, accounting for any
wrapping when pi has looped back to 0 but ci has not yet done so. However,
casting to 16 bits while the counters wrap at 14 can now lead to corrupted,
large values being returned.

Modify the get_free function to take in the has_umr flag and, accordingly,
account for wrapping on either 14 or 16 bit pi/ci difference.
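
The wrap-around problem can be reproduced outside the driver with the
small sketch below. It is an illustration only: free_slots() and
free_slots_cast16() are hypothetical stand-ins for the new and old
get_free(), the counters are passed as plain integers, and
MAX_WQE_INDEX is assumed to be 0xffff (a 16 bit WQE index).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the WQE index is 16 bits wide in the non-UMR layout. */
#define MAX_WQE_INDEX 0xffffu

/*
 * Free entries in an SQ of sq_size entries. pi and ci wrap at 16 bits
 * without UMR and at 14 bits with UMR, so the difference is masked
 * with the matching width.
 */
static inline uint32_t
free_slots(uint32_t sq_size, uint32_t pi, uint32_t ci, int has_umr)
{
	uint32_t mask = has_umr ? (MAX_WQE_INDEX >> 2) : MAX_WQE_INDEX;

	assert(sq_size <= mask + 1);
	return sq_size - ((pi - ci) & mask);
}

/* Behaviour before the fix: always truncate the difference to 16 bits. */
static inline uint32_t
free_slots_cast16(uint32_t sq_size, uint32_t pi, uint32_t ci)
{
	return sq_size - (uint16_t)(pi - ci);
}
```

With a 64 entry SQ in UMR mode, ci = 0x3ffe and pi wrapped to 1, three
entries are in flight. The masked version reports 61 free entries,
while the 16 bit cast sees a difference of 0xc003 and the subtraction
underflows.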

Fixes: d55c9f637263 ("regex/mlx5: add data path scattered mbuf process")
Signed-off-by: John Hurley <jhurley@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
---
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index 4f9402c583..b57e7d7794 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -192,8 +192,10 @@ send_doorbell(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq)
 }
 
 static inline int
-get_free(struct mlx5_regex_sq *sq) {
-	return (sq_size_get(sq) - (uint16_t)(sq->pi - sq->ci));
+get_free(struct mlx5_regex_sq *sq, uint8_t has_umr) {
+	return (sq_size_get(sq) - ((sq->pi - sq->ci) &
+			(has_umr ? (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
+			MLX5_REGEX_MAX_WQE_INDEX)));
 }
 
 static inline uint32_t
@@ -385,7 +387,7 @@ mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
 	while ((sqid = ffs(queue->free_sqs))) {
 		sqid--; /* ffs returns 1 for bit 0 */
 		sq = &queue->sqs[sqid];
-		nb_desc = get_free(sq);
+		nb_desc = get_free(sq, priv->has_umr);
 		if (nb_desc) {
 			/* The ops be handled can't exceed nb_ops. */
 			if (nb_desc > nb_left)
@@ -418,7 +420,7 @@ mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 	while ((sqid = ffs(queue->free_sqs))) {
 		sqid--; /* ffs returns 1 for bit 0 */
 		sq = &queue->sqs[sqid];
-		while (get_free(sq)) {
+		while (get_free(sq, priv->has_umr)) {
 			job_id = job_id_get(sqid, sq_size_get(sq), sq->pi);
 			prep_one(priv, queue, sq, ops[i], &queue->jobs[job_id]);
 			i++;
-- 
2.25.1



* Re: [dpdk-dev] [PATCH v3 2/4] regex/mlx5: add data path scattered mbuf process
  2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 2/4] regex/mlx5: add data path scattered mbuf process Suanming Mou
@ 2021-03-30  8:05     ` Slava Ovsiienko
  2021-03-30  9:00       ` Suanming Mou
  0 siblings, 1 reply; 36+ messages in thread
From: Slava Ovsiienko @ 2021-03-30  8:05 UTC (permalink / raw)
  To: Suanming Mou, Ori Kam; +Cc: dev, Matan Azrad, Raslan Darawsheh

> -----Original Message-----
> From: Suanming Mou <suanmingm@nvidia.com>
> Sent: Tuesday, March 30, 2021 4:39
> To: Ori Kam <orika@nvidia.com>
> Cc: dev@dpdk.org; Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Raslan Darawsheh <rasland@nvidia.com>
> Subject: [PATCH v3 2/4] regex/mlx5: add data path scattered mbuf process
> 
Nice feature, but I would fix the typos and reword a bit:

> UMR WQE can convert multiple mkey's memory sapce to contiguous space.
Typo: "sapce?"

And rather than "convert mkey", I would say "present data buffers scattered
within multiple mbufs with a single indirect mkey".


> Take advantage of the UMR WQE, scattered mbuf in one operation can be
> converted to an indirect mkey. The RegEx which only accepts one mkey can
> now process the whole scattered mbuf.
I would add "in one operation."

> 
> The maximum scattered mbuf can be supported in one UMR WQE is now
> defined as 64. Multiple operations scattered mbufs can be add to one UMR
Typos: "THE multiple", "added"

I would reword - "The mbufs from multiple operations can be combined into
one UMR." Also, I would add a few words about what UMR is.

> WQE if there is enough space in the KLM array, since the operations can
> address their own mbuf's content by the mkey's address and length.
> However, one operation's scattered mbuf's can't be placed in two different
> UMR WQE's KLM array, if the UMR WQE's KLM does not has enough free
> space for one operation, a new UMR WQE will be required.
I would say "the extra UMR WQE will be engaged"
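
To make the packing rule above concrete, here is a rough sketch of how
many UMR WQEs a burst would need. It is not driver code:
count_umr_wqes() is a hypothetical helper, MAX_KLM_NUM mirrors the
MLX5_REGEX_MAX_KLM_NUM value (128) from the quoted patch, and the
per-operation limit asserted below is my assumption. Operations are
walked in reverse, as in the patch, and one operation's segments are
never split across two KLM arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors MLX5_REGEX_MAX_KLM_NUM from the quoted patch. */
#define MAX_KLM_NUM 128

/*
 * Count the UMR WQEs needed for a burst. nb_segs[i] is the number of
 * mbuf segments of operation i.
 */
static inline uint32_t
count_umr_wqes(const uint16_t *nb_segs, size_t nb_ops)
{
	uint32_t klm_num = 0, umr_wqes = 0;
	int have_klm = 0;
	size_t left = nb_ops;

	while (left--) {
		uint16_t segs = nb_segs[left];

		/* Assumption: one operation always fits in one KLM array. */
		assert(segs >= 1 && segs <= MAX_KLM_NUM);
		if (segs == 1)
			continue; /* Single segment: mbuf's own mkey is used. */
		if (!have_klm || klm_num + segs > MAX_KLM_NUM) {
			/* Not enough room: start a new KLM array and UMR WQE. */
			umr_wqes++;
			klm_num = 0;
			have_klm = 1;
		}
		klm_num += segs;
	}
	return umr_wqes;
}
```

For a burst with segment counts {64, 64, 1, 64}, the last two
scattered operations share one KLM array and the first one needs an
extra UMR WQE, so two UMR WQEs are used in total.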

> 
> In case the UMR WQE's indirect mkey will be over wrapped by the SQ's WQE
> move, the meky's index used by the UMR WQE should be the index of last
typo: "meky"

> the RegEX WQE in the operations. As one operation consumes one WQE set,
> build the RegEx WQE by reverse helps address the mkey more efficiently.
typo: TO address

With best regards,
Slava

> Once the operations in one burst consumes multiple mkeys, when the mkey
> KLM array is full, the reverse WQE set index will always be the last of the new
> mkey's for the new UMR WQE.
> 
> In GGA mode, the SQ WQE's memory layout becomes UMR/NOP and RegEx
> WQE by interleave. The UMR and RegEx WQE can be called as WQE set. The
> SQ's pi and ci will also be increased as WQE set not as WQE.
> 
> For operations don't have scattered mbuf, uses the mbuf's mkey directly,
> the WQE set combination is NOP + RegEx.
> For operations have scattered mubf but share the UMR WQE with others,
> the WQE set combination is NOP + RegEx.
> For operations complete the UMR WQE, the WQE set combination is UMR +
> RegEx.
> 
> Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
> Acked-by: Ori Kam <orika@nvidia.com>
> ---
>  doc/guides/regexdevs/mlx5.rst            |   5 +
>  doc/guides/rel_notes/release_21_05.rst   |   4 +
>  drivers/regex/mlx5/mlx5_regex.c          |   9 +
>  drivers/regex/mlx5/mlx5_regex.h          |  26 +-
>  drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
>  drivers/regex/mlx5/mlx5_regex_fastpath.c | 378 +++++++++++++++++++++--
>  6 files changed, 407 insertions(+), 58 deletions(-)
> 
> diff --git a/doc/guides/regexdevs/mlx5.rst b/doc/guides/regexdevs/mlx5.rst
> index faaa6ac11d..45a0b96980 100644
> --- a/doc/guides/regexdevs/mlx5.rst
> +++ b/doc/guides/regexdevs/mlx5.rst
> @@ -35,6 +35,11 @@ be specified as device parameter. The RegEx device
> can be probed and used with  other Mellanox devices, by adding more
> options in the class.
>  For example: ``class=net:regex`` will probe both the net PMD and the RegEx
> PMD.
> 
> +Features
> +--------
> +
> +- Multi segments mbuf support.
> +
>  Supported NICs
>  --------------
> 
> diff --git a/doc/guides/rel_notes/release_21_05.rst
> b/doc/guides/rel_notes/release_21_05.rst
> index 3c76148b11..c3d6b8e8ae 100644
> --- a/doc/guides/rel_notes/release_21_05.rst
> +++ b/doc/guides/rel_notes/release_21_05.rst
> @@ -119,6 +119,10 @@ New Features
>    * Added command to display Rx queue used descriptor count.
>      ``show port (port_id) rxq (queue_id) desc used count``
> 
> +* **Updated Mellanox RegEx PMD.**
> +
> +  * Added support for multi segments mbuf.
> +
> 
>  Removed Items
>  -------------
> diff --git a/drivers/regex/mlx5/mlx5_regex.c
> b/drivers/regex/mlx5/mlx5_regex.c index ac5b205fa9..82c485e50c 100644
> --- a/drivers/regex/mlx5/mlx5_regex.c
> +++ b/drivers/regex/mlx5/mlx5_regex.c
> @@ -199,6 +199,13 @@ mlx5_regex_pci_probe(struct rte_pci_driver
> *pci_drv __rte_unused,
>  	}
>  	priv->regexdev->dev_ops = &mlx5_regexdev_ops;
>  	priv->regexdev->enqueue = mlx5_regexdev_enqueue;
> +#ifdef HAVE_MLX5_UMR_IMKEY
> +	if (!attr.umr_indirect_mkey_disabled &&
> +	    !attr.umr_modify_entity_size_disabled)
> +		priv->has_umr = 1;
> +	if (priv->has_umr)
> +		priv->regexdev->enqueue = mlx5_regexdev_enqueue_gga;
> #endif
>  	priv->regexdev->dequeue = mlx5_regexdev_dequeue;
>  	priv->regexdev->device = (struct rte_device *)pci_dev;
>  	priv->regexdev->data->dev_private = priv; @@ -213,6 +220,8 @@
> mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
>  	    rte_errno = ENOMEM;
>  		goto error;
>  	}
> +	DRV_LOG(INFO, "RegEx GGA is %s.",
> +		priv->has_umr ? "supported" : "unsupported");
>  	return 0;
> 
>  error:
> diff --git a/drivers/regex/mlx5/mlx5_regex.h
> b/drivers/regex/mlx5/mlx5_regex.h index a2b3f0d9f3..51a2101e53 100644
> --- a/drivers/regex/mlx5/mlx5_regex.h
> +++ b/drivers/regex/mlx5/mlx5_regex.h
> @@ -15,6 +15,7 @@
>  #include <mlx5_common_devx.h>
> 
>  #include "mlx5_rxp.h"
> +#include "mlx5_regex_utils.h"
> 
>  struct mlx5_regex_sq {
>  	uint16_t log_nb_desc; /* Log 2 number of desc for this object. */
> @@ -40,6 +41,7 @@ struct mlx5_regex_qp {
>  	struct mlx5_regex_job *jobs;
>  	struct ibv_mr *metadata;
>  	struct ibv_mr *outputs;
> +	struct ibv_mr *imkey_addr; /* Indirect mkey array region. */
>  	size_t ci, pi;
>  	struct mlx5_mr_ctrl mr_ctrl;
>  };
> @@ -71,8 +73,29 @@ struct mlx5_regex_priv {
>  	struct mlx5_mr_share_cache mr_scache; /* Global shared MR cache.
> */
>  	uint8_t is_bf2; /* The device is BF2 device. */
>  	uint8_t sq_ts_format; /* Whether SQ supports timestamp formats.
> */
> +	uint8_t has_umr; /* The device supports UMR. */
>  };
> 
> +#ifdef HAVE_IBV_FLOW_DV_SUPPORT
> +static inline int
> +regex_get_pdn(void *pd, uint32_t *pdn)
> +{
> +	struct mlx5dv_obj obj;
> +	struct mlx5dv_pd pd_info;
> +	int ret = 0;
> +
> +	obj.pd.in = pd;
> +	obj.pd.out = &pd_info;
> +	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
> +	if (ret) {
> +		DRV_LOG(DEBUG, "Fail to get PD object info");
> +		return ret;
> +	}
> +	*pdn = pd_info.pdn;
> +	return 0;
> +}
> +#endif
> +
>  /* mlx5_regex.c */
>  int mlx5_regex_start(struct rte_regexdev *dev);
>  int mlx5_regex_stop(struct rte_regexdev *dev);
> @@ -108,5 +131,6 @@ uint16_t mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
>  		       struct rte_regex_ops **ops, uint16_t nb_ops);
>  uint16_t mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
>  		       struct rte_regex_ops **ops, uint16_t nb_ops);
> -
> +uint16_t mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
> +		       struct rte_regex_ops **ops, uint16_t nb_ops);
>  #endif /* MLX5_REGEX_H */
> diff --git a/drivers/regex/mlx5/mlx5_regex_control.c
> b/drivers/regex/mlx5/mlx5_regex_control.c
> index 55fbb419ed..eef0fe579d 100644
> --- a/drivers/regex/mlx5/mlx5_regex_control.c
> +++ b/drivers/regex/mlx5/mlx5_regex_control.c
> @@ -27,6 +27,9 @@
> 
>  #define MLX5_REGEX_NUM_WQE_PER_PAGE (4096/64)
> 
> +#define MLX5_REGEX_WQE_LOG_NUM(has_umr, log_desc) \
> +		((has_umr) ? ((log_desc) + 2) : (log_desc))
> +
>  /**
>   * Returns the number of qp obj to be created.
>   *
> @@ -91,26 +94,6 @@ regex_ctrl_create_cq(struct mlx5_regex_priv *priv,
> struct mlx5_regex_cq *cq)
>  	return 0;
>  }
> 
> -#ifdef HAVE_IBV_FLOW_DV_SUPPORT
> -static int
> -regex_get_pdn(void *pd, uint32_t *pdn)
> -{
> -	struct mlx5dv_obj obj;
> -	struct mlx5dv_pd pd_info;
> -	int ret = 0;
> -
> -	obj.pd.in = pd;
> -	obj.pd.out = &pd_info;
> -	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
> -	if (ret) {
> -		DRV_LOG(DEBUG, "Fail to get PD object info");
> -		return ret;
> -	}
> -	*pdn = pd_info.pdn;
> -	return 0;
> -}
> -#endif
> -
>  /**
>   * Destroy the SQ object.
>   *
> @@ -168,14 +151,16 @@ regex_ctrl_create_sq(struct mlx5_regex_priv *priv,
> struct mlx5_regex_qp *qp,
>  	int ret;
> 
>  	sq->log_nb_desc = log_nb_desc;
> +	sq->sqn = q_ind;
>  	sq->ci = 0;
>  	sq->pi = 0;
>  	ret = regex_get_pdn(priv->pd, &pd_num);
>  	if (ret)
>  		return ret;
>  	attr.wq_attr.pd = pd_num;
> -	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj, log_nb_desc,
> &attr,
> -				  SOCKET_ID_ANY);
> +	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj,
> +			MLX5_REGEX_WQE_LOG_NUM(priv->has_umr,
> log_nb_desc),
> +			&attr, SOCKET_ID_ANY);
>  	if (ret) {
>  		DRV_LOG(ERR, "Can't create SQ object.");
>  		rte_errno = ENOMEM;
> @@ -225,10 +210,18 @@ mlx5_regex_qp_setup(struct rte_regexdev *dev,
> uint16_t qp_ind,
> 
>  	qp = &priv->qps[qp_ind];
>  	qp->flags = cfg->qp_conf_flags;
> -	qp->cq.log_nb_desc = rte_log2_u32(cfg->nb_desc);
> -	qp->nb_desc = 1 << qp->cq.log_nb_desc;
> +	log_desc = rte_log2_u32(cfg->nb_desc);
> +	/*
> +	 * UMR mode requires two WQEs(UMR and RegEx WQE) for one
> descriptor.
> +	 * For CQ, expand the CQE number multiple with 2.
> +	 * For SQ, the UMR and RegEx WQE for one descriptor consumes 4
> WQEBBS,
> +	 * expand the WQE number multiple with 4.
> +	 */
> +	qp->cq.log_nb_desc = log_desc + (!!priv->has_umr);
> +	qp->nb_desc = 1 << log_desc;
>  	if (qp->flags & RTE_REGEX_QUEUE_PAIR_CFG_OOS_F)
> -		qp->nb_obj = regex_ctrl_get_nb_obj(qp->nb_desc);
> +		qp->nb_obj = regex_ctrl_get_nb_obj
> +			(1 << MLX5_REGEX_WQE_LOG_NUM(priv-
> >has_umr, log_desc));
>  	else
>  		qp->nb_obj = 1;
>  	qp->sqs = rte_malloc(NULL,
> diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c
> b/drivers/regex/mlx5/mlx5_regex_fastpath.c
> index beaea7b63f..4f9402c583 100644
> --- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
> +++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
> @@ -32,6 +32,15 @@
>  #define MLX5_REGEX_WQE_GATHER_OFFSET 32
>  #define MLX5_REGEX_WQE_SCATTER_OFFSET 48
>  #define MLX5_REGEX_METADATA_OFF 32
> +#define MLX5_REGEX_UMR_WQE_SIZE 192
> +/* The maximum KLMs can be added to one UMR indirect mkey. */
> +#define MLX5_REGEX_MAX_KLM_NUM 128
> +/* The KLM array size for one job. */
> +#define MLX5_REGEX_KLMS_SIZE \
> +	((MLX5_REGEX_MAX_KLM_NUM) * sizeof(struct mlx5_klm))
> +/* In WQE set mode, the pi should be quarter of the MLX5_REGEX_MAX_WQE_INDEX. */
> +#define MLX5_REGEX_UMR_SQ_PI_IDX(pi, ops) \
> +	(((pi) + (ops)) & (MLX5_REGEX_MAX_WQE_INDEX >> 2))
> 
>  static inline uint32_t
>  sq_size_get(struct mlx5_regex_sq *sq)
> @@ -49,6 +58,8 @@ struct mlx5_regex_job {
>  	uint64_t user_id;
>  	volatile uint8_t *output;
>  	volatile uint8_t *metadata;
> +	struct mlx5_klm *imkey_array; /* Indirect mkey's KLM array. */
> +	struct mlx5_devx_obj *imkey; /* UMR WQE's indirect meky. */
>  } __rte_cached_aligned;
> 
>  static inline void
> @@ -99,12 +110,13 @@ set_wqe_ctrl_seg(struct mlx5_wqe_ctrl_seg *seg,
> uint16_t pi, uint8_t opcode,  }
> 
>  static inline void
> -prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
> -	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
> -	 struct mlx5_regex_job *job)
> +__prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
> +	   struct rte_regex_ops *op, struct mlx5_regex_job *job,
> +	   size_t pi, struct mlx5_klm *klm)
>  {
> -	size_t wqe_offset = (sq->pi & (sq_size_get(sq) - 1)) *
> MLX5_SEND_WQE_BB;
> -	uint32_t lkey;
> +	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
> +			    (MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0))
> +
> +			    (priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE :
> 0);
>  	uint16_t group0 = op->req_flags &
> RTE_REGEX_OPS_REQ_GROUP_ID0_VALID_F ?
>  				op->group_id0 : 0;
>  	uint16_t group1 = op->req_flags &
> RTE_REGEX_OPS_REQ_GROUP_ID1_VALID_F ?
> @@ -122,14 +134,11 @@ prep_one(struct mlx5_regex_priv *priv, struct
> mlx5_regex_qp *qp,
>  			       RTE_REGEX_OPS_REQ_GROUP_ID2_VALID_F |
>  			       RTE_REGEX_OPS_REQ_GROUP_ID3_VALID_F)))
>  		group0 = op->group_id0;
> -	lkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
> -				  &priv->mr_scache, &qp->mr_ctrl,
> -				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
> -				  !!(op->mbuf->ol_flags &
> EXT_ATTACHED_MBUF));
>  	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
>  	int ds = 4; /*  ctrl + meta + input + output */
> 
> -	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe, sq->pi,
> +	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe,
> +			 (priv->has_umr ? (pi * 4 + 3) : pi),
>  			 MLX5_OPCODE_MMO,
> MLX5_OPC_MOD_MMO_REGEX,
>  			 sq->sq_obj.sq->id, 0, ds, 0, 0);
>  	set_regex_ctrl_seg(wqe + 12, 0, group0, group1, group2, group3,
> @@ -137,36 +146,54 @@ prep_one(struct mlx5_regex_priv *priv, struct
> mlx5_regex_qp *qp,
>  	struct mlx5_wqe_data_seg *input_seg =
>  		(struct mlx5_wqe_data_seg *)(wqe +
> 
> MLX5_REGEX_WQE_GATHER_OFFSET);
> -	input_seg->byte_count =
> -		rte_cpu_to_be_32(rte_pktmbuf_data_len(op->mbuf));
> -	input_seg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(op-
> >mbuf,
> -							    uintptr_t));
> -	input_seg->lkey = lkey;
> +	input_seg->byte_count = rte_cpu_to_be_32(klm->byte_count);
> +	input_seg->addr = rte_cpu_to_be_64(klm->address);
> +	input_seg->lkey = klm->mkey;
>  	job->user_id = op->user_id;
> +}
> +
> +static inline void
> +prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
> +	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
> +	 struct mlx5_regex_job *job)
> +{
> +	struct mlx5_klm klm;
> +
> +	klm.byte_count = rte_pktmbuf_data_len(op->mbuf);
> +	klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
> +				  &priv->mr_scache, &qp->mr_ctrl,
> +				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
> +				  !!(op->mbuf->ol_flags &
> EXT_ATTACHED_MBUF));
> +	klm.address = rte_pktmbuf_mtod(op->mbuf, uintptr_t);
> +	__prep_one(priv, sq, op, job, sq->pi, &klm);
>  	sq->db_pi = sq->pi;
>  	sq->pi = (sq->pi + 1) & MLX5_REGEX_MAX_WQE_INDEX;  }
> 
>  static inline void
> -send_doorbell(struct mlx5dv_devx_uar *uar, struct mlx5_regex_sq *sq)
> +send_doorbell(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq)
>  {
> +	struct mlx5dv_devx_uar *uar = priv->uar;
>  	size_t wqe_offset = (sq->db_pi & (sq_size_get(sq) - 1)) *
> -		MLX5_SEND_WQE_BB;
> +		(MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
> +		(priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
>  	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
> -	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se =
> MLX5_WQE_CTRL_CQ_UPDATE;
> +	/* Or the fm_ce_se instead of set, avoid the fence be cleared. */
> +	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se |=
> +MLX5_WQE_CTRL_CQ_UPDATE;
>  	uint64_t *doorbell_addr =
>  		(uint64_t *)((uint8_t *)uar->base_addr + 0x800);
>  	rte_io_wmb();
> -	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((sq-
> >db_pi + 1) &
> -
> MLX5_REGEX_MAX_WQE_INDEX);
> +	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((priv-
> >has_umr ?
> +					(sq->db_pi * 4 + 3) : sq->db_pi) &
> +					MLX5_REGEX_MAX_WQE_INDEX);
>  	rte_wmb();
>  	*doorbell_addr = *(volatile uint64_t *)wqe;
>  	rte_wmb();
>  }
> 
>  static inline int
> -can_send(struct mlx5_regex_sq *sq) {
> -	return ((uint16_t)(sq->pi - sq->ci) < sq_size_get(sq));
> +get_free(struct mlx5_regex_sq *sq) {
> +	return (sq_size_get(sq) - (uint16_t)(sq->pi - sq->ci));
>  }
> 
>  static inline uint32_t
> @@ -174,6 +201,211 @@ job_id_get(uint32_t qid, size_t sq_size, size_t
> index) {
>  	return qid * sq_size + (index & (sq_size - 1));  }
> 
> +#ifdef HAVE_MLX5_UMR_IMKEY
> +static inline int
> +mkey_klm_available(struct mlx5_klm *klm, uint32_t pos, uint32_t new) {
> +	return (klm && ((pos + new) <= MLX5_REGEX_MAX_KLM_NUM)); }
> +
> +static inline void
> +complete_umr_wqe(struct mlx5_regex_qp *qp, struct mlx5_regex_sq *sq,
> +		 struct mlx5_regex_job *mkey_job,
> +		 size_t umr_index, uint32_t klm_size, uint32_t total_len) {
> +	size_t wqe_offset = (umr_index & (sq_size_get(sq) - 1)) *
> +		(MLX5_SEND_WQE_BB * 4);
> +	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg
> *)((uint8_t *)
> +				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
> +	struct mlx5_wqe_umr_ctrl_seg *ucseg =
> +				(struct mlx5_wqe_umr_ctrl_seg *)(wqe + 1);
> +	struct mlx5_wqe_mkey_context_seg *mkc =
> +				(struct mlx5_wqe_mkey_context_seg
> *)(ucseg + 1);
> +	struct mlx5_klm *iklm = (struct mlx5_klm *)(mkc + 1);
> +	uint16_t klm_align = RTE_ALIGN(klm_size, 4);
> +
> +	memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
> +	/* Set WQE control seg. Non-inline KLM UMR WQE size must be 9
> WQE_DS. */
> +	set_wqe_ctrl_seg(wqe, (umr_index * 4), MLX5_OPCODE_UMR,
> +			 0, sq->sq_obj.sq->id, 0, 9, 0,
> +			 rte_cpu_to_be_32(mkey_job->imkey->id));
> +	/* Set UMR WQE control seg. */
> +	ucseg->mkey_mask |=
> rte_cpu_to_be_64(MLX5_WQE_UMR_CTRL_MKEY_MASK_LEN |
> +
> 	MLX5_WQE_UMR_CTRL_FLAG_TRNSLATION_OFFSET |
> +
> 	MLX5_WQE_UMR_CTRL_MKEY_MASK_ACCESS_LOCAL_WRITE);
> +	ucseg->klm_octowords = rte_cpu_to_be_16(klm_align);
> +	/* Set mkey context seg. */
> +	mkc->len = rte_cpu_to_be_64(total_len);
> +	mkc->qpn_mkey = rte_cpu_to_be_32(0xffffff00 |
> +					(mkey_job->imkey->id & 0xff));
> +	/* Set UMR pointer to data seg. */
> +	iklm->address = rte_cpu_to_be_64
> +				((uintptr_t)((char *)mkey_job-
> >imkey_array));
> +	iklm->mkey = rte_cpu_to_be_32(qp->imkey_addr->lkey);
> +	iklm->byte_count = rte_cpu_to_be_32(klm_align);
> +	/* Clear the padding memory. */
> +	memset((uint8_t *)&mkey_job->imkey_array[klm_size], 0,
> +	       sizeof(struct mlx5_klm) * (klm_align - klm_size));
> +
> +	/* Add the following RegEx WQE with fence. */
> +	wqe = (struct mlx5_wqe_ctrl_seg *)
> +				(((uint8_t *)wqe) +
> MLX5_REGEX_UMR_WQE_SIZE);
> +	wqe->fm_ce_se |= MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE;
> +}
> +
> +static inline void
> +prep_nop_regex_wqe_set(struct mlx5_regex_priv *priv, struct
> mlx5_regex_sq *sq,
> +		       struct rte_regex_ops *op, struct mlx5_regex_job *job,
> +		       size_t pi, struct mlx5_klm *klm) {
> +	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
> +			    (MLX5_SEND_WQE_BB << 2);
> +	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg
> *)((uint8_t *)
> +				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
> +
> +	/* Clear the WQE memory used as UMR WQE previously. */
> +	if ((rte_be_to_cpu_32(wqe->opmod_idx_opcode) & 0xff) !=
> MLX5_OPCODE_NOP)
> +		memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
> +	/* UMR WQE size is 9 DS, align nop WQE to 3 WQEBBS(12 DS). */
> +	set_wqe_ctrl_seg(wqe, pi * 4, MLX5_OPCODE_NOP, 0, sq-
> >sq_obj.sq->id,
> +			 0, 12, 0, 0);
> +	__prep_one(priv, sq, op, job, pi, klm); }
> +
> +static inline void
> +prep_regex_umr_wqe_set(struct mlx5_regex_priv *priv, struct
> mlx5_regex_qp *qp,
> +	 struct mlx5_regex_sq *sq, struct rte_regex_ops **op, size_t
> nb_ops) {
> +	struct mlx5_regex_job *job = NULL;
> +	size_t sqid = sq->sqn, mkey_job_id = 0;
> +	size_t left_ops = nb_ops;
> +	uint32_t klm_num = 0, len;
> +	struct mlx5_klm *mkey_klm = NULL;
> +	struct mlx5_klm klm;
> +
> +	sqid = sq->sqn;
> +	while (left_ops--)
> +		rte_prefetch0(op[left_ops]);
> +	left_ops = nb_ops;
> +	/*
> +	 * Build the WQE set by reverse. In case the burst may consume
> +	 * multiple mkeys, build the WQE set as normal will hard to
> +	 * address the last mkey index, since we will only know the last
> +	 * RegEx WQE's index when finishes building.
> +	 */
> +	while (left_ops--) {
> +		struct rte_mbuf *mbuf = op[left_ops]->mbuf;
> +		size_t pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, left_ops);
> +
> +		if (mbuf->nb_segs > 1) {
> +			size_t scatter_size = 0;
> +
> +			if (!mkey_klm_available(mkey_klm, klm_num,
> +						mbuf->nb_segs)) {
> +				/*
> +				 * The mkey's KLM is full, create the UMR
> +				 * WQE in the next WQE set.
> +				 */
> +				if (mkey_klm)
> +					complete_umr_wqe(qp, sq,
> +						&qp->jobs[mkey_job_id],
> +
> 	MLX5_REGEX_UMR_SQ_PI_IDX(pi, 1),
> +						klm_num, len);
> +				/*
> +				 * Get the indircet mkey and KLM array index
> +				 * from the last WQE set.
> +				 */
> +				mkey_job_id = job_id_get(sqid,
> +							 sq_size_get(sq), pi);
> +				mkey_klm = qp-
> >jobs[mkey_job_id].imkey_array;
> +				klm_num = 0;
> +				len = 0;
> +			}
> +			/* Build RegEx WQE's data segment KLM. */
> +			klm.address = len;
> +			klm.mkey = rte_cpu_to_be_32
> +					(qp->jobs[mkey_job_id].imkey->id);
> +			while (mbuf) {
> +				/* Build indirect mkey seg's KLM. */
> +				mkey_klm->mkey =
> mlx5_mr_addr2mr_bh(priv->pd,
> +					NULL, &priv->mr_scache, &qp-
> >mr_ctrl,
> +					rte_pktmbuf_mtod(mbuf, uintptr_t),
> +					!!(mbuf->ol_flags &
> EXT_ATTACHED_MBUF));
> +				mkey_klm->address = rte_cpu_to_be_64
> +					(rte_pktmbuf_mtod(mbuf,
> uintptr_t));
> +				mkey_klm->byte_count = rte_cpu_to_be_32
> +
> 	(rte_pktmbuf_data_len(mbuf));
> +				/*
> +				 * Save the mbuf's total size for RegEx data
> +				 * segment.
> +				 */
> +				scatter_size +=
> rte_pktmbuf_data_len(mbuf);
> +				mkey_klm++;
> +				klm_num++;
> +				mbuf = mbuf->next;
> +			}
> +			len += scatter_size;
> +			klm.byte_count = scatter_size;
> +		} else {
> +			/* The single mbuf case. Build the KLM directly. */
> +			klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, NULL,
> +					&priv->mr_scache, &qp->mr_ctrl,
> +					rte_pktmbuf_mtod(mbuf, uintptr_t),
> +					!!(mbuf->ol_flags & EXT_ATTACHED_MBUF));
> +			klm.address = rte_pktmbuf_mtod(mbuf, uintptr_t);
> +			klm.byte_count = rte_pktmbuf_data_len(mbuf);
> +		}
> +		job = &qp->jobs[job_id_get(sqid, sq_size_get(sq), pi)];
> +		/*
> +		 * Build the nop + RegEx WQE set by default. The first nop WQE
> +		 * will be updated later as UMR WQE if scattered mbufs exist.
> +		 */
> +		prep_nop_regex_wqe_set(priv, sq, op[left_ops], job, pi, &klm);
> +	}
> +	/*
> +	 * Scattered mbuf have been added to the KLM array. Complete the build
> +	 * of UMR WQE, update the first nop WQE as UMR WQE.
> +	 */
> +	if (mkey_klm)
> +		complete_umr_wqe(qp, sq, &qp->jobs[mkey_job_id], sq->pi,
> +				 klm_num, len);
> +	sq->db_pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops - 1);
> +	sq->pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops);
> +}
> +
> +uint16_t
> +mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
> +			  struct rte_regex_ops **ops, uint16_t nb_ops)
> +{
> +	struct mlx5_regex_priv *priv = dev->data->dev_private;
> +	struct mlx5_regex_qp *queue = &priv->qps[qp_id];
> +	struct mlx5_regex_sq *sq;
> +	size_t sqid, nb_left = nb_ops, nb_desc;
> +
> +	while ((sqid = ffs(queue->free_sqs))) {
> +		sqid--; /* ffs returns 1 for bit 0 */
> +		sq = &queue->sqs[sqid];
> +		nb_desc = get_free(sq);
> +		if (nb_desc) {
> +			/* The ops to be handled can't exceed nb_ops. */
> +			if (nb_desc > nb_left)
> +				nb_desc = nb_left;
> +			else
> +				queue->free_sqs &= ~(1 << sqid);
> +			prep_regex_umr_wqe_set(priv, queue, sq, ops, nb_desc);
> +			send_doorbell(priv, sq);
> +			nb_left -= nb_desc;
> +		}
> +		if (!nb_left)
> +			break;
> +		ops += nb_desc;
> +	}
> +	nb_ops -= nb_left;
> +	queue->pi += nb_ops;
> +	return nb_ops;
> +}
> +#endif
> +
>  uint16_t
>  mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
>  		      struct rte_regex_ops **ops, uint16_t nb_ops)
> @@ -186,17 +418,17 @@ mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
>  	while ((sqid = ffs(queue->free_sqs))) {
>  		sqid--; /* ffs returns 1 for bit 0 */
>  		sq = &queue->sqs[sqid];
> -		while (can_send(sq)) {
> +		while (get_free(sq)) {
>  			job_id = job_id_get(sqid, sq_size_get(sq), sq->pi);
>  			prep_one(priv, queue, sq, ops[i], &queue->jobs[job_id]);
>  			i++;
>  			if (unlikely(i == nb_ops)) {
> -				send_doorbell(priv->uar, sq);
> +				send_doorbell(priv, sq);
>  				goto out;
>  			}
>  		}
>  		queue->free_sqs &= ~(1 << sqid);
> -		send_doorbell(priv->uar, sq);
> +		send_doorbell(priv, sq);
>  	}
> 
>  out:
> @@ -308,6 +540,10 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
>  			  MLX5_REGEX_MAX_WQE_INDEX;
>  		size_t sqid = cqe->rsvd3[2];
>  		struct mlx5_regex_sq *sq = &queue->sqs[sqid];
> +
> +		/* In UMR mode the WQE counter moves per WQE set (4 WQEBBs). */
> +		if (priv->has_umr)
> +			wq_counter >>= 2;
>  		while (sq->ci != wq_counter) {
>  			if (unlikely(i == nb_ops)) {
>  				/* Return without updating cq->ci */
> @@ -316,7 +552,9 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
>  			uint32_t job_id = job_id_get(sqid, sq_size_get(sq),
>  						     sq->ci);
>  			extract_result(ops[i], &queue->jobs[job_id]);
> -			sq->ci = (sq->ci + 1) & MLX5_REGEX_MAX_WQE_INDEX;
> +			sq->ci = (sq->ci + 1) & (priv->has_umr ?
> +				 (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
> +				  MLX5_REGEX_MAX_WQE_INDEX);
>  			i++;
>  		}
>  		cq->ci = (cq->ci + 1) & 0xffffff;
> @@ -331,7 +569,7 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
>  }
> 
>  static void
> -setup_sqs(struct mlx5_regex_qp *queue)
> +setup_sqs(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *queue)
>  {
>  	size_t sqid, entry;
>  	uint32_t job_id;
> @@ -342,6 +580,14 @@ setup_sqs(struct mlx5_regex_qp *queue)
>  			job_id = sqid * sq_size_get(sq) + entry;
>  			struct mlx5_regex_job *job = &queue->jobs[job_id];
> 
> +			/* Fill UMR WQE with NOP in advance. */
> +			if (priv->has_umr) {
> +				set_wqe_ctrl_seg
> +					((struct mlx5_wqe_ctrl_seg *)wqe,
> +					 entry * 2, MLX5_OPCODE_NOP, 0,
> +					 sq->sq_obj.sq->id, 0, 12, 0, 0);
> +				wqe += MLX5_REGEX_UMR_WQE_SIZE;
> +			}
>  			set_metadata_seg((struct mlx5_wqe_metadata_seg *)
>  					 (wqe + MLX5_REGEX_WQE_METADATA_OFFSET),
>  					 0, queue->metadata->lkey,
> @@ -358,8 +604,9 @@ setup_sqs(struct mlx5_regex_qp *queue)
>  }
> 
>  static int
> -setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
> +setup_buffers(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp)
>  {
> +	struct ibv_pd *pd = priv->pd;
>  	uint32_t i;
>  	int err;
> 
> @@ -395,6 +642,24 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
>  		goto err_output;
>  	}
> 
> +	if (priv->has_umr) {
> +		ptr = rte_calloc(__func__, qp->nb_desc, MLX5_REGEX_KLMS_SIZE,
> +				 MLX5_REGEX_KLMS_SIZE);
> +		if (!ptr) {
> +			err = -ENOMEM;
> +			goto err_imkey;
> +		}
> +		qp->imkey_addr = mlx5_glue->reg_mr(pd, ptr,
> +					MLX5_REGEX_KLMS_SIZE * qp->nb_desc,
> +					IBV_ACCESS_LOCAL_WRITE);
> +		if (!qp->imkey_addr) {
> +			rte_free(ptr);
> +			DRV_LOG(ERR, "Failed to register output");
> +			err = -EINVAL;
> +			goto err_imkey;
> +		}
> +	}
> +
>  	/* distribute buffers to jobs */
>  	for (i = 0; i < qp->nb_desc; i++) {
>  		qp->jobs[i].output =
> @@ -403,9 +668,18 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
>  		qp->jobs[i].metadata =
>  			(uint8_t *)qp->metadata->addr +
>  			(i % qp->nb_desc) * MLX5_REGEX_METADATA_SIZE;
> +		if (qp->imkey_addr)
> +			qp->jobs[i].imkey_array = (struct mlx5_klm *)
> +				qp->imkey_addr->addr +
> +				(i % qp->nb_desc) * MLX5_REGEX_MAX_KLM_NUM;
>  	}
> +
>  	return 0;
> 
> +err_imkey:
> +	ptr = qp->outputs->addr;
> +	rte_free(ptr);
> +	mlx5_glue->dereg_mr(qp->outputs);
>  err_output:
>  	ptr = qp->metadata->addr;
>  	rte_free(ptr);
> @@ -417,23 +691,57 @@ int
>  mlx5_regexdev_setup_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
>  {
>  	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
> -	int err;
> +	struct mlx5_klm klm = { 0 };
> +	struct mlx5_devx_mkey_attr attr = {
> +		.klm_array = &klm,
> +		.klm_num = 1,
> +		.umr_en = 1,
> +	};
> +	uint32_t i;
> +	int err = 0;
> 
>  	qp->jobs = rte_calloc(__func__, qp->nb_desc, sizeof(*qp->jobs), 64);
>  	if (!qp->jobs)
>  		return -ENOMEM;
> -	err = setup_buffers(qp, priv->pd);
> +	err = setup_buffers(priv, qp);
>  	if (err) {
>  		rte_free(qp->jobs);
>  		return err;
>  	}
> -	setup_sqs(qp);
> -	return 0;
> +
> +	setup_sqs(priv, qp);
> +
> +	if (priv->has_umr) {
> +#ifdef HAVE_IBV_FLOW_DV_SUPPORT
> +		if (regex_get_pdn(priv->pd, &attr.pd)) {
> +			err = -rte_errno;
> +			DRV_LOG(ERR, "Failed to get pdn.");
> +			mlx5_regexdev_teardown_fastpath(priv, qp_id);
> +			return err;
> +		}
> +#endif
> +		for (i = 0; i < qp->nb_desc; i++) {
> +			attr.klm_num = MLX5_REGEX_MAX_KLM_NUM;
> +			attr.klm_array = qp->jobs[i].imkey_array;
> +			qp->jobs[i].imkey = mlx5_devx_cmd_mkey_create(priv->ctx,
> +								      &attr);
> +			if (!qp->jobs[i].imkey) {
> +				err = -rte_errno;
> +				DRV_LOG(ERR, "Failed to allocate imkey.");
> +				mlx5_regexdev_teardown_fastpath(priv, qp_id);
> +			}
> +		}
> +	}
> +	return err;
>  }
> 
>  static void
>  free_buffers(struct mlx5_regex_qp *qp)
>  {
> +	if (qp->imkey_addr) {
> +		mlx5_glue->dereg_mr(qp->imkey_addr);
> +		rte_free(qp->imkey_addr->addr);
> +	}
>  	if (qp->metadata) {
>  		mlx5_glue->dereg_mr(qp->metadata);
>  		rte_free(qp->metadata->addr);
> @@ -448,8 +756,14 @@ void
>  mlx5_regexdev_teardown_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
>  {
>  	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
> +	uint32_t i;
> 
>  	if (qp) {
> +		for (i = 0; i < qp->nb_desc; i++) {
> +			if (qp->jobs[i].imkey)
> +				claim_zero(mlx5_devx_cmd_destroy
> +							(qp->jobs[i].imkey));
> +		}
>  		free_buffers(qp);
>  		if (qp->jobs)
>  			rte_free(qp->jobs);
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/4] regex/mlx5: add data path scattered mbuf process
  2021-03-30  8:05     ` Slava Ovsiienko
@ 2021-03-30  9:00       ` Suanming Mou
  0 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-30  9:00 UTC (permalink / raw)
  To: Slava Ovsiienko, Ori Kam; +Cc: dev, Matan Azrad, Raslan Darawsheh

Hi Slava,

> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Tuesday, March 30, 2021 4:05 PM
> To: Suanming Mou <suanmingm@nvidia.com>; Ori Kam <orika@nvidia.com>
> Cc: dev@dpdk.org; Matan Azrad <matan@nvidia.com>; Raslan Darawsheh
> <rasland@nvidia.com>
> Subject: RE: [PATCH v3 2/4] regex/mlx5: add data path scattered mbuf process
> 
> > -----Original Message-----
> > From: Suanming Mou <suanmingm@nvidia.com>
> > Sent: Tuesday, March 30, 2021 4:39
> > To: Ori Kam <orika@nvidia.com>
> > Cc: dev@dpdk.org; Slava Ovsiienko <viacheslavo@nvidia.com>; Matan
> > Azrad <matan@nvidia.com>; Raslan Darawsheh <rasland@nvidia.com>
> > Subject: [PATCH v3 2/4] regex/mlx5: add data path scattered mbuf
> > process
> >
> Nice feature, but I would fix the typos and reword a bit:
> 
> > UMR WQE can convert multiple mkey's memory sapce to contiguous space.
> Typo: "sapce?"
> 
> And rather not "convert mkey" but "present data buffers scattered within
> multiple mbufs with single indirect mkey".
> 
> 
> > Take advantage of the UMR WQE, scattered mbuf in one operation can be
> > converted to an indirect mkey. The RegEx which only accepts one mkey
> > can now process the whole scattered mbuf.
> I would add "in one operation."
> 
> >
> > The maximum scattered mbuf can be supported in one UMR WQE is now
> > defined as 64. Multiple operations scattered mbufs can be add to one
> > UMR
> Typos: "THE multiple", "added"
> 
> I would reword - "The mbufs from multiple operations can be combined into one
> UMR". Also, I would add a few words on what UMR is.
> 
> > WQE if there is enough space in the KLM array, since the operations
> > can address their own mbuf's content by the mkey's address and length.
> > However, one operation's scattered mbuf's can't be placed in two
> > different UMR WQE's KLM array, if the UMR WQE's KLM does not has
> > enough free space for one operation, a new UMR WQE will be required.
> I would say "the extra UMR WQE will be engaged"
> 
> >
> > In case the UMR WQE's indirect mkey will be over wrapped by the SQ's
> > WQE move, the meky's index used by the UMR WQE should be the index of
> > last
> typo: "meky"
> 
> > the RegEX WQE in the operations. As one operation consumes one WQE
> > set, build the RegEx WQE by reverse helps address the mkey more efficiently.
> typo: TO address
> 
> With best regards,
> Slava
> 

Thanks very much for helping to improve the log. I will wait a day or two to see whether there are other new comments and then update it in the new version.

BR,
Suanming

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v4 0/4] regex/mlx5: support scattered mbuf
  2021-03-09 23:57 [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Suanming Mou
                   ` (5 preceding siblings ...)
  2021-03-30  1:39 ` [dpdk-dev] [PATCH v3 0/4] regex/mlx5: support scattered mbuf Suanming Mou
@ 2021-03-31  7:37 ` Suanming Mou
  2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 1/4] common/mlx5: add user memory registration bits Suanming Mou
                     ` (3 more replies)
  2021-04-07  7:21 ` [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf Suanming Mou
  7 siblings, 4 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-31  7:37 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

Scattered mbufs were not supported in the mlx5 RegEx driver. This
patch set adds scattered mbuf support by means of the UMR WQE.

A UMR (User-Mode Memory Registration) WQE can present data buffers
scattered across multiple mbufs through a single indirect mkey. By
taking advantage of the UMR WQE, the scattered mbuf of one operation
can be presented as one indirect mkey. The RegEx engine, which accepts
only one mkey, can now process the whole scattered mbuf in one
operation.

The maximum number of scattered mbuf segments supported in one UMR WQE
is currently defined as 64. The mbufs from multiple operations can
also be combined into one UMR WQE if there is enough space in the KLM
array, since each operation can address its own mbuf content by the
mkey's address and length. However, the scattered mbufs of one
operation can't be split across the KLM arrays of two different UMR
WQEs. If the UMR WQE's KLM array does not have enough free space for
one operation, an extra UMR WQE will be engaged.
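
A minimal sketch of how operations sharing one indirect mkey can each
address their own data by offset and length. It is an illustration
only, not the driver code: sketch_range and sketch_append_op are
hypothetical names, and the capacity of 128 is taken from the
MLX5_REGEX_MAX_KLM_NUM define in this series.

```c
#include <stdint.h>

/* Assumed KLM array capacity (MLX5_REGEX_MAX_KLM_NUM in this series). */
#define SKETCH_MAX_KLM_NUM 128u

/* Where one operation's data sits inside the indirect mkey's space. */
struct sketch_range {
	uint64_t offset; /* Start offset within the indirect mkey. */
	uint32_t length; /* Sum of the operation's segment lengths. */
};

/*
 * Hypothetical helper: append one operation's segments to a shared KLM
 * array. Returns 0 on success, -1 if the segments do not all fit, in
 * which case the caller would start a new KLM array (a new UMR WQE).
 */
static inline int
sketch_append_op(uint32_t *klm_used, uint64_t *total_len,
		 const uint32_t *seg_len, uint32_t nb_segs,
		 struct sketch_range *out)
{
	uint32_t bytes = 0;
	uint32_t i;

	/* One operation is never split across two KLM arrays. */
	if (*klm_used + nb_segs > SKETCH_MAX_KLM_NUM)
		return -1;
	for (i = 0; i < nb_segs; i++)
		bytes += seg_len[i];
	/* The operation starts where the previous one ended. */
	out->offset = *total_len;
	out->length = bytes;
	*klm_used += nb_segs;
	*total_len += bytes;
	return 0;
}
```

With two operations of segment lengths {100, 50} and {10, 20, 30}, the
first would be addressed at offset 0 with length 150 and the second at
offset 150 with length 60, using five KLM entries in total.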

To avoid the UMR WQE's indirect mkey being overwritten when the SQ's
WQE index wraps around, the mkey index used by the UMR WQE should be
the index of the last RegEx WQE in the operations. As one operation
consumes one WQE set, building the RegEx WQEs in reverse order helps
to address the mkey more efficiently. When the operations in one burst
consume multiple mkeys and the mkey KLM array becomes full, the
reverse WQE set index will always be the last one covered by the new
mkey for the new UMR WQE.

In GGA mode, the SQ's WQE memory layout becomes interleaved UMR/NOP
and RegEx WQEs. A UMR/NOP WQE together with its RegEx WQE is called a
WQE set. The SQ's pi and ci are also advanced per WQE set rather than
per WQE.

For operations without a scattered mbuf, the mbuf's mkey is used
directly and the WQE set combination is NOP + RegEx.
For operations that have a scattered mbuf but share the UMR WQE with
others, the WQE set combination is NOP + RegEx.
For operations that complete the UMR WQE, the WQE set combination is
UMR + RegEx.
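
The three cases above can be sketched as a small simulation of the
reverse build order. This is an illustration under stated assumptions,
not the driver code: sketch_classify_sets and the enum are hypothetical
names, the capacity of 128 is taken from MLX5_REGEX_MAX_KLM_NUM in this
series, and every operation is assumed to have no more segments than
that capacity.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed KLM array capacity (MLX5_REGEX_MAX_KLM_NUM in this series). */
#define SKETCH_KLM_CAPACITY 128u

enum sketch_set_kind {
	SKETCH_NOP_REGEX, /* First WQE of the set stays a NOP. */
	SKETCH_UMR_REGEX, /* First WQE of the set becomes the UMR WQE. */
};

/*
 * Hypothetical helper: nb_segs[i] is the segment count of operation i,
 * kind[i] receives the combination used by WQE set i of the burst.
 */
static inline void
sketch_classify_sets(const uint16_t *nb_segs, size_t nb_ops,
		     enum sketch_set_kind *kind)
{
	uint32_t klm_used = 0;
	int have_array = 0;
	size_t i = nb_ops;

	/* Walk the burst from the last operation to the first. */
	while (i--) {
		kind[i] = SKETCH_NOP_REGEX;
		if (nb_segs[i] <= 1)
			continue; /* Uses the mbuf's own mkey. */
		if (!have_array ||
		    klm_used + nb_segs[i] > SKETCH_KLM_CAPACITY) {
			/* Close the previous array in the set just built. */
			if (have_array)
				kind[i + 1] = SKETCH_UMR_REGEX;
			have_array = 1;
			klm_used = 0;
		}
		klm_used += nb_segs[i];
	}
	/* The last open array is completed in the first set of the burst. */
	if (have_array)
		kind[0] = SKETCH_UMR_REGEX;
}
```

For a burst with segment counts {100, 100, 1}, the sketch marks sets 0
and 1 as UMR + RegEx and set 2 as NOP + RegEx, because the two
scattered operations do not fit into one KLM array together.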

v4:
1. git log improvement.

v3:
1. Move testregex.rst change to the correct commit.
2. Code rebase to the latest version.

v2:
1. Check mbuf multiple seg by nb_segs.
2. Add ops prefetch.
3. Allocate ops and mbuf memory together in test application.
4. Fix the incorrect ci and pi issue.

John Hurley (1):
  regex/mlx5: prevent wrong calculation of free sqs in umr mode

Suanming Mou (3):
  common/mlx5: add user memory registration bits
  regex/mlx5: add data path scattered mbuf process
  app/test-regex: support scattered mbuf input

 app/test-regex/main.c                    | 134 ++++++--
 doc/guides/regexdevs/mlx5.rst            |   5 +
 doc/guides/rel_notes/release_21_05.rst   |   4 +
 doc/guides/tools/testregex.rst           |   3 +
 drivers/common/mlx5/linux/meson.build    |   2 +
 drivers/common/mlx5/mlx5_devx_cmds.c     |   5 +
 drivers/common/mlx5/mlx5_devx_cmds.h     |   3 +
 drivers/regex/mlx5/mlx5_regex.c          |   9 +
 drivers/regex/mlx5/mlx5_regex.h          |  26 +-
 drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 380 +++++++++++++++++++++--
 11 files changed, 531 insertions(+), 83 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v4 1/4] common/mlx5: add user memory registration bits
  2021-03-31  7:37 ` [dpdk-dev] [PATCH v4 0/4] regex/mlx5: support scattered mbuf Suanming Mou
@ 2021-03-31  7:37   ` Suanming Mou
  2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 2/4] regex/mlx5: add data path scattered mbuf process Suanming Mou
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-31  7:37 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

This commit adds the UMR capability bits.

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
---
 drivers/common/mlx5/linux/meson.build | 2 ++
 drivers/common/mlx5/mlx5_devx_cmds.c  | 5 +++++
 drivers/common/mlx5/mlx5_devx_cmds.h  | 3 +++
 3 files changed, 10 insertions(+)

diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 220de35420..5d6a861689 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -186,6 +186,8 @@ has_sym_args = [
 	'mlx5dv_dr_action_create_aso' ],
 	[ 'HAVE_INFINIBAND_VERBS_H', 'infiniband/verbs.h',
 	'INFINIBAND_VERBS_H' ],
+        [ 'HAVE_MLX5_UMR_IMKEY', 'infiniband/mlx5dv.h',
+        'MLX5_WQE_UMR_CTRL_FLAG_INLINE' ],
 ]
 config = configuration_data()
 foreach arg:has_sym_args
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.c b/drivers/common/mlx5/mlx5_devx_cmds.c
index c90e020643..268bcd0d99 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.c
+++ b/drivers/common/mlx5/mlx5_devx_cmds.c
@@ -266,6 +266,7 @@ mlx5_devx_cmd_mkey_create(void *ctx,
 	MLX5_SET(mkc, mkc, qpn, 0xffffff);
 	MLX5_SET(mkc, mkc, pd, attr->pd);
 	MLX5_SET(mkc, mkc, mkey_7_0, attr->umem_id & 0xFF);
+	MLX5_SET(mkc, mkc, umr_en, attr->umr_en);
 	MLX5_SET(mkc, mkc, translations_octword_size, translation_size);
 	MLX5_SET(mkc, mkc, relaxed_ordering_write,
 		 attr->relaxed_ordering_write);
@@ -752,6 +753,10 @@ mlx5_devx_cmd_query_hca_attr(void *ctx,
 						mini_cqe_resp_flow_tag);
 	attr->mini_cqe_resp_l3_l4_tag = MLX5_GET(cmd_hca_cap, hcattr,
 						 mini_cqe_resp_l3_l4_tag);
+	attr->umr_indirect_mkey_disabled =
+		MLX5_GET(cmd_hca_cap, hcattr, umr_indirect_mkey_disabled);
+	attr->umr_modify_entity_size_disabled =
+		MLX5_GET(cmd_hca_cap, hcattr, umr_modify_entity_size_disabled);
 	if (attr->qos.sup) {
 		MLX5_SET(query_hca_cap_in, in, op_mod,
 			 MLX5_GET_HCA_CAP_OP_MOD_QOS_CAP |
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h b/drivers/common/mlx5/mlx5_devx_cmds.h
index 2826c0b2c6..67b5f771c6 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.h
+++ b/drivers/common/mlx5/mlx5_devx_cmds.h
@@ -31,6 +31,7 @@ struct mlx5_devx_mkey_attr {
 	uint32_t pg_access:1;
 	uint32_t relaxed_ordering_write:1;
 	uint32_t relaxed_ordering_read:1;
+	uint32_t umr_en:1;
 	struct mlx5_klm *klm_array;
 	int klm_num;
 };
@@ -151,6 +152,8 @@ struct mlx5_hca_attr {
 	uint32_t log_max_mmo_dma:5;
 	uint32_t log_max_mmo_compress:5;
 	uint32_t log_max_mmo_decompress:5;
+	uint32_t umr_modify_entity_size_disabled:1;
+	uint32_t umr_indirect_mkey_disabled:1;
 };
 
 struct mlx5_devx_wq_attr {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v4 2/4] regex/mlx5: add data path scattered mbuf process
  2021-03-31  7:37 ` [dpdk-dev] [PATCH v4 0/4] regex/mlx5: support scattered mbuf Suanming Mou
  2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 1/4] common/mlx5: add user memory registration bits Suanming Mou
@ 2021-03-31  7:37   ` Suanming Mou
  2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 3/4] app/test-regex: support scattered mbuf input Suanming Mou
  2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode Suanming Mou
  3 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-31  7:37 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

A UMR (User-Mode Memory Registration) WQE can present data buffers
scattered across multiple mbufs through a single indirect mkey. By
taking advantage of the UMR WQE, the scattered mbuf of one operation
can be presented as one indirect mkey. The RegEx engine, which accepts
only one mkey, can now process the whole scattered mbuf in one
operation.

The maximum number of scattered mbuf segments supported in one UMR WQE
is currently defined as 64. The mbufs from multiple operations can
also be combined into one UMR WQE if there is enough space in the KLM
array, since each operation can address its own mbuf content by the
mkey's address and length. However, the scattered mbufs of one
operation can't be split across the KLM arrays of two different UMR
WQEs. If the UMR WQE's KLM array does not have enough free space for
one operation, an extra UMR WQE will be engaged.

To avoid the UMR WQE's indirect mkey being overwritten when the SQ's
WQE index wraps around, the mkey index used by the UMR WQE should be
the index of the last RegEx WQE in the operations. As one operation
consumes one WQE set, building the RegEx WQEs in reverse order helps
to address the mkey more efficiently. When the operations in one burst
consume multiple mkeys and the mkey KLM array becomes full, the
reverse WQE set index will always be the last one covered by the new
mkey for the new UMR WQE.

In GGA mode, the SQ's WQE memory layout becomes interleaved UMR/NOP
and RegEx WQEs. A UMR/NOP WQE together with its RegEx WQE is called a
WQE set. The SQ's pi and ci are also advanced per WQE set rather than
per WQE.
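
A minimal sketch of the WQE set index arithmetic, assuming a 64-byte
WQEBB, a 16-bit hardware WQE counter and a power-of-two SQ size. The
192-byte UMR WQE size and the factor of 4 WQEBBs per set come from this
patch; all sketch_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_WQEBB_SIZE 64u        /* Assumed WQE basic block size. */
#define SKETCH_UMR_WQE_SIZE 192u     /* MLX5_REGEX_UMR_WQE_SIZE. */
#define SKETCH_WQEBBS_PER_SET 4u     /* UMR/NOP (3) + RegEx (1). */
#define SKETCH_MAX_WQE_INDEX 0xffffu /* Assumed 16-bit WQE counter. */

/* Byte offset of WQE set pi; its UMR/NOP WQE starts here. */
static inline size_t
sketch_set_offset(size_t pi, size_t sq_size)
{
	return (pi & (sq_size - 1)) *
	       (SKETCH_WQEBB_SIZE * SKETCH_WQEBBS_PER_SET);
}

/* Byte offset of the RegEx WQE, placed right after the UMR/NOP WQE. */
static inline size_t
sketch_regex_wqe_offset(size_t pi, size_t sq_size)
{
	return sketch_set_offset(pi, sq_size) + SKETCH_UMR_WQE_SIZE;
}

/* WQEBB index written to the RegEx WQE's control segment. */
static inline uint16_t
sketch_regex_wqe_index(size_t pi)
{
	return (uint16_t)((pi * 4 + 3) & SKETCH_MAX_WQE_INDEX);
}

/* Advance pi by a number of WQE sets; it wraps at a quarter range. */
static inline size_t
sketch_next_pi(size_t pi, size_t ops)
{
	return (pi + ops) & (SKETCH_MAX_WQE_INDEX >> 2);
}

/* Convert a counter expressed in WQEBBs into WQE set units. */
static inline size_t
sketch_counter_to_set(uint16_t wqe_counter)
{
	return wqe_counter >> 2;
}
```

Because each set spans four WQEBBs, the set index uses only a quarter
of the hardware index range, which is why pi and ci are masked with
the maximum WQE index shifted right by two.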

For operations without a scattered mbuf, the mbuf's mkey is used
directly and the WQE set combination is NOP + RegEx.
For operations that have a scattered mbuf but share the UMR WQE with
others, the WQE set combination is NOP + RegEx.
For operations that complete the UMR WQE, the WQE set combination is
UMR + RegEx.

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
---
 doc/guides/regexdevs/mlx5.rst            |   5 +
 doc/guides/rel_notes/release_21_05.rst   |   4 +
 drivers/regex/mlx5/mlx5_regex.c          |   9 +
 drivers/regex/mlx5/mlx5_regex.h          |  26 +-
 drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 378 +++++++++++++++++++++--
 6 files changed, 407 insertions(+), 58 deletions(-)

diff --git a/doc/guides/regexdevs/mlx5.rst b/doc/guides/regexdevs/mlx5.rst
index faaa6ac11d..45a0b96980 100644
--- a/doc/guides/regexdevs/mlx5.rst
+++ b/doc/guides/regexdevs/mlx5.rst
@@ -35,6 +35,11 @@ be specified as device parameter. The RegEx device can be probed and used with
 other Mellanox devices, by adding more options in the class.
 For example: ``class=net:regex`` will probe both the net PMD and the RegEx PMD.
 
+Features
+--------
+
+- Multi segments mbuf support.
+
 Supported NICs
 --------------
 
diff --git a/doc/guides/rel_notes/release_21_05.rst b/doc/guides/rel_notes/release_21_05.rst
index 3c76148b11..c3d6b8e8ae 100644
--- a/doc/guides/rel_notes/release_21_05.rst
+++ b/doc/guides/rel_notes/release_21_05.rst
@@ -119,6 +119,10 @@ New Features
   * Added command to display Rx queue used descriptor count.
     ``show port (port_id) rxq (queue_id) desc used count``
 
+* **Updated Mellanox RegEx PMD.**
+
+  * Added support for multi segments mbuf.
+
 
 Removed Items
 -------------
diff --git a/drivers/regex/mlx5/mlx5_regex.c b/drivers/regex/mlx5/mlx5_regex.c
index ac5b205fa9..82c485e50c 100644
--- a/drivers/regex/mlx5/mlx5_regex.c
+++ b/drivers/regex/mlx5/mlx5_regex.c
@@ -199,6 +199,13 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	}
 	priv->regexdev->dev_ops = &mlx5_regexdev_ops;
 	priv->regexdev->enqueue = mlx5_regexdev_enqueue;
+#ifdef HAVE_MLX5_UMR_IMKEY
+	if (!attr.umr_indirect_mkey_disabled &&
+	    !attr.umr_modify_entity_size_disabled)
+		priv->has_umr = 1;
+	if (priv->has_umr)
+		priv->regexdev->enqueue = mlx5_regexdev_enqueue_gga;
+#endif
 	priv->regexdev->dequeue = mlx5_regexdev_dequeue;
 	priv->regexdev->device = (struct rte_device *)pci_dev;
 	priv->regexdev->data->dev_private = priv;
@@ -213,6 +220,8 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	    rte_errno = ENOMEM;
 		goto error;
 	}
+	DRV_LOG(INFO, "RegEx GGA is %s.",
+		priv->has_umr ? "supported" : "unsupported");
 	return 0;
 
 error:
diff --git a/drivers/regex/mlx5/mlx5_regex.h b/drivers/regex/mlx5/mlx5_regex.h
index a2b3f0d9f3..51a2101e53 100644
--- a/drivers/regex/mlx5/mlx5_regex.h
+++ b/drivers/regex/mlx5/mlx5_regex.h
@@ -15,6 +15,7 @@
 #include <mlx5_common_devx.h>
 
 #include "mlx5_rxp.h"
+#include "mlx5_regex_utils.h"
 
 struct mlx5_regex_sq {
 	uint16_t log_nb_desc; /* Log 2 number of desc for this object. */
@@ -40,6 +41,7 @@ struct mlx5_regex_qp {
 	struct mlx5_regex_job *jobs;
 	struct ibv_mr *metadata;
 	struct ibv_mr *outputs;
+	struct ibv_mr *imkey_addr; /* Indirect mkey array region. */
 	size_t ci, pi;
 	struct mlx5_mr_ctrl mr_ctrl;
 };
@@ -71,8 +73,29 @@ struct mlx5_regex_priv {
 	struct mlx5_mr_share_cache mr_scache; /* Global shared MR cache. */
 	uint8_t is_bf2; /* The device is BF2 device. */
 	uint8_t sq_ts_format; /* Whether SQ supports timestamp formats. */
+	uint8_t has_umr; /* The device supports UMR. */
 };
 
+#ifdef HAVE_IBV_FLOW_DV_SUPPORT
+static inline int
+regex_get_pdn(void *pd, uint32_t *pdn)
+{
+	struct mlx5dv_obj obj;
+	struct mlx5dv_pd pd_info;
+	int ret = 0;
+
+	obj.pd.in = pd;
+	obj.pd.out = &pd_info;
+	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
+	if (ret) {
+		DRV_LOG(DEBUG, "Fail to get PD object info");
+		return ret;
+	}
+	*pdn = pd_info.pdn;
+	return 0;
+}
+#endif
+
 /* mlx5_regex.c */
 int mlx5_regex_start(struct rte_regexdev *dev);
 int mlx5_regex_stop(struct rte_regexdev *dev);
@@ -108,5 +131,6 @@ uint16_t mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 		       struct rte_regex_ops **ops, uint16_t nb_ops);
 uint16_t mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 		       struct rte_regex_ops **ops, uint16_t nb_ops);
-
+uint16_t mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
+		       struct rte_regex_ops **ops, uint16_t nb_ops);
 #endif /* MLX5_REGEX_H */
diff --git a/drivers/regex/mlx5/mlx5_regex_control.c b/drivers/regex/mlx5/mlx5_regex_control.c
index 55fbb419ed..eef0fe579d 100644
--- a/drivers/regex/mlx5/mlx5_regex_control.c
+++ b/drivers/regex/mlx5/mlx5_regex_control.c
@@ -27,6 +27,9 @@
 
 #define MLX5_REGEX_NUM_WQE_PER_PAGE (4096/64)
 
+#define MLX5_REGEX_WQE_LOG_NUM(has_umr, log_desc) \
+		((has_umr) ? ((log_desc) + 2) : (log_desc))
+
 /**
  * Returns the number of qp obj to be created.
  *
@@ -91,26 +94,6 @@ regex_ctrl_create_cq(struct mlx5_regex_priv *priv, struct mlx5_regex_cq *cq)
 	return 0;
 }
 
-#ifdef HAVE_IBV_FLOW_DV_SUPPORT
-static int
-regex_get_pdn(void *pd, uint32_t *pdn)
-{
-	struct mlx5dv_obj obj;
-	struct mlx5dv_pd pd_info;
-	int ret = 0;
-
-	obj.pd.in = pd;
-	obj.pd.out = &pd_info;
-	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
-	if (ret) {
-		DRV_LOG(DEBUG, "Fail to get PD object info");
-		return ret;
-	}
-	*pdn = pd_info.pdn;
-	return 0;
-}
-#endif
-
 /**
  * Destroy the SQ object.
  *
@@ -168,14 +151,16 @@ regex_ctrl_create_sq(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 	int ret;
 
 	sq->log_nb_desc = log_nb_desc;
+	sq->sqn = q_ind;
 	sq->ci = 0;
 	sq->pi = 0;
 	ret = regex_get_pdn(priv->pd, &pd_num);
 	if (ret)
 		return ret;
 	attr.wq_attr.pd = pd_num;
-	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj, log_nb_desc, &attr,
-				  SOCKET_ID_ANY);
+	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj,
+			MLX5_REGEX_WQE_LOG_NUM(priv->has_umr, log_nb_desc),
+			&attr, SOCKET_ID_ANY);
 	if (ret) {
 		DRV_LOG(ERR, "Can't create SQ object.");
 		rte_errno = ENOMEM;
@@ -225,10 +210,18 @@ mlx5_regex_qp_setup(struct rte_regexdev *dev, uint16_t qp_ind,
 
 	qp = &priv->qps[qp_ind];
 	qp->flags = cfg->qp_conf_flags;
-	qp->cq.log_nb_desc = rte_log2_u32(cfg->nb_desc);
-	qp->nb_desc = 1 << qp->cq.log_nb_desc;
+	log_desc = rte_log2_u32(cfg->nb_desc);
+	/*
+	 * UMR mode requires two WQEs (UMR and RegEx WQE) for one descriptor.
+	 * For the CQ, multiply the number of CQEs by 2.
+	 * For the SQ, the UMR and RegEx WQEs for one descriptor consume
+	 * 4 WQEBBs, so multiply the number of WQEs by 4.
+	 */
+	qp->cq.log_nb_desc = log_desc + (!!priv->has_umr);
+	qp->nb_desc = 1 << log_desc;
 	if (qp->flags & RTE_REGEX_QUEUE_PAIR_CFG_OOS_F)
-		qp->nb_obj = regex_ctrl_get_nb_obj(qp->nb_desc);
+		qp->nb_obj = regex_ctrl_get_nb_obj
+			(1 << MLX5_REGEX_WQE_LOG_NUM(priv->has_umr, log_desc));
 	else
 		qp->nb_obj = 1;
 	qp->sqs = rte_malloc(NULL,
diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index beaea7b63f..4f9402c583 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -32,6 +32,15 @@
 #define MLX5_REGEX_WQE_GATHER_OFFSET 32
 #define MLX5_REGEX_WQE_SCATTER_OFFSET 48
 #define MLX5_REGEX_METADATA_OFF 32
+#define MLX5_REGEX_UMR_WQE_SIZE 192
+/* The maximum KLMs can be added to one UMR indirect mkey. */
+#define MLX5_REGEX_MAX_KLM_NUM 128
+/* The KLM array size for one job. */
+#define MLX5_REGEX_KLMS_SIZE \
+	((MLX5_REGEX_MAX_KLM_NUM) * sizeof(struct mlx5_klm))
+/* In WQE set mode, the pi should be quarter of the MLX5_REGEX_MAX_WQE_INDEX. */
+#define MLX5_REGEX_UMR_SQ_PI_IDX(pi, ops) \
+	(((pi) + (ops)) & (MLX5_REGEX_MAX_WQE_INDEX >> 2))
 
 static inline uint32_t
 sq_size_get(struct mlx5_regex_sq *sq)
@@ -49,6 +58,8 @@ struct mlx5_regex_job {
 	uint64_t user_id;
 	volatile uint8_t *output;
 	volatile uint8_t *metadata;
+	struct mlx5_klm *imkey_array; /* Indirect mkey's KLM array. */
+	struct mlx5_devx_obj *imkey; /* UMR WQE's indirect mkey. */
 } __rte_cached_aligned;
 
 static inline void
@@ -99,12 +110,13 @@ set_wqe_ctrl_seg(struct mlx5_wqe_ctrl_seg *seg, uint16_t pi, uint8_t opcode,
 }
 
 static inline void
-prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
-	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
-	 struct mlx5_regex_job *job)
+__prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
+	   struct rte_regex_ops *op, struct mlx5_regex_job *job,
+	   size_t pi, struct mlx5_klm *klm)
 {
-	size_t wqe_offset = (sq->pi & (sq_size_get(sq) - 1)) * MLX5_SEND_WQE_BB;
-	uint32_t lkey;
+	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
+			    (MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
+			    (priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
 	uint16_t group0 = op->req_flags & RTE_REGEX_OPS_REQ_GROUP_ID0_VALID_F ?
 				op->group_id0 : 0;
 	uint16_t group1 = op->req_flags & RTE_REGEX_OPS_REQ_GROUP_ID1_VALID_F ?
@@ -122,14 +134,11 @@ prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 			       RTE_REGEX_OPS_REQ_GROUP_ID2_VALID_F |
 			       RTE_REGEX_OPS_REQ_GROUP_ID3_VALID_F)))
 		group0 = op->group_id0;
-	lkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
-				  &priv->mr_scache, &qp->mr_ctrl,
-				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
-				  !!(op->mbuf->ol_flags & EXT_ATTACHED_MBUF));
 	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
 	int ds = 4; /*  ctrl + meta + input + output */
 
-	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe, sq->pi,
+	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe,
+			 (priv->has_umr ? (pi * 4 + 3) : pi),
 			 MLX5_OPCODE_MMO, MLX5_OPC_MOD_MMO_REGEX,
 			 sq->sq_obj.sq->id, 0, ds, 0, 0);
 	set_regex_ctrl_seg(wqe + 12, 0, group0, group1, group2, group3,
@@ -137,36 +146,54 @@ prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 	struct mlx5_wqe_data_seg *input_seg =
 		(struct mlx5_wqe_data_seg *)(wqe +
 					     MLX5_REGEX_WQE_GATHER_OFFSET);
-	input_seg->byte_count =
-		rte_cpu_to_be_32(rte_pktmbuf_data_len(op->mbuf));
-	input_seg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(op->mbuf,
-							    uintptr_t));
-	input_seg->lkey = lkey;
+	input_seg->byte_count = rte_cpu_to_be_32(klm->byte_count);
+	input_seg->addr = rte_cpu_to_be_64(klm->address);
+	input_seg->lkey = klm->mkey;
 	job->user_id = op->user_id;
+}
+
+static inline void
+prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
+	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
+	 struct mlx5_regex_job *job)
+{
+	struct mlx5_klm klm;
+
+	klm.byte_count = rte_pktmbuf_data_len(op->mbuf);
+	klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
+				  &priv->mr_scache, &qp->mr_ctrl,
+				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
+				  !!(op->mbuf->ol_flags & EXT_ATTACHED_MBUF));
+	klm.address = rte_pktmbuf_mtod(op->mbuf, uintptr_t);
+	__prep_one(priv, sq, op, job, sq->pi, &klm);
 	sq->db_pi = sq->pi;
 	sq->pi = (sq->pi + 1) & MLX5_REGEX_MAX_WQE_INDEX;
 }
 
 static inline void
-send_doorbell(struct mlx5dv_devx_uar *uar, struct mlx5_regex_sq *sq)
+send_doorbell(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq)
 {
+	struct mlx5dv_devx_uar *uar = priv->uar;
 	size_t wqe_offset = (sq->db_pi & (sq_size_get(sq) - 1)) *
-		MLX5_SEND_WQE_BB;
+		(MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
+		(priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
 	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
-	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
+	/* OR into fm_ce_se instead of assigning, so the fence bit is kept. */
+	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se |= MLX5_WQE_CTRL_CQ_UPDATE;
 	uint64_t *doorbell_addr =
 		(uint64_t *)((uint8_t *)uar->base_addr + 0x800);
 	rte_io_wmb();
-	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((sq->db_pi + 1) &
-						 MLX5_REGEX_MAX_WQE_INDEX);
+	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((priv->has_umr ?
+					(sq->db_pi * 4 + 3) : sq->db_pi) &
+					MLX5_REGEX_MAX_WQE_INDEX);
 	rte_wmb();
 	*doorbell_addr = *(volatile uint64_t *)wqe;
 	rte_wmb();
 }
 
 static inline int
-can_send(struct mlx5_regex_sq *sq) {
-	return ((uint16_t)(sq->pi - sq->ci) < sq_size_get(sq));
+get_free(struct mlx5_regex_sq *sq) {
+	return (sq_size_get(sq) - (uint16_t)(sq->pi - sq->ci));
 }
 
 static inline uint32_t
@@ -174,6 +201,211 @@ job_id_get(uint32_t qid, size_t sq_size, size_t index) {
 	return qid * sq_size + (index & (sq_size - 1));
 }
 
+#ifdef HAVE_MLX5_UMR_IMKEY
+static inline int
+mkey_klm_available(struct mlx5_klm *klm, uint32_t pos, uint32_t new)
+{
+	return (klm && ((pos + new) <= MLX5_REGEX_MAX_KLM_NUM));
+}
+
+static inline void
+complete_umr_wqe(struct mlx5_regex_qp *qp, struct mlx5_regex_sq *sq,
+		 struct mlx5_regex_job *mkey_job,
+		 size_t umr_index, uint32_t klm_size, uint32_t total_len)
+{
+	size_t wqe_offset = (umr_index & (sq_size_get(sq) - 1)) *
+		(MLX5_SEND_WQE_BB * 4);
+	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg *)((uint8_t *)
+				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
+	struct mlx5_wqe_umr_ctrl_seg *ucseg =
+				(struct mlx5_wqe_umr_ctrl_seg *)(wqe + 1);
+	struct mlx5_wqe_mkey_context_seg *mkc =
+				(struct mlx5_wqe_mkey_context_seg *)(ucseg + 1);
+	struct mlx5_klm *iklm = (struct mlx5_klm *)(mkc + 1);
+	uint16_t klm_align = RTE_ALIGN(klm_size, 4);
+
+	memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
+	/* Set WQE control seg. Non-inline KLM UMR WQE size must be 9 WQE_DS. */
+	set_wqe_ctrl_seg(wqe, (umr_index * 4), MLX5_OPCODE_UMR,
+			 0, sq->sq_obj.sq->id, 0, 9, 0,
+			 rte_cpu_to_be_32(mkey_job->imkey->id));
+	/* Set UMR WQE control seg. */
+	ucseg->mkey_mask |= rte_cpu_to_be_64(MLX5_WQE_UMR_CTRL_MKEY_MASK_LEN |
+				MLX5_WQE_UMR_CTRL_FLAG_TRNSLATION_OFFSET |
+				MLX5_WQE_UMR_CTRL_MKEY_MASK_ACCESS_LOCAL_WRITE);
+	ucseg->klm_octowords = rte_cpu_to_be_16(klm_align);
+	/* Set mkey context seg. */
+	mkc->len = rte_cpu_to_be_64(total_len);
+	mkc->qpn_mkey = rte_cpu_to_be_32(0xffffff00 |
+					(mkey_job->imkey->id & 0xff));
+	/* Set UMR pointer to data seg. */
+	iklm->address = rte_cpu_to_be_64
+				((uintptr_t)((char *)mkey_job->imkey_array));
+	iklm->mkey = rte_cpu_to_be_32(qp->imkey_addr->lkey);
+	iklm->byte_count = rte_cpu_to_be_32(klm_align);
+	/* Clear the padding memory. */
+	memset((uint8_t *)&mkey_job->imkey_array[klm_size], 0,
+	       sizeof(struct mlx5_klm) * (klm_align - klm_size));
+
+	/* Add the following RegEx WQE with fence. */
+	wqe = (struct mlx5_wqe_ctrl_seg *)
+				(((uint8_t *)wqe) + MLX5_REGEX_UMR_WQE_SIZE);
+	wqe->fm_ce_se |= MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE;
+}
+
+static inline void
+prep_nop_regex_wqe_set(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
+		       struct rte_regex_ops *op, struct mlx5_regex_job *job,
+		       size_t pi, struct mlx5_klm *klm)
+{
+	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
+			    (MLX5_SEND_WQE_BB << 2);
+	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg *)((uint8_t *)
+				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
+
+	/* Clear the WQE memory used as UMR WQE previously. */
+	if ((rte_be_to_cpu_32(wqe->opmod_idx_opcode) & 0xff) != MLX5_OPCODE_NOP)
+		memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
+	/* UMR WQE size is 9 DS, align nop WQE to 3 WQEBBS(12 DS). */
+	set_wqe_ctrl_seg(wqe, pi * 4, MLX5_OPCODE_NOP, 0, sq->sq_obj.sq->id,
+			 0, 12, 0, 0);
+	__prep_one(priv, sq, op, job, pi, klm);
+}
+
+static inline void
+prep_regex_umr_wqe_set(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
+	 struct mlx5_regex_sq *sq, struct rte_regex_ops **op, size_t nb_ops)
+{
+	struct mlx5_regex_job *job = NULL;
+	size_t sqid = sq->sqn, mkey_job_id = 0;
+	size_t left_ops = nb_ops;
+	uint32_t klm_num = 0, len;
+	struct mlx5_klm *mkey_klm = NULL;
+	struct mlx5_klm klm;
+
+	sqid = sq->sqn;
+	while (left_ops--)
+		rte_prefetch0(op[left_ops]);
+	left_ops = nb_ops;
+	/*
+	 * Build the WQE sets in reverse order. A burst may consume
+	 * multiple mkeys, and building in forward order makes the last
+	 * mkey index hard to address, since the last RegEx WQE's index
+	 * is only known once building finishes.
+	 */
+	while (left_ops--) {
+		struct rte_mbuf *mbuf = op[left_ops]->mbuf;
+		size_t pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, left_ops);
+
+		if (mbuf->nb_segs > 1) {
+			size_t scatter_size = 0;
+
+			if (!mkey_klm_available(mkey_klm, klm_num,
+						mbuf->nb_segs)) {
+				/*
+				 * The mkey's KLM is full, create the UMR
+				 * WQE in the next WQE set.
+				 */
+				if (mkey_klm)
+					complete_umr_wqe(qp, sq,
+						&qp->jobs[mkey_job_id],
+						MLX5_REGEX_UMR_SQ_PI_IDX(pi, 1),
+						klm_num, len);
+				/*
+				 * Get the indirect mkey and KLM array index
+				 * from the last WQE set.
+				 */
+				mkey_job_id = job_id_get(sqid,
+							 sq_size_get(sq), pi);
+				mkey_klm = qp->jobs[mkey_job_id].imkey_array;
+				klm_num = 0;
+				len = 0;
+			}
+			/* Build RegEx WQE's data segment KLM. */
+			klm.address = len;
+			klm.mkey = rte_cpu_to_be_32
+					(qp->jobs[mkey_job_id].imkey->id);
+			while (mbuf) {
+				/* Build indirect mkey seg's KLM. */
+				mkey_klm->mkey = mlx5_mr_addr2mr_bh(priv->pd,
+					NULL, &priv->mr_scache, &qp->mr_ctrl,
+					rte_pktmbuf_mtod(mbuf, uintptr_t),
+					!!(mbuf->ol_flags & EXT_ATTACHED_MBUF));
+				mkey_klm->address = rte_cpu_to_be_64
+					(rte_pktmbuf_mtod(mbuf, uintptr_t));
+				mkey_klm->byte_count = rte_cpu_to_be_32
+						(rte_pktmbuf_data_len(mbuf));
+				/*
+				 * Save the mbuf's total size for RegEx data
+				 * segment.
+				 */
+				scatter_size += rte_pktmbuf_data_len(mbuf);
+				mkey_klm++;
+				klm_num++;
+				mbuf = mbuf->next;
+			}
+			len += scatter_size;
+			klm.byte_count = scatter_size;
+		} else {
+			/* The single mbuf case. Build the KLM directly. */
+			klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, NULL,
+					&priv->mr_scache, &qp->mr_ctrl,
+					rte_pktmbuf_mtod(mbuf, uintptr_t),
+					!!(mbuf->ol_flags & EXT_ATTACHED_MBUF));
+			klm.address = rte_pktmbuf_mtod(mbuf, uintptr_t);
+			klm.byte_count = rte_pktmbuf_data_len(mbuf);
+		}
+		job = &qp->jobs[job_id_get(sqid, sq_size_get(sq), pi)];
+		/*
+		 * Build the nop + RegEx WQE set by default. The first nop
+		 * WQE is later updated to a UMR WQE if scattered mbufs exist.
+		 */
+		prep_nop_regex_wqe_set(priv, sq, op[left_ops], job, pi, &klm);
+	}
+	/*
+	 * Scattered mbufs have been added to the KLM array. Complete the
+	 * UMR WQE by updating the first nop WQE to a UMR WQE.
+	 */
+	if (mkey_klm)
+		complete_umr_wqe(qp, sq, &qp->jobs[mkey_job_id], sq->pi,
+				 klm_num, len);
+	sq->db_pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops - 1);
+	sq->pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops);
+}
+
+uint16_t
+mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
+			  struct rte_regex_ops **ops, uint16_t nb_ops)
+{
+	struct mlx5_regex_priv *priv = dev->data->dev_private;
+	struct mlx5_regex_qp *queue = &priv->qps[qp_id];
+	struct mlx5_regex_sq *sq;
+	size_t sqid, nb_left = nb_ops, nb_desc;
+
+	while ((sqid = ffs(queue->free_sqs))) {
+		sqid--; /* ffs returns 1 for bit 0 */
+		sq = &queue->sqs[sqid];
+		nb_desc = get_free(sq);
+		if (nb_desc) {
+			/* The number of ops handled can't exceed nb_ops. */
+			if (nb_desc > nb_left)
+				nb_desc = nb_left;
+			else
+				queue->free_sqs &= ~(1 << sqid);
+			prep_regex_umr_wqe_set(priv, queue, sq, ops, nb_desc);
+			send_doorbell(priv, sq);
+			nb_left -= nb_desc;
+		}
+		if (!nb_left)
+			break;
+		ops += nb_desc;
+	}
+	nb_ops -= nb_left;
+	queue->pi += nb_ops;
+	return nb_ops;
+}
+#endif
+
 uint16_t
 mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 		      struct rte_regex_ops **ops, uint16_t nb_ops)
@@ -186,17 +418,17 @@ mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 	while ((sqid = ffs(queue->free_sqs))) {
 		sqid--; /* ffs returns 1 for bit 0 */
 		sq = &queue->sqs[sqid];
-		while (can_send(sq)) {
+		while (get_free(sq)) {
 			job_id = job_id_get(sqid, sq_size_get(sq), sq->pi);
 			prep_one(priv, queue, sq, ops[i], &queue->jobs[job_id]);
 			i++;
 			if (unlikely(i == nb_ops)) {
-				send_doorbell(priv->uar, sq);
+				send_doorbell(priv, sq);
 				goto out;
 			}
 		}
 		queue->free_sqs &= ~(1 << sqid);
-		send_doorbell(priv->uar, sq);
+		send_doorbell(priv, sq);
 	}
 
 out:
@@ -308,6 +540,10 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 			  MLX5_REGEX_MAX_WQE_INDEX;
 		size_t sqid = cqe->rsvd3[2];
 		struct mlx5_regex_sq *sq = &queue->sqs[sqid];
+
+		/* In UMR mode the WQE counter moves per WQE set (4 WQEBBs). */
+		if (priv->has_umr)
+			wq_counter >>= 2;
 		while (sq->ci != wq_counter) {
 			if (unlikely(i == nb_ops)) {
 				/* Return without updating cq->ci */
@@ -316,7 +552,9 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 			uint32_t job_id = job_id_get(sqid, sq_size_get(sq),
 						     sq->ci);
 			extract_result(ops[i], &queue->jobs[job_id]);
-			sq->ci = (sq->ci + 1) & MLX5_REGEX_MAX_WQE_INDEX;
+			sq->ci = (sq->ci + 1) & (priv->has_umr ?
+				 (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
+				  MLX5_REGEX_MAX_WQE_INDEX);
 			i++;
 		}
 		cq->ci = (cq->ci + 1) & 0xffffff;
@@ -331,7 +569,7 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 }
 
 static void
-setup_sqs(struct mlx5_regex_qp *queue)
+setup_sqs(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *queue)
 {
 	size_t sqid, entry;
 	uint32_t job_id;
@@ -342,6 +580,14 @@ setup_sqs(struct mlx5_regex_qp *queue)
 			job_id = sqid * sq_size_get(sq) + entry;
 			struct mlx5_regex_job *job = &queue->jobs[job_id];
 
+			/* Fill UMR WQE with NOP in advance. */
+			if (priv->has_umr) {
+				set_wqe_ctrl_seg
+					((struct mlx5_wqe_ctrl_seg *)wqe,
+					 entry * 2, MLX5_OPCODE_NOP, 0,
+					 sq->sq_obj.sq->id, 0, 12, 0, 0);
+				wqe += MLX5_REGEX_UMR_WQE_SIZE;
+			}
 			set_metadata_seg((struct mlx5_wqe_metadata_seg *)
 					 (wqe + MLX5_REGEX_WQE_METADATA_OFFSET),
 					 0, queue->metadata->lkey,
@@ -358,8 +604,9 @@ setup_sqs(struct mlx5_regex_qp *queue)
 }
 
 static int
-setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
+setup_buffers(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp)
 {
+	struct ibv_pd *pd = priv->pd;
 	uint32_t i;
 	int err;
 
@@ -395,6 +642,24 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
 		goto err_output;
 	}
 
+	if (priv->has_umr) {
+		ptr = rte_calloc(__func__, qp->nb_desc, MLX5_REGEX_KLMS_SIZE,
+				 MLX5_REGEX_KLMS_SIZE);
+		if (!ptr) {
+			err = -ENOMEM;
+			goto err_imkey;
+		}
+		qp->imkey_addr = mlx5_glue->reg_mr(pd, ptr,
+					MLX5_REGEX_KLMS_SIZE * qp->nb_desc,
+					IBV_ACCESS_LOCAL_WRITE);
+		if (!qp->imkey_addr) {
+			rte_free(ptr);
+			DRV_LOG(ERR, "Failed to register imkey KLM buffer");
+			err = -EINVAL;
+			goto err_imkey;
+		}
+	}
+
 	/* distribute buffers to jobs */
 	for (i = 0; i < qp->nb_desc; i++) {
 		qp->jobs[i].output =
@@ -403,9 +668,18 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
 		qp->jobs[i].metadata =
 			(uint8_t *)qp->metadata->addr +
 			(i % qp->nb_desc) * MLX5_REGEX_METADATA_SIZE;
+		if (qp->imkey_addr)
+			qp->jobs[i].imkey_array = (struct mlx5_klm *)
+				qp->imkey_addr->addr +
+				(i % qp->nb_desc) * MLX5_REGEX_MAX_KLM_NUM;
 	}
+
 	return 0;
 
+err_imkey:
+	ptr = qp->outputs->addr;
+	rte_free(ptr);
+	mlx5_glue->dereg_mr(qp->outputs);
 err_output:
 	ptr = qp->metadata->addr;
 	rte_free(ptr);
@@ -417,23 +691,57 @@ int
 mlx5_regexdev_setup_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
 {
 	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
-	int err;
+	struct mlx5_klm klm = { 0 };
+	struct mlx5_devx_mkey_attr attr = {
+		.klm_array = &klm,
+		.klm_num = 1,
+		.umr_en = 1,
+	};
+	uint32_t i;
+	int err = 0;
 
 	qp->jobs = rte_calloc(__func__, qp->nb_desc, sizeof(*qp->jobs), 64);
 	if (!qp->jobs)
 		return -ENOMEM;
-	err = setup_buffers(qp, priv->pd);
+	err = setup_buffers(priv, qp);
 	if (err) {
 		rte_free(qp->jobs);
 		return err;
 	}
-	setup_sqs(qp);
-	return 0;
+
+	setup_sqs(priv, qp);
+
+	if (priv->has_umr) {
+#ifdef HAVE_IBV_FLOW_DV_SUPPORT
+		if (regex_get_pdn(priv->pd, &attr.pd)) {
+			err = -rte_errno;
+			DRV_LOG(ERR, "Failed to get pdn.");
+			mlx5_regexdev_teardown_fastpath(priv, qp_id);
+			return err;
+		}
+#endif
+		for (i = 0; i < qp->nb_desc; i++) {
+			attr.klm_num = MLX5_REGEX_MAX_KLM_NUM;
+			attr.klm_array = qp->jobs[i].imkey_array;
+			qp->jobs[i].imkey = mlx5_devx_cmd_mkey_create(priv->ctx,
+								      &attr);
+			if (!qp->jobs[i].imkey) {
+				err = -rte_errno;
+				DRV_LOG(ERR, "Failed to allocate imkey.");
+				mlx5_regexdev_teardown_fastpath(priv, qp_id);
+			}
+		}
+	}
+	return err;
 }
 
 static void
 free_buffers(struct mlx5_regex_qp *qp)
 {
+	if (qp->imkey_addr) {
+		mlx5_glue->dereg_mr(qp->imkey_addr);
+		rte_free(qp->imkey_addr->addr);
+	}
 	if (qp->metadata) {
 		mlx5_glue->dereg_mr(qp->metadata);
 		rte_free(qp->metadata->addr);
@@ -448,8 +756,14 @@ void
 mlx5_regexdev_teardown_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
 {
 	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
+	uint32_t i;
 
 	if (qp) {
+		for (i = 0; i < qp->nb_desc; i++) {
+			if (qp->jobs[i].imkey)
+				claim_zero(mlx5_devx_cmd_destroy
+							(qp->jobs[i].imkey));
+		}
 		free_buffers(qp);
 		if (qp->jobs)
 			rte_free(qp->jobs);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v4 3/4] app/test-regex: support scattered mbuf input
  2021-03-31  7:37 ` [dpdk-dev] [PATCH v4 0/4] regex/mlx5: support scattered mbuf Suanming Mou
  2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 1/4] common/mlx5: add user memory registration bits Suanming Mou
  2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 2/4] regex/mlx5: add data path scattered mbuf process Suanming Mou
@ 2021-03-31  7:37   ` Suanming Mou
  2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode Suanming Mou
  3 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-31  7:37 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

This commit adds scattered mbuf input support to the test-regex
application.
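
As a rough illustration of how the input is divided among segments, the
sketch below mirrors the per-segment length calculation used by
regex_create_segmented_mbuf() in this patch. The helper names
seg_len_get() and split_lengths() are hypothetical and exist only for
this example; no mbuf is allocated, only the lengths are computed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Per-segment length: ceil(pkt_len / nb_segs), and at least one byte
 * when the packet is shorter than the requested segment count.
 */
static inline int
seg_len_get(int pkt_len, int nb_segs)
{
	assert(pkt_len >= 1 && nb_segs >= 1);
	return pkt_len >= nb_segs ?
	       (pkt_len / nb_segs + !!(pkt_len % nb_segs)) : 1;
}

/*
 * Fill seg_lens[] with the length of each segment in chain order.
 * Returns the number of segments actually produced, which can be
 * smaller than nb_segs.
 */
static inline int
split_lengths(int pkt_len, int nb_segs, int *seg_lens, int max_segs)
{
	int t_len = seg_len_get(pkt_len, nb_segs);
	int size = pkt_len;
	int i;

	for (i = 0; size > 0 && i < max_segs; i++) {
		seg_lens[i] = size > t_len ? t_len : size;
		size -= seg_lens[i];
	}
	return i;
}
```

With this rounding, a 10-byte job and --nb_segs 3 gives segments of
4, 4 and 2 bytes, while a 9-byte job and --nb_segs 4 gives only three
segments of 3 bytes each, so nb_segs acts as an upper bound on the
chain length rather than an exact count.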

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
---
 app/test-regex/main.c          | 134 +++++++++++++++++++++++++++------
 doc/guides/tools/testregex.rst |   3 +
 2 files changed, 112 insertions(+), 25 deletions(-)

diff --git a/app/test-regex/main.c b/app/test-regex/main.c
index aea4fa6b88..82cffaacfa 100644
--- a/app/test-regex/main.c
+++ b/app/test-regex/main.c
@@ -35,6 +35,7 @@ enum app_args {
 	ARG_NUM_OF_ITERATIONS,
 	ARG_NUM_OF_QPS,
 	ARG_NUM_OF_LCORES,
+	ARG_NUM_OF_MBUF_SEGS,
 };
 
 struct job_ctx {
@@ -70,6 +71,7 @@ struct regex_conf {
 	char *data_buf;
 	long data_len;
 	long job_len;
+	uint32_t nb_segs;
 };
 
 static void
@@ -82,14 +84,15 @@ usage(const char *prog_name)
 		" --perf N: only outputs the performance data\n"
 		" --nb_iter N: number of iteration to run\n"
 		" --nb_qps N: number of queues to use\n"
-		" --nb_lcores N: number of lcores to use\n",
+		" --nb_lcores N: number of lcores to use\n"
+		" --nb_segs N: number of mbuf segments\n",
 		prog_name);
 }
 
 static void
 args_parse(int argc, char **argv, char *rules_file, char *data_file,
 	   uint32_t *nb_jobs, bool *perf_mode, uint32_t *nb_iterations,
-	   uint32_t *nb_qps, uint32_t *nb_lcores)
+	   uint32_t *nb_qps, uint32_t *nb_lcores, uint32_t *nb_segs)
 {
 	char **argvopt;
 	int opt;
@@ -111,6 +114,8 @@ args_parse(int argc, char **argv, char *rules_file, char *data_file,
 		{ "nb_qps", 1, 0, ARG_NUM_OF_QPS},
 		/* Number of lcores. */
 		{ "nb_lcores", 1, 0, ARG_NUM_OF_LCORES},
+		/* Number of mbuf segments. */
+		{ "nb_segs", 1, 0, ARG_NUM_OF_MBUF_SEGS},
 		/* End of options */
 		{ 0, 0, 0, 0 }
 	};
@@ -150,6 +155,9 @@ args_parse(int argc, char **argv, char *rules_file, char *data_file,
 		case ARG_NUM_OF_LCORES:
 			*nb_lcores = atoi(optarg);
 			break;
+		case ARG_NUM_OF_MBUF_SEGS:
+			*nb_segs = atoi(optarg);
+			break;
 		case ARG_HELP:
 			usage("RegEx test app");
 			break;
@@ -302,11 +310,75 @@ extbuf_free_cb(void *addr __rte_unused, void *fcb_opaque __rte_unused)
 {
 }
 
+static inline struct rte_mbuf *
+regex_create_segmented_mbuf(struct rte_mempool *mbuf_pool, int pkt_len,
+		int nb_segs, void *buf) {
+
+	struct rte_mbuf *m = NULL, *mbuf = NULL;
+	uint8_t *dst;
+	char *src = buf;
+	int data_len = 0;
+	int i, size;
+	int t_len;
+
+	if (pkt_len < 1) {
+		printf("Packet size must be 1 or more (is %d)\n", pkt_len);
+		return NULL;
+	}
+
+	if (nb_segs < 1) {
+		printf("Number of segments must be 1 or more (is %d)\n",
+				nb_segs);
+		return NULL;
+	}
+
+	t_len = pkt_len >= nb_segs ? (pkt_len / nb_segs +
+				     !!(pkt_len % nb_segs)) : 1;
+	size = pkt_len;
+
+	/* Create chained mbuf_src and fill it with buf data */
+	for (i = 0; size > 0; i++) {
+
+		m = rte_pktmbuf_alloc(mbuf_pool);
+		if (i == 0)
+			mbuf = m;
+
+		if (m == NULL) {
+			printf("Cannot create segment for source mbuf\n");
+			goto fail;
+		}
+
+		data_len = size > t_len ? t_len : size;
+		memset(rte_pktmbuf_mtod(m, uint8_t *), 0,
+				rte_pktmbuf_tailroom(m));
+		memcpy(rte_pktmbuf_mtod(m, uint8_t *), src, data_len);
+		dst = (uint8_t *)rte_pktmbuf_append(m, data_len);
+		if (dst == NULL) {
+			printf("Cannot append %d bytes to the mbuf\n",
+					data_len);
+			goto fail;
+		}
+
+		if (mbuf != m)
+			rte_pktmbuf_chain(mbuf, m);
+		src += data_len;
+		size -= data_len;
+
+	}
+	return mbuf;
+
+fail:
+	if (mbuf)
+		rte_pktmbuf_free(mbuf);
+	return NULL;
+}
+
 static int
 run_regex(void *args)
 {
 	struct regex_conf *rgxc = args;
 	uint32_t nb_jobs = rgxc->nb_jobs;
+	uint32_t nb_segs = rgxc->nb_segs;
 	uint32_t nb_iterations = rgxc->nb_iterations;
 	uint8_t nb_max_matches = rgxc->nb_max_matches;
 	uint32_t nb_qps = rgxc->nb_qps;
@@ -338,8 +410,12 @@ run_regex(void *args)
 	snprintf(mbuf_pool,
 		 sizeof(mbuf_pool),
 		 "mbuf_pool_%2u", qp_id_base);
-	mbuf_mp = rte_pktmbuf_pool_create(mbuf_pool, nb_jobs * nb_qps, 0,
-			0, MBUF_SIZE, rte_socket_id());
+	mbuf_mp = rte_pktmbuf_pool_create(mbuf_pool,
+			rte_align32pow2(nb_jobs * nb_qps * nb_segs),
+			0, 0, (nb_segs == 1) ? MBUF_SIZE :
+			(rte_align32pow2(job_len) / nb_segs +
+			RTE_PKTMBUF_HEADROOM),
+			rte_socket_id());
 	if (mbuf_mp == NULL) {
 		printf("Error, can't create memory pool\n");
 		return -ENOMEM;
@@ -375,8 +451,19 @@ run_regex(void *args)
 			goto end;
 		}
 
+		if (clone_buf(data_buf, &buf, data_len)) {
+			printf("Error, can't clone buf.\n");
+			res = -EXIT_FAILURE;
+			goto end;
+		}
+
+		/* Assign each mbuf with the data to handle. */
+		actual_jobs = 0;
+		pos = 0;
 		/* Allocate the jobs and assign each job with an mbuf. */
-		for (i = 0; i < nb_jobs; i++) {
+		for (i = 0; (pos < data_len) && (i < nb_jobs) ; i++) {
+			long act_job_len = RTE_MIN(job_len, data_len - pos);
+
 			ops[i] = rte_malloc(NULL, sizeof(*ops[0]) +
 					nb_max_matches *
 					sizeof(struct rte_regexdev_match), 0);
@@ -386,30 +473,26 @@ run_regex(void *args)
 				res = -ENOMEM;
 				goto end;
 			}
-			ops[i]->mbuf = rte_pktmbuf_alloc(mbuf_mp);
+			if (nb_segs > 1) {
+				ops[i]->mbuf = regex_create_segmented_mbuf
+							(mbuf_mp, act_job_len,
+							 nb_segs, &buf[pos]);
+			} else {
+				ops[i]->mbuf = rte_pktmbuf_alloc(mbuf_mp);
+				if (ops[i]->mbuf) {
+					rte_pktmbuf_attach_extbuf(ops[i]->mbuf,
+					&buf[pos], 0, act_job_len, &shinfo);
+					ops[i]->mbuf->data_len = job_len;
+					ops[i]->mbuf->pkt_len = act_job_len;
+				}
+			}
 			if (!ops[i]->mbuf) {
-				printf("Error, can't attach mbuf.\n");
+				printf("Error, can't add mbuf.\n");
 				res = -ENOMEM;
 				goto end;
 			}
-		}
 
-		if (clone_buf(data_buf, &buf, data_len)) {
-			printf("Error, can't clone buf.\n");
-			res = -EXIT_FAILURE;
-			goto end;
-		}
-
-		/* Assign each mbuf with the data to handle. */
-		actual_jobs = 0;
-		pos = 0;
-		for (i = 0; (pos < data_len) && (i < nb_jobs) ; i++) {
-			long act_job_len = RTE_MIN(job_len, data_len - pos);
-			rte_pktmbuf_attach_extbuf(ops[i]->mbuf, &buf[pos], 0,
-					act_job_len, &shinfo);
 			jobs_ctx[i].mbuf = ops[i]->mbuf;
-			ops[i]->mbuf->data_len = job_len;
-			ops[i]->mbuf->pkt_len = act_job_len;
 			ops[i]->user_id = i;
 			ops[i]->group_id0 = 1;
 			pos += act_job_len;
@@ -612,7 +695,7 @@ main(int argc, char **argv)
 	char *data_buf;
 	long data_len;
 	long job_len;
-	uint32_t nb_lcores = 1;
+	uint32_t nb_lcores = 1, nb_segs = 1;
 	struct regex_conf *rgxc;
 	uint32_t i;
 	struct qps_per_lcore *qps_per_lcore;
@@ -626,7 +709,7 @@ main(int argc, char **argv)
 	if (argc > 1)
 		args_parse(argc, argv, rules_file, data_file, &nb_jobs,
 				&perf_mode, &nb_iterations, &nb_qps,
-				&nb_lcores);
+				&nb_lcores, &nb_segs);
 
 	if (nb_qps == 0)
 		rte_exit(EXIT_FAILURE, "Number of QPs must be greater than 0\n");
@@ -656,6 +739,7 @@ main(int argc, char **argv)
 	for (i = 0; i < nb_lcores; i++) {
 		rgxc[i] = (struct regex_conf){
 			.nb_jobs = nb_jobs,
+			.nb_segs = nb_segs,
 			.perf_mode = perf_mode,
 			.nb_iterations = nb_iterations,
 			.nb_max_matches = nb_max_matches,
diff --git a/doc/guides/tools/testregex.rst b/doc/guides/tools/testregex.rst
index a59acd919f..cdb1ffd6ee 100644
--- a/doc/guides/tools/testregex.rst
+++ b/doc/guides/tools/testregex.rst
@@ -68,6 +68,9 @@ Application Options
 ``--nb_iter N``
   number of iteration to run
 
+``--nb_segs N``
+  number of mbuf segments
+
 ``--help``
   print application options
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v4 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode
  2021-03-31  7:37 ` [dpdk-dev] [PATCH v4 0/4] regex/mlx5: support scattered mbuf Suanming Mou
                     ` (2 preceding siblings ...)
  2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 3/4] app/test-regex: support scattered mbuf input Suanming Mou
@ 2021-03-31  7:37   ` Suanming Mou
  3 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-03-31  7:37 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland, John Hurley

From: John Hurley <jhurley@nvidia.com>

A recent change adds scattered mbuf and UMR support for regex. Part of
that change makes the pi and ci counters of the regex_sq, in UMR mode, a
quarter of their length in non-UMR mode, effectively moving them from 16
bits to 14. The new get_free function casts the difference between pi
and ci to a 16-bit value when calculating the free send queue entries,
accounting for any wrapping when pi has looped back to 0 but ci has not
yet. However, with 14-bit counters still cast to 16 bits, the function
can now return corrupted, large values.

Modify the get_free function to take the has_umr flag and, accordingly,
account for wrapping on either a 14-bit or a 16-bit pi/ci difference.
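
The wrap problem can be reproduced outside the driver. The sketch below
is only an illustration: the SKETCH_* constants and the two helper
functions are hypothetical stand-ins for MLX5_REGEX_MAX_WQE_INDEX and
get_free(), and pi/ci are modelled as plain 32-bit values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's index masks. */
#define SKETCH_MAX_WQE_INDEX	0xffffu				/* 16 bits */
#define SKETCH_UMR_INDEX_MASK	(SKETCH_MAX_WQE_INDEX >> 2)	/* 14 bits */

/* Old behaviour: the pi/ci difference is always truncated to 16 bits. */
static inline uint32_t
free_slots_cast16(uint32_t sq_size, uint32_t pi, uint32_t ci)
{
	return sq_size - (uint16_t)(pi - ci);
}

/* Fixed behaviour: the mask follows the width the counters wrap at. */
static inline uint32_t
free_slots_masked(uint32_t sq_size, uint32_t pi, uint32_t ci, int has_umr)
{
	uint32_t mask = has_umr ? SKETCH_UMR_INDEX_MASK :
				  SKETCH_MAX_WQE_INDEX;

	assert(sq_size <= mask + 1);
	return sq_size - ((pi - ci) & mask);
}
```

For example, with a 64-entry queue in UMR mode, ci = 0x3ffe and pi
wrapped around to 2 means four entries are in flight. The 16-bit cast
turns the difference into 0xc004 and the subtraction underflows, whereas
the 14-bit mask yields 4 and leaves 60 free entries.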

Fixes: 017f097021a6 ("regex/mlx5: add data path scattered mbuf process")
Signed-off-by: John Hurley <jhurley@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
---
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index 4f9402c583..b57e7d7794 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -192,8 +192,10 @@ send_doorbell(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq)
 }
 
 static inline int
-get_free(struct mlx5_regex_sq *sq) {
-	return (sq_size_get(sq) - (uint16_t)(sq->pi - sq->ci));
+get_free(struct mlx5_regex_sq *sq, uint8_t has_umr) {
+	return (sq_size_get(sq) - ((sq->pi - sq->ci) &
+			(has_umr ? (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
+			MLX5_REGEX_MAX_WQE_INDEX)));
 }
 
 static inline uint32_t
@@ -385,7 +387,7 @@ mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
 	while ((sqid = ffs(queue->free_sqs))) {
 		sqid--; /* ffs returns 1 for bit 0 */
 		sq = &queue->sqs[sqid];
-		nb_desc = get_free(sq);
+		nb_desc = get_free(sq, priv->has_umr);
 		if (nb_desc) {
 			/* The ops be handled can't exceed nb_ops. */
 			if (nb_desc > nb_left)
@@ -418,7 +420,7 @@ mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 	while ((sqid = ffs(queue->free_sqs))) {
 		sqid--; /* ffs returns 1 for bit 0 */
 		sq = &queue->sqs[sqid];
-		while (get_free(sq)) {
+		while (get_free(sq, priv->has_umr)) {
 			job_id = job_id_get(sqid, sq_size_get(sq), sq->pi);
 			prep_one(priv, queue, sq, ops[i], &queue->jobs[job_id]);
 			i++;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v3 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode
  2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode Suanming Mou
@ 2021-04-06 16:22     ` Thomas Monjalon
  2021-04-07  1:00       ` Suanming Mou
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Monjalon @ 2021-04-06 16:22 UTC (permalink / raw)
  To: John Hurley, Suanming Mou; +Cc: orika, dev, viacheslavo, matan, rasland

30/03/2021 03:39, Suanming Mou:
> From: John Hurley <jhurley@nvidia.com>
> 
> A recent change adds support for scattered mbuf and UMR support for regex.
> Part of this commit makes the pi and ci counters of the regex_sq a quarter
> of the length in non umr mode, effectively moving them from 16 bits to
> 14. The new get_free method casts the difference in pi and ci to a 16 bit
> value when calculating the free send queues, accounting for any wrapping
> when pi has looped back to 0 but ci has not yet. However, the move to 14
> bits while still casting to 16 can now lead to corrupted, large values
> returned.
> 
> Modify the get_free function to take in the has_umr flag and, accordingly,
> account for wrapping on either 14 or 16 bit pi/ci difference.
> 
> Fixes: d55c9f637263 ("regex/mlx5: add data path scattered mbuf process")

It is fixing a patch in this series, right?
Why not squashing them?




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v3 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode
  2021-04-06 16:22     ` Thomas Monjalon
@ 2021-04-07  1:00       ` Suanming Mou
  2021-04-07  7:11         ` Thomas Monjalon
  0 siblings, 1 reply; 36+ messages in thread
From: Suanming Mou @ 2021-04-07  1:00 UTC (permalink / raw)
  To: NBU-Contact-Thomas Monjalon, John Hurley
  Cc: Ori Kam, dev, Slava Ovsiienko, Matan Azrad, Raslan Darawsheh


> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Wednesday, April 7, 2021 12:23 AM
> To: John Hurley <jhurley@nvidia.com>; Suanming Mou
> <suanmingm@nvidia.com>
> Cc: Ori Kam <orika@nvidia.com>; dev@dpdk.org; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; Raslan
> Darawsheh <rasland@nvidia.com>
> Subject: Re: [dpdk-dev] [PATCH v3 4/4] regex/mlx5: prevent wrong calculation
> of free sqs in umr mode
> 
> 30/03/2021 03:39, Suanming Mou:
> > From: John Hurley <jhurley@nvidia.com>
> >
> > A recent change adds support for scattered mbuf and UMR support for regex.
> > Part of this commit makes the pi and ci counters of the regex_sq a
> > quarter of the length in non umr mode, effectively moving them from 16
> > bits to 14. The new get_free method casts the difference in pi and ci
> > to a 16 bit value when calculating the free send queues, accounting
> > for any wrapping when pi has looped back to 0 but ci has not yet.
> > However, the move to 14 bits while still casting to 16 can now lead to
> > corrupted, large values returned.
> >
> > Modify the get_free function to take in the has_umr flag and,
> > accordingly, account for wrapping on either 14 or 16 bit pi/ci difference.
> >
> > Fixes: d55c9f637263 ("regex/mlx5: add data path scattered mbuf
> > process")
> 
> It is fixing a patch in this series, right?
> Why not squashing them?

Yes, this is a fix for this series.
This fix was done by John when he tested the code, so I put it as an individual one.
Should we send a new version to squash it?

(And Thomas, the latest version of this series is v4; you are commenting on the old v3 version now :) )

> 
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v3 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode
  2021-04-07  1:00       ` Suanming Mou
@ 2021-04-07  7:11         ` Thomas Monjalon
  2021-04-07  7:14           ` Suanming Mou
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Monjalon @ 2021-04-07  7:11 UTC (permalink / raw)
  To: John Hurley, Suanming Mou
  Cc: dev, Ori Kam, dev, Slava Ovsiienko, Matan Azrad, Raslan Darawsheh

07/04/2021 03:00, Suanming Mou:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 30/03/2021 03:39, Suanming Mou:
> > > From: John Hurley <jhurley@nvidia.com>
> > >
> > > A recent change adds support for scattered mbuf and UMR support for regex.
> > > Part of this commit makes the pi and ci counters of the regex_sq a
> > > quarter of the length in non umr mode, effectively moving them from 16
> > > bits to 14. The new get_free method casts the difference in pi and ci
> > > to a 16 bit value when calculating the free send queues, accounting
> > > for any wrapping when pi has looped back to 0 but ci has not yet.
> > > However, the move to 14 bits while still casting to 16 can now lead to
> > > corrupted, large values returned.
> > >
> > > Modify the get_free function to take in the has_umr flag and,
> > > accordingly, account for wrapping on either 14 or 16 bit pi/ci difference.
> > >
> > > Fixes: d55c9f637263 ("regex/mlx5: add data path scattered mbuf
> > > process")
> > 
> > It is fixing a patch in this series, right?
> > Why not squashing them?
> 
> Yes, this is a fix for this series. 
> This fix was done by John when he tested the code, so I put it as an individual one.
> Should we update an new version to squash it?

Yes better to squash it in a v5, thanks.

> (And Thomas, the latest version of this series is v4, you comment in this old v3 version now :) )

Yes, sorry.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v3 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode
  2021-04-07  7:11         ` Thomas Monjalon
@ 2021-04-07  7:14           ` Suanming Mou
  0 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-04-07  7:14 UTC (permalink / raw)
  To: NBU-Contact-Thomas Monjalon, John Hurley
  Cc: dev, Ori Kam, dev, Slava Ovsiienko, Matan Azrad, Raslan Darawsheh



> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Wednesday, April 7, 2021 3:12 PM
> To: John Hurley <jhurley@nvidia.com>; Suanming Mou
> <suanmingm@nvidia.com>
> Cc: dev@dpdk.org; Ori Kam <orika@nvidia.com>; dev@dpdk.org; Slava
> Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>;
> Raslan Darawsheh <rasland@nvidia.com>
> Subject: Re: [dpdk-dev] [PATCH v3 4/4] regex/mlx5: prevent wrong calculation
> of free sqs in umr mode
> 
> 07/04/2021 03:00, Suanming Mou:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > > 30/03/2021 03:39, Suanming Mou:
> > > > From: John Hurley <jhurley@nvidia.com>
> > > >
> > > > A recent change adds support for scattered mbuf and UMR support for
> regex.
> > > > Part of this commit makes the pi and ci counters of the regex_sq a
> > > > quarter of the length in non umr mode, effectively moving them
> > > > from 16 bits to 14. The new get_free method casts the difference
> > > > in pi and ci to a 16 bit value when calculating the free send
> > > > queues, accounting for any wrapping when pi has looped back to 0 but ci
> has not yet.
> > > > However, the move to 14 bits while still casting to 16 can now
> > > > lead to corrupted, large values returned.
> > > >
> > > > Modify the get_free function to take in the has_umr flag and,
> > > > accordingly, account for wrapping on either 14 or 16 bit pi/ci difference.
> > > >
> > > > Fixes: d55c9f637263 ("regex/mlx5: add data path scattered mbuf
> > > > process")
> > >
> > > It is fixing a patch in this series, right?
> > > Why not squashing them?
> >
> > Yes, this is a fix for this series.
> > This fix was done by John when he tested the code, so I put it as an individual
> one.
> > Should we update an new version to squash it?
> 
> Yes better to squash it in a v5, thanks.

OK, I see, thanks. Will update later.

> 
> > (And Thomas, the latest version of this series is v4, you comment in
> > this old v3 version now :) )
> 
> Yes, sorry.
> 



* [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf
  2021-03-09 23:57 [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Suanming Mou
                   ` (6 preceding siblings ...)
  2021-03-31  7:37 ` [dpdk-dev] [PATCH v4 0/4] regex/mlx5: support scattered mbuf Suanming Mou
@ 2021-04-07  7:21 ` Suanming Mou
  2021-04-07  7:21   ` [dpdk-dev] [PATCH v5 1/3] common/mlx5: add user memory registration bits Suanming Mou
                     ` (3 more replies)
  7 siblings, 4 replies; 36+ messages in thread
From: Suanming Mou @ 2021-04-07  7:21 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

The mlx5 RegEx driver did not support scattered mbufs. This patch
set adds scattered mbuf support by means of the UMR WQE.

A UMR (User-Mode Memory Registration) WQE can present data buffers
scattered across multiple mbufs through a single indirect mkey. By
taking advantage of the UMR WQE, the scattered mbuf of one operation
can be presented as one indirect mkey. The RegEx engine, which accepts
only one mkey, can now process the whole scattered mbuf in one
operation.

The maximum number of scattered mbufs supported in one UMR WQE is now
defined as 64. The mbufs from multiple operations can also be combined
into one UMR WQE if there is enough space in the KLM array, since each
operation can address its own mbuf content by the mkey's address and
length. However, one operation's scattered mbufs can't be split across
the KLM arrays of two different UMR WQEs. If the UMR WQE's KLM array
does not have enough free space for an operation, an extra UMR WQE
is used.
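
As a rough illustration of the packing rule above (not code from this
series), the sketch below counts how many UMR WQEs a burst would need.
The helper name count_umr_wqes() and its parameters are hypothetical,
and it walks the burst front to back for simplicity, while the driver
builds the WQE sets in reverse:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * nb_segs[i] is the segment count of operation i and max_klm is the
 * capacity of one UMR WQE's KLM array. An operation's segments are
 * never split across two KLM arrays; single-segment operations use
 * the mbuf's own mkey and consume no KLM entries.
 */
static inline size_t
count_umr_wqes(const uint16_t *nb_segs, size_t nb_ops, uint32_t max_klm)
{
	size_t umr_count = 0;
	uint32_t used = 0;	/* KLM entries used in the current array. */
	int open = 0;		/* Is a KLM array currently being filled? */

	for (size_t i = 0; i < nb_ops; i++) {
		if (nb_segs[i] <= 1)
			continue;
		/* One operation must fit into a single KLM array. */
		assert(nb_segs[i] <= max_klm);
		if (!open || used + nb_segs[i] > max_klm) {
			/* Not enough room: start a new UMR WQE. */
			umr_count++;
			used = 0;
			open = 1;
		}
		used += nb_segs[i];
	}
	return umr_count;
}
```

With segment counts {1, 3, 2, 4, 1} and a KLM capacity of 5, the first
two scattered operations share one UMR WQE and the third needs a
second one.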

To keep the UMR WQE's indirect mkey from being overwritten when the
SQ's WQE index wraps, the mkey index used by the UMR WQE should be the
index of the last RegEx WQE among the operations it covers. As one
operation consumes one WQE set, building the RegEx WQEs in reverse
order helps address the mkey more efficiently. When the operations in
one burst consume multiple mkeys, then each time the mkey KLM array
becomes full, the current reverse WQE set index is always the last one
covered by the new mkey for the new UMR WQE.

In GGA mode, the SQ WQE memory layout interleaves UMR/NOP and RegEx
WQEs. A UMR (or NOP) WQE together with its RegEx WQE is called a WQE
set. The SQ's pi and ci are also advanced per WQE set, not per WQE.
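
A minimal sketch of the WQE set index arithmetic, assuming a 16-bit
hardware WQE counter and 4 WQEBBs per WQE set (3 for the UMR/NOP slot
plus 1 for the RegEx WQE). The macro and function names below are
illustrative and do not appear in the driver:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the hardware WQE counter is 16 bits wide. */
#define WQE_INDEX_MASK 0xffffu
/* A WQE set spans 4 WQEBBs, so set indexes wrap at 14 bits. */
#define SET_INDEX_MASK (WQE_INDEX_MASK >> 2)

/* WQE index of the UMR/NOP WQE that opens WQE set 'set_idx'. */
static inline uint32_t
umr_wqe_index(uint32_t set_idx)
{
	return (set_idx * 4) & WQE_INDEX_MASK;
}

/* WQE index of the RegEx WQE that closes WQE set 'set_idx'. */
static inline uint32_t
regex_wqe_index(uint32_t set_idx)
{
	return (set_idx * 4 + 3) & WQE_INDEX_MASK;
}

/*
 * Free WQE sets in an SQ holding 'sq_size' sets. The pi/ci difference
 * is masked to 14 bits so it stays correct after pi has wrapped to 0
 * while ci has not.
 */
static inline uint32_t
sq_free_sets(uint32_t sq_size, uint32_t pi, uint32_t ci)
{
	return sq_size - ((pi - ci) & SET_INDEX_MASK);
}
```

Masking the difference with the 14-bit set mask, rather than casting
it to 16 bits, is what keeps the free count sane across a wrap: with
pi = 2 and ci = 0x3ffe, four sets are in flight.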

For operations that don't have a scattered mbuf, the mbuf's mkey is
used directly and the WQE set combination is NOP + RegEx.
For operations that have a scattered mbuf but share the UMR WQE with
others, the WQE set combination is NOP + RegEx.
For the operation that completes the UMR WQE, the WQE set combination
is UMR + RegEx.
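
The resulting layout can be sketched as follows, assuming 64-byte
WQEBBs and a 192-byte UMR/NOP slot in front of each RegEx WQE. All
names here are hypothetical and only illustrate the three cases
listed above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WQEBB_SIZE	64u			/* One WQE basic block. */
#define UMR_SLOT_SIZE	(3u * WQEBB_SIZE)	/* UMR or NOP WQE slot. */
#define WQE_SET_SIZE	(4u * WQEBB_SIZE)	/* UMR/NOP + RegEx WQE. */

enum first_wqe_kind { FIRST_WQE_NOP, FIRST_WQE_UMR };

/*
 * Only the operation that completes a UMR WQE gets UMR + RegEx.
 * Non-scattered operations, and scattered ones that share a UMR WQE
 * built in another set, get NOP + RegEx.
 */
static inline enum first_wqe_kind
first_wqe_of_set(int completes_umr)
{
	return completes_umr ? FIRST_WQE_UMR : FIRST_WQE_NOP;
}

/* Byte offset of WQE set 'pi' in an SQ of 'sq_size' sets (power of 2). */
static inline size_t
wqe_set_offset(uint32_t pi, uint32_t sq_size)
{
	return (size_t)(pi & (sq_size - 1)) * WQE_SET_SIZE;
}

/* Byte offset of the RegEx WQE, which follows the UMR/NOP slot. */
static inline size_t
regex_wqe_offset(uint32_t pi, uint32_t sq_size)
{
	return wqe_set_offset(pi, sq_size) + UMR_SLOT_SIZE;
}
```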

v5:
1. squash the previous fix patch.

v4:
1. git log improvement.

v3:
1. Move testregex.rst change to the correct commit.
2. Code rebase to the latest version.

v2:
1. Check mbuf multiple seg by nb_segs.
2. Add ops prefetch.
3. Allocate ops and mbuf memory together in test application.


Suanming Mou (3):
  common/mlx5: add user memory registration bits
  regex/mlx5: add data path scattered mbuf process
  app/test-regex: support scattered mbuf input

 app/test-regex/main.c                    | 134 ++++++--
 doc/guides/regexdevs/mlx5.rst            |   5 +
 doc/guides/rel_notes/release_21_05.rst   |   4 +
 doc/guides/tools/testregex.rst           |   3 +
 drivers/common/mlx5/linux/meson.build    |   2 +
 drivers/common/mlx5/mlx5_devx_cmds.c     |   5 +
 drivers/common/mlx5/mlx5_devx_cmds.h     |   3 +
 drivers/regex/mlx5/mlx5_regex.c          |   9 +
 drivers/regex/mlx5/mlx5_regex.h          |  26 +-
 drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 380 +++++++++++++++++++++--
 11 files changed, 531 insertions(+), 83 deletions(-)

-- 
2.25.1



* [dpdk-dev] [PATCH v5 1/3] common/mlx5: add user memory registration bits
  2021-04-07  7:21 ` [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf Suanming Mou
@ 2021-04-07  7:21   ` Suanming Mou
  2021-04-07  7:21   ` [dpdk-dev] [PATCH v5 2/3] regex/mlx5: add data path scattered mbuf process Suanming Mou
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-04-07  7:21 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

This commit adds the UMR capability bits.

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
---
 drivers/common/mlx5/linux/meson.build | 2 ++
 drivers/common/mlx5/mlx5_devx_cmds.c  | 5 +++++
 drivers/common/mlx5/mlx5_devx_cmds.h  | 3 +++
 3 files changed, 10 insertions(+)

diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 220de35420..5d6a861689 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -186,6 +186,8 @@ has_sym_args = [
 	'mlx5dv_dr_action_create_aso' ],
 	[ 'HAVE_INFINIBAND_VERBS_H', 'infiniband/verbs.h',
 	'INFINIBAND_VERBS_H' ],
+        [ 'HAVE_MLX5_UMR_IMKEY', 'infiniband/mlx5dv.h',
+        'MLX5_WQE_UMR_CTRL_FLAG_INLINE' ],
 ]
 config = configuration_data()
 foreach arg:has_sym_args
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.c b/drivers/common/mlx5/mlx5_devx_cmds.c
index c90e020643..268bcd0d99 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.c
+++ b/drivers/common/mlx5/mlx5_devx_cmds.c
@@ -266,6 +266,7 @@ mlx5_devx_cmd_mkey_create(void *ctx,
 	MLX5_SET(mkc, mkc, qpn, 0xffffff);
 	MLX5_SET(mkc, mkc, pd, attr->pd);
 	MLX5_SET(mkc, mkc, mkey_7_0, attr->umem_id & 0xFF);
+	MLX5_SET(mkc, mkc, umr_en, attr->umr_en);
 	MLX5_SET(mkc, mkc, translations_octword_size, translation_size);
 	MLX5_SET(mkc, mkc, relaxed_ordering_write,
 		 attr->relaxed_ordering_write);
@@ -752,6 +753,10 @@ mlx5_devx_cmd_query_hca_attr(void *ctx,
 						mini_cqe_resp_flow_tag);
 	attr->mini_cqe_resp_l3_l4_tag = MLX5_GET(cmd_hca_cap, hcattr,
 						 mini_cqe_resp_l3_l4_tag);
+	attr->umr_indirect_mkey_disabled =
+		MLX5_GET(cmd_hca_cap, hcattr, umr_indirect_mkey_disabled);
+	attr->umr_modify_entity_size_disabled =
+		MLX5_GET(cmd_hca_cap, hcattr, umr_modify_entity_size_disabled);
 	if (attr->qos.sup) {
 		MLX5_SET(query_hca_cap_in, in, op_mod,
 			 MLX5_GET_HCA_CAP_OP_MOD_QOS_CAP |
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h b/drivers/common/mlx5/mlx5_devx_cmds.h
index 2826c0b2c6..67b5f771c6 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.h
+++ b/drivers/common/mlx5/mlx5_devx_cmds.h
@@ -31,6 +31,7 @@ struct mlx5_devx_mkey_attr {
 	uint32_t pg_access:1;
 	uint32_t relaxed_ordering_write:1;
 	uint32_t relaxed_ordering_read:1;
+	uint32_t umr_en:1;
 	struct mlx5_klm *klm_array;
 	int klm_num;
 };
@@ -151,6 +152,8 @@ struct mlx5_hca_attr {
 	uint32_t log_max_mmo_dma:5;
 	uint32_t log_max_mmo_compress:5;
 	uint32_t log_max_mmo_decompress:5;
+	uint32_t umr_modify_entity_size_disabled:1;
+	uint32_t umr_indirect_mkey_disabled:1;
 };
 
 struct mlx5_devx_wq_attr {
-- 
2.25.1



* [dpdk-dev] [PATCH v5 2/3] regex/mlx5: add data path scattered mbuf process
  2021-04-07  7:21 ` [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf Suanming Mou
  2021-04-07  7:21   ` [dpdk-dev] [PATCH v5 1/3] common/mlx5: add user memory registration bits Suanming Mou
@ 2021-04-07  7:21   ` Suanming Mou
  2021-04-07  7:21   ` [dpdk-dev] [PATCH v5 3/3] app/test-regex: support scattered mbuf input Suanming Mou
  2021-04-08 20:53   ` [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf Thomas Monjalon
  3 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-04-07  7:21 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland, John Hurley

A UMR (User-Mode Memory Registration) WQE can present data buffers
scattered across multiple mbufs through a single indirect mkey. By
taking advantage of the UMR WQE, the scattered mbuf of one operation
can be presented as one indirect mkey. The RegEx engine, which accepts
only one mkey, can now process the whole scattered mbuf in one
operation.

The maximum number of scattered mbufs supported in one UMR WQE is now
defined as 64. The mbufs from multiple operations can also be combined
into one UMR WQE if there is enough space in the KLM array, since each
operation can address its own mbuf content by the mkey's address and
length. However, one operation's scattered mbufs can't be split across
the KLM arrays of two different UMR WQEs. If the UMR WQE's KLM array
does not have enough free space for an operation, an extra UMR WQE
is used.

To keep the UMR WQE's indirect mkey from being overwritten when the
SQ's WQE index wraps, the mkey index used by the UMR WQE should be the
index of the last RegEx WQE among the operations it covers. As one
operation consumes one WQE set, building the RegEx WQEs in reverse
order helps address the mkey more efficiently. When the operations in
one burst consume multiple mkeys, then each time the mkey KLM array
becomes full, the current reverse WQE set index is always the last one
covered by the new mkey for the new UMR WQE.

In GGA mode, the SQ WQE memory layout interleaves UMR/NOP and RegEx
WQEs. A UMR (or NOP) WQE together with its RegEx WQE is called a WQE
set. The SQ's pi and ci are also advanced per WQE set, not per WQE.

For operations that don't have a scattered mbuf, the mbuf's mkey is
used directly and the WQE set combination is NOP + RegEx.
For operations that have a scattered mbuf but share the UMR WQE with
others, the WQE set combination is NOP + RegEx.
For the operation that completes the UMR WQE, the WQE set combination
is UMR + RegEx.

Signed-off-by: John Hurley <jhurley@nvidia.com>
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
---
 doc/guides/regexdevs/mlx5.rst            |   5 +
 doc/guides/rel_notes/release_21_05.rst   |   4 +
 drivers/regex/mlx5/mlx5_regex.c          |   9 +
 drivers/regex/mlx5/mlx5_regex.h          |  26 +-
 drivers/regex/mlx5/mlx5_regex_control.c  |  43 ++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 380 +++++++++++++++++++++--
 6 files changed, 409 insertions(+), 58 deletions(-)

diff --git a/doc/guides/regexdevs/mlx5.rst b/doc/guides/regexdevs/mlx5.rst
index faaa6ac11d..45a0b96980 100644
--- a/doc/guides/regexdevs/mlx5.rst
+++ b/doc/guides/regexdevs/mlx5.rst
@@ -35,6 +35,11 @@ be specified as device parameter. The RegEx device can be probed and used with
 other Mellanox devices, by adding more options in the class.
 For example: ``class=net:regex`` will probe both the net PMD and the RegEx PMD.
 
+Features
+--------
+
+- Multi segments mbuf support.
+
 Supported NICs
 --------------
 
diff --git a/doc/guides/rel_notes/release_21_05.rst b/doc/guides/rel_notes/release_21_05.rst
index 873140b433..3b4b034d35 100644
--- a/doc/guides/rel_notes/release_21_05.rst
+++ b/doc/guides/rel_notes/release_21_05.rst
@@ -126,6 +126,10 @@ New Features
   * Added command to display Rx queue used descriptor count.
     ``show port (port_id) rxq (queue_id) desc used count``
 
+* **Updated Mellanox RegEx PMD.**
+
+  * Added support for multi segments mbuf.
+
 
 Removed Items
 -------------
diff --git a/drivers/regex/mlx5/mlx5_regex.c b/drivers/regex/mlx5/mlx5_regex.c
index ac5b205fa9..82c485e50c 100644
--- a/drivers/regex/mlx5/mlx5_regex.c
+++ b/drivers/regex/mlx5/mlx5_regex.c
@@ -199,6 +199,13 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	}
 	priv->regexdev->dev_ops = &mlx5_regexdev_ops;
 	priv->regexdev->enqueue = mlx5_regexdev_enqueue;
+#ifdef HAVE_MLX5_UMR_IMKEY
+	if (!attr.umr_indirect_mkey_disabled &&
+	    !attr.umr_modify_entity_size_disabled)
+		priv->has_umr = 1;
+	if (priv->has_umr)
+		priv->regexdev->enqueue = mlx5_regexdev_enqueue_gga;
+#endif
 	priv->regexdev->dequeue = mlx5_regexdev_dequeue;
 	priv->regexdev->device = (struct rte_device *)pci_dev;
 	priv->regexdev->data->dev_private = priv;
@@ -213,6 +220,8 @@ mlx5_regex_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	    rte_errno = ENOMEM;
 		goto error;
 	}
+	DRV_LOG(INFO, "RegEx GGA is %s.",
+		priv->has_umr ? "supported" : "unsupported");
 	return 0;
 
 error:
diff --git a/drivers/regex/mlx5/mlx5_regex.h b/drivers/regex/mlx5/mlx5_regex.h
index a2b3f0d9f3..51a2101e53 100644
--- a/drivers/regex/mlx5/mlx5_regex.h
+++ b/drivers/regex/mlx5/mlx5_regex.h
@@ -15,6 +15,7 @@
 #include <mlx5_common_devx.h>
 
 #include "mlx5_rxp.h"
+#include "mlx5_regex_utils.h"
 
 struct mlx5_regex_sq {
 	uint16_t log_nb_desc; /* Log 2 number of desc for this object. */
@@ -40,6 +41,7 @@ struct mlx5_regex_qp {
 	struct mlx5_regex_job *jobs;
 	struct ibv_mr *metadata;
 	struct ibv_mr *outputs;
+	struct ibv_mr *imkey_addr; /* Indirect mkey array region. */
 	size_t ci, pi;
 	struct mlx5_mr_ctrl mr_ctrl;
 };
@@ -71,8 +73,29 @@ struct mlx5_regex_priv {
 	struct mlx5_mr_share_cache mr_scache; /* Global shared MR cache. */
 	uint8_t is_bf2; /* The device is BF2 device. */
 	uint8_t sq_ts_format; /* Whether SQ supports timestamp formats. */
+	uint8_t has_umr; /* The device supports UMR. */
 };
 
+#ifdef HAVE_IBV_FLOW_DV_SUPPORT
+static inline int
+regex_get_pdn(void *pd, uint32_t *pdn)
+{
+	struct mlx5dv_obj obj;
+	struct mlx5dv_pd pd_info;
+	int ret = 0;
+
+	obj.pd.in = pd;
+	obj.pd.out = &pd_info;
+	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
+	if (ret) {
+		DRV_LOG(DEBUG, "Fail to get PD object info");
+		return ret;
+	}
+	*pdn = pd_info.pdn;
+	return 0;
+}
+#endif
+
 /* mlx5_regex.c */
 int mlx5_regex_start(struct rte_regexdev *dev);
 int mlx5_regex_stop(struct rte_regexdev *dev);
@@ -108,5 +131,6 @@ uint16_t mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 		       struct rte_regex_ops **ops, uint16_t nb_ops);
 uint16_t mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 		       struct rte_regex_ops **ops, uint16_t nb_ops);
-
+uint16_t mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
+		       struct rte_regex_ops **ops, uint16_t nb_ops);
 #endif /* MLX5_REGEX_H */
diff --git a/drivers/regex/mlx5/mlx5_regex_control.c b/drivers/regex/mlx5/mlx5_regex_control.c
index 55fbb419ed..eef0fe579d 100644
--- a/drivers/regex/mlx5/mlx5_regex_control.c
+++ b/drivers/regex/mlx5/mlx5_regex_control.c
@@ -27,6 +27,9 @@
 
 #define MLX5_REGEX_NUM_WQE_PER_PAGE (4096/64)
 
+#define MLX5_REGEX_WQE_LOG_NUM(has_umr, log_desc) \
+		((has_umr) ? ((log_desc) + 2) : (log_desc))
+
 /**
  * Returns the number of qp obj to be created.
  *
@@ -91,26 +94,6 @@ regex_ctrl_create_cq(struct mlx5_regex_priv *priv, struct mlx5_regex_cq *cq)
 	return 0;
 }
 
-#ifdef HAVE_IBV_FLOW_DV_SUPPORT
-static int
-regex_get_pdn(void *pd, uint32_t *pdn)
-{
-	struct mlx5dv_obj obj;
-	struct mlx5dv_pd pd_info;
-	int ret = 0;
-
-	obj.pd.in = pd;
-	obj.pd.out = &pd_info;
-	ret = mlx5_glue->dv_init_obj(&obj, MLX5DV_OBJ_PD);
-	if (ret) {
-		DRV_LOG(DEBUG, "Fail to get PD object info");
-		return ret;
-	}
-	*pdn = pd_info.pdn;
-	return 0;
-}
-#endif
-
 /**
  * Destroy the SQ object.
  *
@@ -168,14 +151,16 @@ regex_ctrl_create_sq(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 	int ret;
 
 	sq->log_nb_desc = log_nb_desc;
+	sq->sqn = q_ind;
 	sq->ci = 0;
 	sq->pi = 0;
 	ret = regex_get_pdn(priv->pd, &pd_num);
 	if (ret)
 		return ret;
 	attr.wq_attr.pd = pd_num;
-	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj, log_nb_desc, &attr,
-				  SOCKET_ID_ANY);
+	ret = mlx5_devx_sq_create(priv->ctx, &sq->sq_obj,
+			MLX5_REGEX_WQE_LOG_NUM(priv->has_umr, log_nb_desc),
+			&attr, SOCKET_ID_ANY);
 	if (ret) {
 		DRV_LOG(ERR, "Can't create SQ object.");
 		rte_errno = ENOMEM;
@@ -225,10 +210,18 @@ mlx5_regex_qp_setup(struct rte_regexdev *dev, uint16_t qp_ind,
 
 	qp = &priv->qps[qp_ind];
 	qp->flags = cfg->qp_conf_flags;
-	qp->cq.log_nb_desc = rte_log2_u32(cfg->nb_desc);
-	qp->nb_desc = 1 << qp->cq.log_nb_desc;
+	log_desc = rte_log2_u32(cfg->nb_desc);
+	/*
+	 * UMR mode requires two WQEs(UMR and RegEx WQE) for one descriptor.
+	 * For CQ, expand the CQE number multiple with 2.
+	 * For SQ, the UMR and RegEx WQE for one descriptor consumes 4 WQEBBS,
+	 * expand the WQE number multiple with 4.
+	 */
+	qp->cq.log_nb_desc = log_desc + (!!priv->has_umr);
+	qp->nb_desc = 1 << log_desc;
 	if (qp->flags & RTE_REGEX_QUEUE_PAIR_CFG_OOS_F)
-		qp->nb_obj = regex_ctrl_get_nb_obj(qp->nb_desc);
+		qp->nb_obj = regex_ctrl_get_nb_obj
+			(1 << MLX5_REGEX_WQE_LOG_NUM(priv->has_umr, log_desc));
 	else
 		qp->nb_obj = 1;
 	qp->sqs = rte_malloc(NULL,
diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index beaea7b63f..b57e7d7794 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -32,6 +32,15 @@
 #define MLX5_REGEX_WQE_GATHER_OFFSET 32
 #define MLX5_REGEX_WQE_SCATTER_OFFSET 48
 #define MLX5_REGEX_METADATA_OFF 32
+#define MLX5_REGEX_UMR_WQE_SIZE 192
+/* The maximum KLMs can be added to one UMR indirect mkey. */
+#define MLX5_REGEX_MAX_KLM_NUM 128
+/* The KLM array size for one job. */
+#define MLX5_REGEX_KLMS_SIZE \
+	((MLX5_REGEX_MAX_KLM_NUM) * sizeof(struct mlx5_klm))
+/* In WQE set mode, the pi should be quarter of the MLX5_REGEX_MAX_WQE_INDEX. */
+#define MLX5_REGEX_UMR_SQ_PI_IDX(pi, ops) \
+	(((pi) + (ops)) & (MLX5_REGEX_MAX_WQE_INDEX >> 2))
 
 static inline uint32_t
 sq_size_get(struct mlx5_regex_sq *sq)
@@ -49,6 +58,8 @@ struct mlx5_regex_job {
 	uint64_t user_id;
 	volatile uint8_t *output;
 	volatile uint8_t *metadata;
+	struct mlx5_klm *imkey_array; /* Indirect mkey's KLM array. */
+	struct mlx5_devx_obj *imkey; /* UMR WQE's indirect mkey. */
 } __rte_cached_aligned;
 
 static inline void
@@ -99,12 +110,13 @@ set_wqe_ctrl_seg(struct mlx5_wqe_ctrl_seg *seg, uint16_t pi, uint8_t opcode,
 }
 
 static inline void
-prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
-	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
-	 struct mlx5_regex_job *job)
+__prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
+	   struct rte_regex_ops *op, struct mlx5_regex_job *job,
+	   size_t pi, struct mlx5_klm *klm)
 {
-	size_t wqe_offset = (sq->pi & (sq_size_get(sq) - 1)) * MLX5_SEND_WQE_BB;
-	uint32_t lkey;
+	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
+			    (MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
+			    (priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
 	uint16_t group0 = op->req_flags & RTE_REGEX_OPS_REQ_GROUP_ID0_VALID_F ?
 				op->group_id0 : 0;
 	uint16_t group1 = op->req_flags & RTE_REGEX_OPS_REQ_GROUP_ID1_VALID_F ?
@@ -122,14 +134,11 @@ prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 			       RTE_REGEX_OPS_REQ_GROUP_ID2_VALID_F |
 			       RTE_REGEX_OPS_REQ_GROUP_ID3_VALID_F)))
 		group0 = op->group_id0;
-	lkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
-				  &priv->mr_scache, &qp->mr_ctrl,
-				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
-				  !!(op->mbuf->ol_flags & EXT_ATTACHED_MBUF));
 	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
 	int ds = 4; /*  ctrl + meta + input + output */
 
-	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe, sq->pi,
+	set_wqe_ctrl_seg((struct mlx5_wqe_ctrl_seg *)wqe,
+			 (priv->has_umr ? (pi * 4 + 3) : pi),
 			 MLX5_OPCODE_MMO, MLX5_OPC_MOD_MMO_REGEX,
 			 sq->sq_obj.sq->id, 0, ds, 0, 0);
 	set_regex_ctrl_seg(wqe + 12, 0, group0, group1, group2, group3,
@@ -137,36 +146,56 @@ prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
 	struct mlx5_wqe_data_seg *input_seg =
 		(struct mlx5_wqe_data_seg *)(wqe +
 					     MLX5_REGEX_WQE_GATHER_OFFSET);
-	input_seg->byte_count =
-		rte_cpu_to_be_32(rte_pktmbuf_data_len(op->mbuf));
-	input_seg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(op->mbuf,
-							    uintptr_t));
-	input_seg->lkey = lkey;
+	input_seg->byte_count = rte_cpu_to_be_32(klm->byte_count);
+	input_seg->addr = rte_cpu_to_be_64(klm->address);
+	input_seg->lkey = klm->mkey;
 	job->user_id = op->user_id;
+}
+
+static inline void
+prep_one(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
+	 struct mlx5_regex_sq *sq, struct rte_regex_ops *op,
+	 struct mlx5_regex_job *job)
+{
+	struct mlx5_klm klm;
+
+	klm.byte_count = rte_pktmbuf_data_len(op->mbuf);
+	klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, 0,
+				  &priv->mr_scache, &qp->mr_ctrl,
+				  rte_pktmbuf_mtod(op->mbuf, uintptr_t),
+				  !!(op->mbuf->ol_flags & EXT_ATTACHED_MBUF));
+	klm.address = rte_pktmbuf_mtod(op->mbuf, uintptr_t);
+	__prep_one(priv, sq, op, job, sq->pi, &klm);
 	sq->db_pi = sq->pi;
 	sq->pi = (sq->pi + 1) & MLX5_REGEX_MAX_WQE_INDEX;
 }
 
 static inline void
-send_doorbell(struct mlx5dv_devx_uar *uar, struct mlx5_regex_sq *sq)
+send_doorbell(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq)
 {
+	struct mlx5dv_devx_uar *uar = priv->uar;
 	size_t wqe_offset = (sq->db_pi & (sq_size_get(sq) - 1)) *
-		MLX5_SEND_WQE_BB;
+		(MLX5_SEND_WQE_BB << (priv->has_umr ? 2 : 0)) +
+		(priv->has_umr ? MLX5_REGEX_UMR_WQE_SIZE : 0);
 	uint8_t *wqe = (uint8_t *)(uintptr_t)sq->sq_obj.wqes + wqe_offset;
-	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
+	/* Or the fm_ce_se instead of set, avoid the fence be cleared. */
+	((struct mlx5_wqe_ctrl_seg *)wqe)->fm_ce_se |= MLX5_WQE_CTRL_CQ_UPDATE;
 	uint64_t *doorbell_addr =
 		(uint64_t *)((uint8_t *)uar->base_addr + 0x800);
 	rte_io_wmb();
-	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((sq->db_pi + 1) &
-						 MLX5_REGEX_MAX_WQE_INDEX);
+	sq->sq_obj.db_rec[MLX5_SND_DBR] = rte_cpu_to_be_32((priv->has_umr ?
+					(sq->db_pi * 4 + 3) : sq->db_pi) &
+					MLX5_REGEX_MAX_WQE_INDEX);
 	rte_wmb();
 	*doorbell_addr = *(volatile uint64_t *)wqe;
 	rte_wmb();
 }
 
 static inline int
-can_send(struct mlx5_regex_sq *sq) {
-	return ((uint16_t)(sq->pi - sq->ci) < sq_size_get(sq));
+get_free(struct mlx5_regex_sq *sq, uint8_t has_umr) {
+	return (sq_size_get(sq) - ((sq->pi - sq->ci) &
+			(has_umr ? (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
+			MLX5_REGEX_MAX_WQE_INDEX)));
 }
 
 static inline uint32_t
@@ -174,6 +203,211 @@ job_id_get(uint32_t qid, size_t sq_size, size_t index) {
 	return qid * sq_size + (index & (sq_size - 1));
 }
 
+#ifdef HAVE_MLX5_UMR_IMKEY
+static inline int
+mkey_klm_available(struct mlx5_klm *klm, uint32_t pos, uint32_t new)
+{
+	return (klm && ((pos + new) <= MLX5_REGEX_MAX_KLM_NUM));
+}
+
+static inline void
+complete_umr_wqe(struct mlx5_regex_qp *qp, struct mlx5_regex_sq *sq,
+		 struct mlx5_regex_job *mkey_job,
+		 size_t umr_index, uint32_t klm_size, uint32_t total_len)
+{
+	size_t wqe_offset = (umr_index & (sq_size_get(sq) - 1)) *
+		(MLX5_SEND_WQE_BB * 4);
+	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg *)((uint8_t *)
+				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
+	struct mlx5_wqe_umr_ctrl_seg *ucseg =
+				(struct mlx5_wqe_umr_ctrl_seg *)(wqe + 1);
+	struct mlx5_wqe_mkey_context_seg *mkc =
+				(struct mlx5_wqe_mkey_context_seg *)(ucseg + 1);
+	struct mlx5_klm *iklm = (struct mlx5_klm *)(mkc + 1);
+	uint16_t klm_align = RTE_ALIGN(klm_size, 4);
+
+	memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
+	/* Set WQE control seg. Non-inline KLM UMR WQE size must be 9 WQE_DS. */
+	set_wqe_ctrl_seg(wqe, (umr_index * 4), MLX5_OPCODE_UMR,
+			 0, sq->sq_obj.sq->id, 0, 9, 0,
+			 rte_cpu_to_be_32(mkey_job->imkey->id));
+	/* Set UMR WQE control seg. */
+	ucseg->mkey_mask |= rte_cpu_to_be_64(MLX5_WQE_UMR_CTRL_MKEY_MASK_LEN |
+				MLX5_WQE_UMR_CTRL_FLAG_TRNSLATION_OFFSET |
+				MLX5_WQE_UMR_CTRL_MKEY_MASK_ACCESS_LOCAL_WRITE);
+	ucseg->klm_octowords = rte_cpu_to_be_16(klm_align);
+	/* Set mkey context seg. */
+	mkc->len = rte_cpu_to_be_64(total_len);
+	mkc->qpn_mkey = rte_cpu_to_be_32(0xffffff00 |
+					(mkey_job->imkey->id & 0xff));
+	/* Set UMR pointer to data seg. */
+	iklm->address = rte_cpu_to_be_64
+				((uintptr_t)((char *)mkey_job->imkey_array));
+	iklm->mkey = rte_cpu_to_be_32(qp->imkey_addr->lkey);
+	iklm->byte_count = rte_cpu_to_be_32(klm_align);
+	/* Clear the padding memory. */
+	memset((uint8_t *)&mkey_job->imkey_array[klm_size], 0,
+	       sizeof(struct mlx5_klm) * (klm_align - klm_size));
+
+	/* Add the following RegEx WQE with fence. */
+	wqe = (struct mlx5_wqe_ctrl_seg *)
+				(((uint8_t *)wqe) + MLX5_REGEX_UMR_WQE_SIZE);
+	wqe->fm_ce_se |= MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE;
+}
+
+static inline void
+prep_nop_regex_wqe_set(struct mlx5_regex_priv *priv, struct mlx5_regex_sq *sq,
+		       struct rte_regex_ops *op, struct mlx5_regex_job *job,
+		       size_t pi, struct mlx5_klm *klm)
+{
+	size_t wqe_offset = (pi & (sq_size_get(sq) - 1)) *
+			    (MLX5_SEND_WQE_BB << 2);
+	struct mlx5_wqe_ctrl_seg *wqe = (struct mlx5_wqe_ctrl_seg *)((uint8_t *)
+				   (uintptr_t)sq->sq_obj.wqes + wqe_offset);
+
+	/* Clear the WQE memory used as UMR WQE previously. */
+	if ((rte_be_to_cpu_32(wqe->opmod_idx_opcode) & 0xff) != MLX5_OPCODE_NOP)
+		memset(wqe, 0, MLX5_REGEX_UMR_WQE_SIZE);
+	/* UMR WQE size is 9 DS, align nop WQE to 3 WQEBBS(12 DS). */
+	set_wqe_ctrl_seg(wqe, pi * 4, MLX5_OPCODE_NOP, 0, sq->sq_obj.sq->id,
+			 0, 12, 0, 0);
+	__prep_one(priv, sq, op, job, pi, klm);
+}
+
+static inline void
+prep_regex_umr_wqe_set(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp,
+	 struct mlx5_regex_sq *sq, struct rte_regex_ops **op, size_t nb_ops)
+{
+	struct mlx5_regex_job *job = NULL;
+	size_t sqid = sq->sqn, mkey_job_id = 0;
+	size_t left_ops = nb_ops;
+	uint32_t klm_num = 0, len;
+	struct mlx5_klm *mkey_klm = NULL;
+	struct mlx5_klm klm;
+
+	sqid = sq->sqn;
+	while (left_ops--)
+		rte_prefetch0(op[left_ops]);
+	left_ops = nb_ops;
+	/*
+	 * Build the WQE set by reverse. In case the burst may consume
+	 * multiple mkeys, build the WQE set as normal will hard to
+	 * address the last mkey index, since we will only know the last
+	 * RegEx WQE's index when finishes building.
+	 */
+	while (left_ops--) {
+		struct rte_mbuf *mbuf = op[left_ops]->mbuf;
+		size_t pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, left_ops);
+
+		if (mbuf->nb_segs > 1) {
+			size_t scatter_size = 0;
+
+			if (!mkey_klm_available(mkey_klm, klm_num,
+						mbuf->nb_segs)) {
+				/*
+				 * The mkey's KLM is full, create the UMR
+				 * WQE in the next WQE set.
+				 */
+				if (mkey_klm)
+					complete_umr_wqe(qp, sq,
+						&qp->jobs[mkey_job_id],
+						MLX5_REGEX_UMR_SQ_PI_IDX(pi, 1),
+						klm_num, len);
+				/*
+				 * Get the indirect mkey and KLM array index
+				 * from the last WQE set.
+				 */
+				mkey_job_id = job_id_get(sqid,
+							 sq_size_get(sq), pi);
+				mkey_klm = qp->jobs[mkey_job_id].imkey_array;
+				klm_num = 0;
+				len = 0;
+			}
+			/* Build RegEx WQE's data segment KLM. */
+			klm.address = len;
+			klm.mkey = rte_cpu_to_be_32
+					(qp->jobs[mkey_job_id].imkey->id);
+			while (mbuf) {
+				/* Build indirect mkey seg's KLM. */
+				mkey_klm->mkey = mlx5_mr_addr2mr_bh(priv->pd,
+					NULL, &priv->mr_scache, &qp->mr_ctrl,
+					rte_pktmbuf_mtod(mbuf, uintptr_t),
+					!!(mbuf->ol_flags & EXT_ATTACHED_MBUF));
+				mkey_klm->address = rte_cpu_to_be_64
+					(rte_pktmbuf_mtod(mbuf, uintptr_t));
+				mkey_klm->byte_count = rte_cpu_to_be_32
+						(rte_pktmbuf_data_len(mbuf));
+				/*
+				 * Save the mbuf's total size for RegEx data
+				 * segment.
+				 */
+				scatter_size += rte_pktmbuf_data_len(mbuf);
+				mkey_klm++;
+				klm_num++;
+				mbuf = mbuf->next;
+			}
+			len += scatter_size;
+			klm.byte_count = scatter_size;
+		} else {
+			/* The single mbuf case. Build the KLM directly. */
+			klm.mkey = mlx5_mr_addr2mr_bh(priv->pd, NULL,
+					&priv->mr_scache, &qp->mr_ctrl,
+					rte_pktmbuf_mtod(mbuf, uintptr_t),
+					!!(mbuf->ol_flags & EXT_ATTACHED_MBUF));
+			klm.address = rte_pktmbuf_mtod(mbuf, uintptr_t);
+			klm.byte_count = rte_pktmbuf_data_len(mbuf);
+		}
+		job = &qp->jobs[job_id_get(sqid, sq_size_get(sq), pi)];
+		/*
+		 * Build the nop + RegEx WQE set by default. The first nop WQE
+		 * will be updated later as UMR WQE if scattered mbufs exist.
+		 */
+		prep_nop_regex_wqe_set(priv, sq, op[left_ops], job, pi, &klm);
+	}
+	/*
+	 * Scattered mbuf have been added to the KLM array. Complete the build
+	 * of UMR WQE, update the first nop WQE as UMR WQE.
+	 */
+	if (mkey_klm)
+		complete_umr_wqe(qp, sq, &qp->jobs[mkey_job_id], sq->pi,
+				 klm_num, len);
+	sq->db_pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops - 1);
+	sq->pi = MLX5_REGEX_UMR_SQ_PI_IDX(sq->pi, nb_ops);
+}
+
+uint16_t
+mlx5_regexdev_enqueue_gga(struct rte_regexdev *dev, uint16_t qp_id,
+			  struct rte_regex_ops **ops, uint16_t nb_ops)
+{
+	struct mlx5_regex_priv *priv = dev->data->dev_private;
+	struct mlx5_regex_qp *queue = &priv->qps[qp_id];
+	struct mlx5_regex_sq *sq;
+	size_t sqid, nb_left = nb_ops, nb_desc;
+
+	while ((sqid = ffs(queue->free_sqs))) {
+		sqid--; /* ffs returns 1 for bit 0 */
+		sq = &queue->sqs[sqid];
+		nb_desc = get_free(sq, priv->has_umr);
+		if (nb_desc) {
+			/* The ops be handled can't exceed nb_ops. */
+			if (nb_desc > nb_left)
+				nb_desc = nb_left;
+			else
+				queue->free_sqs &= ~(1 << sqid);
+			prep_regex_umr_wqe_set(priv, queue, sq, ops, nb_desc);
+			send_doorbell(priv, sq);
+			nb_left -= nb_desc;
+		}
+		if (!nb_left)
+			break;
+		ops += nb_desc;
+	}
+	nb_ops -= nb_left;
+	queue->pi += nb_ops;
+	return nb_ops;
+}
+#endif
+
 uint16_t
 mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 		      struct rte_regex_ops **ops, uint16_t nb_ops)
@@ -186,17 +420,17 @@ mlx5_regexdev_enqueue(struct rte_regexdev *dev, uint16_t qp_id,
 	while ((sqid = ffs(queue->free_sqs))) {
 		sqid--; /* ffs returns 1 for bit 0 */
 		sq = &queue->sqs[sqid];
-		while (can_send(sq)) {
+		while (get_free(sq, priv->has_umr)) {
 			job_id = job_id_get(sqid, sq_size_get(sq), sq->pi);
 			prep_one(priv, queue, sq, ops[i], &queue->jobs[job_id]);
 			i++;
 			if (unlikely(i == nb_ops)) {
-				send_doorbell(priv->uar, sq);
+				send_doorbell(priv, sq);
 				goto out;
 			}
 		}
 		queue->free_sqs &= ~(1 << sqid);
-		send_doorbell(priv->uar, sq);
+		send_doorbell(priv, sq);
 	}
 
 out:
@@ -308,6 +542,10 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 			  MLX5_REGEX_MAX_WQE_INDEX;
 		size_t sqid = cqe->rsvd3[2];
 		struct mlx5_regex_sq *sq = &queue->sqs[sqid];
+
+		/* UMR mode WQE counter move as WQE set(4 WQEBBS).*/
+		if (priv->has_umr)
+			wq_counter >>= 2;
 		while (sq->ci != wq_counter) {
 			if (unlikely(i == nb_ops)) {
 				/* Return without updating cq->ci */
@@ -316,7 +554,9 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 			uint32_t job_id = job_id_get(sqid, sq_size_get(sq),
 						     sq->ci);
 			extract_result(ops[i], &queue->jobs[job_id]);
-			sq->ci = (sq->ci + 1) & MLX5_REGEX_MAX_WQE_INDEX;
+			sq->ci = (sq->ci + 1) & (priv->has_umr ?
+				 (MLX5_REGEX_MAX_WQE_INDEX >> 2) :
+				  MLX5_REGEX_MAX_WQE_INDEX);
 			i++;
 		}
 		cq->ci = (cq->ci + 1) & 0xffffff;
@@ -331,7 +571,7 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 }
 
 static void
-setup_sqs(struct mlx5_regex_qp *queue)
+setup_sqs(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *queue)
 {
 	size_t sqid, entry;
 	uint32_t job_id;
@@ -342,6 +582,14 @@ setup_sqs(struct mlx5_regex_qp *queue)
 			job_id = sqid * sq_size_get(sq) + entry;
 			struct mlx5_regex_job *job = &queue->jobs[job_id];
 
+			/* Fill UMR WQE with NOP in advance. */
+			if (priv->has_umr) {
+				set_wqe_ctrl_seg
+					((struct mlx5_wqe_ctrl_seg *)wqe,
+					 entry * 2, MLX5_OPCODE_NOP, 0,
+					 sq->sq_obj.sq->id, 0, 12, 0, 0);
+				wqe += MLX5_REGEX_UMR_WQE_SIZE;
+			}
 			set_metadata_seg((struct mlx5_wqe_metadata_seg *)
 					 (wqe + MLX5_REGEX_WQE_METADATA_OFFSET),
 					 0, queue->metadata->lkey,
@@ -358,8 +606,9 @@ setup_sqs(struct mlx5_regex_qp *queue)
 }
 
 static int
-setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
+setup_buffers(struct mlx5_regex_priv *priv, struct mlx5_regex_qp *qp)
 {
+	struct ibv_pd *pd = priv->pd;
 	uint32_t i;
 	int err;
 
@@ -395,6 +644,24 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
 		goto err_output;
 	}
 
+	if (priv->has_umr) {
+		ptr = rte_calloc(__func__, qp->nb_desc, MLX5_REGEX_KLMS_SIZE,
+				 MLX5_REGEX_KLMS_SIZE);
+		if (!ptr) {
+			err = -ENOMEM;
+			goto err_imkey;
+		}
+		qp->imkey_addr = mlx5_glue->reg_mr(pd, ptr,
+					MLX5_REGEX_KLMS_SIZE * qp->nb_desc,
+					IBV_ACCESS_LOCAL_WRITE);
+		if (!qp->imkey_addr) {
+			rte_free(ptr);
+			DRV_LOG(ERR, "Failed to register imkey KLM array");
+			err = -EINVAL;
+			goto err_imkey;
+		}
+	}
+
 	/* distribute buffers to jobs */
 	for (i = 0; i < qp->nb_desc; i++) {
 		qp->jobs[i].output =
@@ -403,9 +670,18 @@ setup_buffers(struct mlx5_regex_qp *qp, struct ibv_pd *pd)
 		qp->jobs[i].metadata =
 			(uint8_t *)qp->metadata->addr +
 			(i % qp->nb_desc) * MLX5_REGEX_METADATA_SIZE;
+		if (qp->imkey_addr)
+			qp->jobs[i].imkey_array = (struct mlx5_klm *)
+				qp->imkey_addr->addr +
+				(i % qp->nb_desc) * MLX5_REGEX_MAX_KLM_NUM;
 	}
+
 	return 0;
 
+err_imkey:
+	ptr = qp->outputs->addr;
+	rte_free(ptr);
+	mlx5_glue->dereg_mr(qp->outputs);
 err_output:
 	ptr = qp->metadata->addr;
 	rte_free(ptr);
@@ -417,23 +693,57 @@ int
 mlx5_regexdev_setup_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
 {
 	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
-	int err;
+	struct mlx5_klm klm = { 0 };
+	struct mlx5_devx_mkey_attr attr = {
+		.klm_array = &klm,
+		.klm_num = 1,
+		.umr_en = 1,
+	};
+	uint32_t i;
+	int err = 0;
 
 	qp->jobs = rte_calloc(__func__, qp->nb_desc, sizeof(*qp->jobs), 64);
 	if (!qp->jobs)
 		return -ENOMEM;
-	err = setup_buffers(qp, priv->pd);
+	err = setup_buffers(priv, qp);
 	if (err) {
 		rte_free(qp->jobs);
 		return err;
 	}
-	setup_sqs(qp);
-	return 0;
+
+	setup_sqs(priv, qp);
+
+	if (priv->has_umr) {
+#ifdef HAVE_IBV_FLOW_DV_SUPPORT
+		if (regex_get_pdn(priv->pd, &attr.pd)) {
+			err = -rte_errno;
+			DRV_LOG(ERR, "Failed to get pdn.");
+			mlx5_regexdev_teardown_fastpath(priv, qp_id);
+			return err;
+		}
+#endif
+		for (i = 0; i < qp->nb_desc; i++) {
+			attr.klm_num = MLX5_REGEX_MAX_KLM_NUM;
+			attr.klm_array = qp->jobs[i].imkey_array;
+			qp->jobs[i].imkey = mlx5_devx_cmd_mkey_create(priv->ctx,
+								      &attr);
+			if (!qp->jobs[i].imkey) {
+				err = -rte_errno;
+				DRV_LOG(ERR, "Failed to allocate imkey.");
+				mlx5_regexdev_teardown_fastpath(priv, qp_id);
+			}
+		}
+	}
+	return err;
 }
 
 static void
 free_buffers(struct mlx5_regex_qp *qp)
 {
+	if (qp->imkey_addr) {
+		mlx5_glue->dereg_mr(qp->imkey_addr);
+		rte_free(qp->imkey_addr->addr);
+	}
 	if (qp->metadata) {
 		mlx5_glue->dereg_mr(qp->metadata);
 		rte_free(qp->metadata->addr);
@@ -448,8 +758,14 @@ void
 mlx5_regexdev_teardown_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
 {
 	struct mlx5_regex_qp *qp = &priv->qps[qp_id];
+	uint32_t i;
 
 	if (qp) {
+		for (i = 0; i < qp->nb_desc; i++) {
+			if (qp->jobs[i].imkey)
+				claim_zero(mlx5_devx_cmd_destroy
+							(qp->jobs[i].imkey));
+		}
 		free_buffers(qp);
 		if (qp->jobs)
 			rte_free(qp->jobs);
-- 
2.25.1
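
The dequeue hunk above converts the hardware WQE counter into a WQE set
index (one set spans 4 WQEBBs in UMR mode) and wraps sq->ci with a mask
that is narrower by the same factor. A minimal standalone sketch of that
arithmetic follows. The helper names are hypothetical and not part of the
driver, and the value 0xffff for the maximum WQE index is an assumption
made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in for MLX5_REGEX_MAX_WQE_INDEX; value is illustrative. */
#define MAX_WQE_INDEX 0xffffu

/* Mask used when advancing ci: 4x narrower when counting WQE sets. */
static uint32_t
ci_mask_get(int has_umr)
{
	return has_umr ? (MAX_WQE_INDEX >> 2) : MAX_WQE_INDEX;
}

/* Translate the CQE WQE counter into the unit that ci is counted in. */
static uint32_t
wq_counter_to_ci(uint32_t wq_counter, int has_umr)
{
	wq_counter &= MAX_WQE_INDEX;
	return has_umr ? (wq_counter >> 2) : wq_counter;
}

/* Advance ci by one completed job, wrapping at the mode-specific mask. */
static uint32_t
ci_next(uint32_t ci, int has_umr)
{
	return (ci + 1) & ci_mask_get(has_umr);
}

/* Count the jobs the dequeue loop would drain for one CQE. */
static uint32_t
completed_count(uint32_t ci, uint32_t wq_counter, int has_umr)
{
	uint32_t target = wq_counter_to_ci(wq_counter, has_umr);
	uint32_t n = 0;

	/* ci must already lie inside the mask, or the loop cannot end. */
	assert(ci <= ci_mask_get(has_umr));
	while (ci != target) {
		ci = ci_next(ci, has_umr);
		n++;
	}
	return n;
}
```

With these helpers, a counter value of 8 corresponds to two completed
jobs in UMR mode and to eight in non-UMR mode, which is why the shift and
the narrower mask have to be applied together.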


* [dpdk-dev] [PATCH v5 3/3] app/test-regex: support scattered mbuf input
  2021-04-07  7:21 ` [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf Suanming Mou
  2021-04-07  7:21   ` [dpdk-dev] [PATCH v5 1/3] common/mlx5: add user memory registration bits Suanming Mou
  2021-04-07  7:21   ` [dpdk-dev] [PATCH v5 2/3] regex/mlx5: add data path scattered mbuf process Suanming Mou
@ 2021-04-07  7:21   ` Suanming Mou
  2021-04-08 20:53   ` [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf Thomas Monjalon
  3 siblings, 0 replies; 36+ messages in thread
From: Suanming Mou @ 2021-04-07  7:21 UTC (permalink / raw)
  To: orika; +Cc: dev, viacheslavo, matan, rasland

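This commit adds scattered mbuf input support to the test-regex
application. The new --nb_segs option sets the number of mbuf segments
used for each job.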

Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
---
 app/test-regex/main.c          | 134 +++++++++++++++++++++++++++------
 doc/guides/tools/testregex.rst |   3 +
 2 files changed, 112 insertions(+), 25 deletions(-)

diff --git a/app/test-regex/main.c b/app/test-regex/main.c
index aea4fa6b88..82cffaacfa 100644
--- a/app/test-regex/main.c
+++ b/app/test-regex/main.c
@@ -35,6 +35,7 @@ enum app_args {
 	ARG_NUM_OF_ITERATIONS,
 	ARG_NUM_OF_QPS,
 	ARG_NUM_OF_LCORES,
+	ARG_NUM_OF_MBUF_SEGS,
 };
 
 struct job_ctx {
@@ -70,6 +71,7 @@ struct regex_conf {
 	char *data_buf;
 	long data_len;
 	long job_len;
+	uint32_t nb_segs;
 };
 
 static void
@@ -82,14 +84,15 @@ usage(const char *prog_name)
 		" --perf N: only outputs the performance data\n"
 		" --nb_iter N: number of iteration to run\n"
 		" --nb_qps N: number of queues to use\n"
-		" --nb_lcores N: number of lcores to use\n",
+		" --nb_lcores N: number of lcores to use\n"
+		" --nb_segs N: number of mbuf segments\n",
 		prog_name);
 }
 
 static void
 args_parse(int argc, char **argv, char *rules_file, char *data_file,
 	   uint32_t *nb_jobs, bool *perf_mode, uint32_t *nb_iterations,
-	   uint32_t *nb_qps, uint32_t *nb_lcores)
+	   uint32_t *nb_qps, uint32_t *nb_lcores, uint32_t *nb_segs)
 {
 	char **argvopt;
 	int opt;
@@ -111,6 +114,8 @@ args_parse(int argc, char **argv, char *rules_file, char *data_file,
 		{ "nb_qps", 1, 0, ARG_NUM_OF_QPS},
 		/* Number of lcores. */
 		{ "nb_lcores", 1, 0, ARG_NUM_OF_LCORES},
+		/* Number of mbuf segments. */
+		{ "nb_segs", 1, 0, ARG_NUM_OF_MBUF_SEGS},
 		/* End of options */
 		{ 0, 0, 0, 0 }
 	};
@@ -150,6 +155,9 @@ args_parse(int argc, char **argv, char *rules_file, char *data_file,
 		case ARG_NUM_OF_LCORES:
 			*nb_lcores = atoi(optarg);
 			break;
+		case ARG_NUM_OF_MBUF_SEGS:
+			*nb_segs = atoi(optarg);
+			break;
 		case ARG_HELP:
 			usage("RegEx test app");
 			break;
@@ -302,11 +310,75 @@ extbuf_free_cb(void *addr __rte_unused, void *fcb_opaque __rte_unused)
 {
 }
 
+static inline struct rte_mbuf *
+regex_create_segmented_mbuf(struct rte_mempool *mbuf_pool, int pkt_len,
+		int nb_segs, void *buf) {
+
+	struct rte_mbuf *m = NULL, *mbuf = NULL;
+	uint8_t *dst;
+	char *src = buf;
+	int data_len = 0;
+	int i, size;
+	int t_len;
+
+	if (pkt_len < 1) {
+		printf("Packet size must be 1 or more (is %d)\n", pkt_len);
+		return NULL;
+	}
+
+	if (nb_segs < 1) {
+		printf("Number of segments must be 1 or more (is %d)\n",
+				nb_segs);
+		return NULL;
+	}
+
+	t_len = pkt_len >= nb_segs ? (pkt_len / nb_segs +
+				     !!(pkt_len % nb_segs)) : 1;
+	size = pkt_len;
+
+	/* Create chained mbuf_src and fill it with buf data */
+	for (i = 0; size > 0; i++) {
+
+		m = rte_pktmbuf_alloc(mbuf_pool);
+		if (i == 0)
+			mbuf = m;
+
+		if (m == NULL) {
+			printf("Cannot create segment for source mbuf\n");
+			goto fail;
+		}
+
+		data_len = size > t_len ? t_len : size;
+		memset(rte_pktmbuf_mtod(m, uint8_t *), 0,
+				rte_pktmbuf_tailroom(m));
+		memcpy(rte_pktmbuf_mtod(m, uint8_t *), src, data_len);
+		dst = (uint8_t *)rte_pktmbuf_append(m, data_len);
+		if (dst == NULL) {
+			printf("Cannot append %d bytes to the mbuf\n",
+					data_len);
+			goto fail;
+		}
+
+		if (mbuf != m)
+			rte_pktmbuf_chain(mbuf, m);
+		src += data_len;
+		size -= data_len;
+
+	}
+	return mbuf;
+
+fail:
+	if (mbuf)
+		rte_pktmbuf_free(mbuf);
+	return NULL;
+}
+
 static int
 run_regex(void *args)
 {
 	struct regex_conf *rgxc = args;
 	uint32_t nb_jobs = rgxc->nb_jobs;
+	uint32_t nb_segs = rgxc->nb_segs;
 	uint32_t nb_iterations = rgxc->nb_iterations;
 	uint8_t nb_max_matches = rgxc->nb_max_matches;
 	uint32_t nb_qps = rgxc->nb_qps;
@@ -338,8 +410,12 @@ run_regex(void *args)
 	snprintf(mbuf_pool,
 		 sizeof(mbuf_pool),
 		 "mbuf_pool_%2u", qp_id_base);
-	mbuf_mp = rte_pktmbuf_pool_create(mbuf_pool, nb_jobs * nb_qps, 0,
-			0, MBUF_SIZE, rte_socket_id());
+	mbuf_mp = rte_pktmbuf_pool_create(mbuf_pool,
+			rte_align32pow2(nb_jobs * nb_qps * nb_segs),
+			0, 0, (nb_segs == 1) ? MBUF_SIZE :
+			(rte_align32pow2(job_len) / nb_segs +
+			RTE_PKTMBUF_HEADROOM),
+			rte_socket_id());
 	if (mbuf_mp == NULL) {
 		printf("Error, can't create memory pool\n");
 		return -ENOMEM;
@@ -375,8 +451,19 @@ run_regex(void *args)
 			goto end;
 		}
 
+		if (clone_buf(data_buf, &buf, data_len)) {
+			printf("Error, can't clone buf.\n");
+			res = -EXIT_FAILURE;
+			goto end;
+		}
+
+		/* Assign each mbuf with the data to handle. */
+		actual_jobs = 0;
+		pos = 0;
 		/* Allocate the jobs and assign each job with an mbuf. */
-		for (i = 0; i < nb_jobs; i++) {
+		for (i = 0; (pos < data_len) && (i < nb_jobs) ; i++) {
+			long act_job_len = RTE_MIN(job_len, data_len - pos);
+
 			ops[i] = rte_malloc(NULL, sizeof(*ops[0]) +
 					nb_max_matches *
 					sizeof(struct rte_regexdev_match), 0);
@@ -386,30 +473,26 @@ run_regex(void *args)
 				res = -ENOMEM;
 				goto end;
 			}
-			ops[i]->mbuf = rte_pktmbuf_alloc(mbuf_mp);
+			if (nb_segs > 1) {
+				ops[i]->mbuf = regex_create_segmented_mbuf
+							(mbuf_mp, act_job_len,
+							 nb_segs, &buf[pos]);
+			} else {
+				ops[i]->mbuf = rte_pktmbuf_alloc(mbuf_mp);
+				if (ops[i]->mbuf) {
+					rte_pktmbuf_attach_extbuf(ops[i]->mbuf,
+					&buf[pos], 0, act_job_len, &shinfo);
+					ops[i]->mbuf->data_len = job_len;
+					ops[i]->mbuf->pkt_len = act_job_len;
+				}
+			}
 			if (!ops[i]->mbuf) {
-				printf("Error, can't attach mbuf.\n");
+				printf("Error, can't add mbuf.\n");
 				res = -ENOMEM;
 				goto end;
 			}
-		}
 
-		if (clone_buf(data_buf, &buf, data_len)) {
-			printf("Error, can't clone buf.\n");
-			res = -EXIT_FAILURE;
-			goto end;
-		}
-
-		/* Assign each mbuf with the data to handle. */
-		actual_jobs = 0;
-		pos = 0;
-		for (i = 0; (pos < data_len) && (i < nb_jobs) ; i++) {
-			long act_job_len = RTE_MIN(job_len, data_len - pos);
-			rte_pktmbuf_attach_extbuf(ops[i]->mbuf, &buf[pos], 0,
-					act_job_len, &shinfo);
 			jobs_ctx[i].mbuf = ops[i]->mbuf;
-			ops[i]->mbuf->data_len = job_len;
-			ops[i]->mbuf->pkt_len = act_job_len;
 			ops[i]->user_id = i;
 			ops[i]->group_id0 = 1;
 			pos += act_job_len;
@@ -612,7 +695,7 @@ main(int argc, char **argv)
 	char *data_buf;
 	long data_len;
 	long job_len;
-	uint32_t nb_lcores = 1;
+	uint32_t nb_lcores = 1, nb_segs = 1;
 	struct regex_conf *rgxc;
 	uint32_t i;
 	struct qps_per_lcore *qps_per_lcore;
@@ -626,7 +709,7 @@ main(int argc, char **argv)
 	if (argc > 1)
 		args_parse(argc, argv, rules_file, data_file, &nb_jobs,
 				&perf_mode, &nb_iterations, &nb_qps,
-				&nb_lcores);
+				&nb_lcores, &nb_segs);
 
 	if (nb_qps == 0)
 		rte_exit(EXIT_FAILURE, "Number of QPs must be greater than 0\n");
@@ -656,6 +739,7 @@ main(int argc, char **argv)
 	for (i = 0; i < nb_lcores; i++) {
 		rgxc[i] = (struct regex_conf){
 			.nb_jobs = nb_jobs,
+			.nb_segs = nb_segs,
 			.perf_mode = perf_mode,
 			.nb_iterations = nb_iterations,
 			.nb_max_matches = nb_max_matches,
diff --git a/doc/guides/tools/testregex.rst b/doc/guides/tools/testregex.rst
index a59acd919f..cdb1ffd6ee 100644
--- a/doc/guides/tools/testregex.rst
+++ b/doc/guides/tools/testregex.rst
@@ -68,6 +68,9 @@ Application Options
 ``--nb_iter N``
   number of iteration to run
 
+``--nb_segs N``
+  number of mbuf segments
+
 ``--help``
   print application options
 
-- 
2.25.1


* Re: [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf
  2021-04-07  7:21 ` [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf Suanming Mou
                     ` (2 preceding siblings ...)
  2021-04-07  7:21   ` [dpdk-dev] [PATCH v5 3/3] app/test-regex: support scattered mbuf input Suanming Mou
@ 2021-04-08 20:53   ` Thomas Monjalon
  3 siblings, 0 replies; 36+ messages in thread
From: Thomas Monjalon @ 2021-04-08 20:53 UTC (permalink / raw)
  To: Suanming Mou; +Cc: orika, dev, viacheslavo, matan, rasland, asafp

> Suanming Mou (3):
>   common/mlx5: add user memory registration bits
>   regex/mlx5: add data path scattered mbuf process
>   app/test-regex: support scattered mbuf input

Applied, thanks



Thread overview: 36+ messages
2021-03-09 23:57 [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Suanming Mou
2021-03-09 23:57 ` [dpdk-dev] [PATCH 1/3] common/mlx5: add user memory registration bits Suanming Mou
2021-03-09 23:57 ` [dpdk-dev] [PATCH 2/3] regex/mlx5: add data path scattered mbuf process Suanming Mou
2021-03-09 23:57 ` [dpdk-dev] [PATCH 3/3] app/test-regex: support scattered mbuf input Suanming Mou
2021-03-24 21:14 ` [dpdk-dev] [PATCH 0/3] regex/mlx5: support scattered mbuf Thomas Monjalon
2021-03-25  4:32 ` [dpdk-dev] [PATCH v2 0/4] " Suanming Mou
2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 1/4] common/mlx5: add user memory registration bits Suanming Mou
2021-03-29  9:29     ` Ori Kam
2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 2/4] regex/mlx5: add data path scattered mbuf process Suanming Mou
2021-03-29  9:34     ` Ori Kam
2021-03-29  9:52       ` Suanming Mou
2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 3/4] app/test-regex: support scattered mbuf input Suanming Mou
2021-03-29  9:27     ` Ori Kam
2021-03-25  4:32   ` [dpdk-dev] [PATCH v2 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode Suanming Mou
2021-03-29  9:35     ` Ori Kam
2021-03-30  1:39 ` [dpdk-dev] [PATCH v3 0/4] regex/mlx5: support scattered mbuf Suanming Mou
2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 1/4] common/mlx5: add user memory registration bits Suanming Mou
2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 2/4] regex/mlx5: add data path scattered mbuf process Suanming Mou
2021-03-30  8:05     ` Slava Ovsiienko
2021-03-30  9:00       ` Suanming Mou
2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 3/4] app/test-regex: support scattered mbuf input Suanming Mou
2021-03-30  1:39   ` [dpdk-dev] [PATCH v3 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode Suanming Mou
2021-04-06 16:22     ` Thomas Monjalon
2021-04-07  1:00       ` Suanming Mou
2021-04-07  7:11         ` Thomas Monjalon
2021-04-07  7:14           ` Suanming Mou
2021-03-31  7:37 ` [dpdk-dev] [PATCH v4 0/4] regex/mlx5: support scattered mbuf Suanming Mou
2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 1/4] common/mlx5: add user memory registration bits Suanming Mou
2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 2/4] regex/mlx5: add data path scattered mbuf process Suanming Mou
2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 3/4] app/test-regex: support scattered mbuf input Suanming Mou
2021-03-31  7:37   ` [dpdk-dev] [PATCH v4 4/4] regex/mlx5: prevent wrong calculation of free sqs in umr mode Suanming Mou
2021-04-07  7:21 ` [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf Suanming Mou
2021-04-07  7:21   ` [dpdk-dev] [PATCH v5 1/3] common/mlx5: add user memory registration bits Suanming Mou
2021-04-07  7:21   ` [dpdk-dev] [PATCH v5 2/3] regex/mlx5: add data path scattered mbuf process Suanming Mou
2021-04-07  7:21   ` [dpdk-dev] [PATCH v5 3/3] app/test-regex: support scattered mbuf input Suanming Mou
2021-04-08 20:53   ` [dpdk-dev] [PATCH v5 0/3] regex/mlx5: support scattered mbuf Thomas Monjalon
