DPDK patches and discussions
* [dpdk-dev] [PATCH 0/4] net/mlx5: implicit mempool registration
@ 2021-08-18  9:07 Dmitry Kozlyuk
  2021-08-18  9:07 ` [dpdk-dev] [PATCH 1/4] mempool: add event callbacks Dmitry Kozlyuk
                   ` (4 more replies)
  0 siblings, 5 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-08-18  9:07 UTC (permalink / raw)
  To: dev

MLX5 hardware has an internal IOMMU where the PMD registers the memory.
On the data path, the PMD translates a VA into a key consumed by the device
IOMMU. It is impractical for the PMD to register all allocated memory
because of the increased lookup cost both in HW and SW. Most often mbuf
memory comes from mempools, so if the PMD tracks them, it can almost always
have mbuf memory registered before an mbuf hits the PMD. This patchset
adds such tracking in the PMD and an internal API to support it.

Please see [1] for a more thorough explanation of the patch 2/4
and how it can be useful outside of the MLX5 PMD.

[1]: http://inbox.dpdk.org/dev/CH0PR12MB509112FADB778AB28AF3771DB9F99@CH0PR12MB5091.namprd12.prod.outlook.com/

Dmitry Kozlyuk (4):
  mempool: add event callbacks
  mempool: add non-IO flag
  common/mlx5: add mempool registration facilities
  net/mlx5: support mempool registration

 doc/guides/nics/mlx5.rst               |  11 +
 doc/guides/rel_notes/release_21_11.rst |   9 +
 drivers/common/mlx5/mlx5_common_mp.c   |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h   |  14 +
 drivers/common/mlx5/mlx5_common_mr.c   | 564 +++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h   |  17 +
 drivers/common/mlx5/version.map        |   5 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 ++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/linux/mlx5_os.h       |   2 +
 drivers/net/mlx5/mlx5.c                | 128 ++++++
 drivers/net/mlx5/mlx5.h                |  13 +
 drivers/net/mlx5/mlx5_mr.c             |  27 ++
 drivers/net/mlx5/mlx5_trigger.c        |  10 +-
 lib/mempool/rte_mempool.c              | 153 ++++++-
 lib/mempool/rte_mempool.h              |  60 +++
 lib/mempool/version.map                |   8 +
 17 files changed, 1110 insertions(+), 9 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH 1/4] mempool: add event callbacks
  2021-08-18  9:07 [dpdk-dev] [PATCH 0/4] net/mlx5: implicit mempool registration Dmitry Kozlyuk
@ 2021-08-18  9:07 ` Dmitry Kozlyuk
  2021-10-12  3:12   ` Jerin Jacob
  2021-08-18  9:07 ` [dpdk-dev] [PATCH 2/4] mempool: add non-IO flag Dmitry Kozlyuk
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-08-18  9:07 UTC (permalink / raw)
  To: dev
  Cc: Matan Azrad, Olivier Matz, Andrew Rybchenko, Ray Kinsella,
	Anatoly Burakov

Performance of MLX5 PMDs of different classes can benefit if the PMD
knows in advance which memory it will need to handle, before the first
mbuf is sent to the PMD. It is impractical, however, to consider
all allocated memory for this purpose. Most often mbuf memory comes
from mempools that can come and go. The PMD can enumerate existing
mempools on device start, but it also needs to track mempool creation
and destruction after forwarding starts and before an mbuf
from a new mempool is sent to the device.

Add an internal API to register callbacks for mempool life cycle events,
currently RTE_MEMPOOL_EVENT_CREATE and RTE_MEMPOOL_EVENT_DESTROY:
* rte_mempool_event_callback_register()
* rte_mempool_event_callback_unregister()
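
For illustration, a minimal sketch of how a consumer of this internal
API might use it; the "dummy_" names and the callback body are
hypothetical and not part of this patch:

#include <stdio.h>
#include <rte_errno.h>
#include <rte_mempool.h>

/* Hypothetical per-driver state passed back to the callback. */
struct dummy_driver_ctx {
	unsigned int pools_seen;
};

/* Invoked for every mempool creation and destruction once registered. */
static void
dummy_mempool_event_cb(enum rte_mempool_event event,
		       struct rte_mempool *mp, void *arg)
{
	struct dummy_driver_ctx *ctx = arg;

	if (event == RTE_MEMPOOL_EVENT_CREATE) {
		ctx->pools_seen++;
		printf("mempool %s created\n", mp->name);
	} else if (event == RTE_MEMPOOL_EVENT_DESTROY) {
		printf("mempool %s is about to be destroyed\n", mp->name);
	}
}

/* Typically called on device start; EEXIST means this (func, arg) pair
 * is already registered, which is not an error for the caller.
 */
static int
dummy_driver_subscribe(struct dummy_driver_ctx *ctx)
{
	int ret;

	ret = rte_mempool_event_callback_register(dummy_mempool_event_cb,
						  ctx);
	return (ret != 0 && rte_errno != EEXIST) ? ret : 0;
}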

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 lib/mempool/rte_mempool.c | 153 ++++++++++++++++++++++++++++++++++++--
 lib/mempool/rte_mempool.h |  56 ++++++++++++++
 lib/mempool/version.map   |   8 ++
 3 files changed, 212 insertions(+), 5 deletions(-)

diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 59a588425b..0ec56ad278 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -42,6 +42,18 @@ static struct rte_tailq_elem rte_mempool_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_mempool_tailq)
 
+TAILQ_HEAD(mempool_callback_list, rte_tailq_entry);
+
+static struct rte_tailq_elem callback_tailq = {
+	.name = "RTE_MEMPOOL_CALLBACK",
+};
+EAL_REGISTER_TAILQ(callback_tailq)
+
+/* Invoke all registered mempool event callbacks. */
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp);
+
 #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
 #define CALC_CACHE_FLUSHTHRESH(c)	\
 	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
@@ -722,6 +734,7 @@ rte_mempool_free(struct rte_mempool *mp)
 	}
 	rte_mcfg_tailq_write_unlock();
 
+	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
 	rte_mempool_trace_free(mp);
 	rte_mempool_free_memchunks(mp);
 	rte_mempool_ops_free(mp);
@@ -778,10 +791,10 @@ rte_mempool_cache_free(struct rte_mempool_cache *cache)
 }
 
 /* create an empty mempool */
-struct rte_mempool *
-rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
-	unsigned cache_size, unsigned private_data_size,
-	int socket_id, unsigned flags)
+static struct rte_mempool *
+mempool_create_empty(const char *name, unsigned int n,
+	unsigned int elt_size, unsigned int cache_size,
+	unsigned int private_data_size, int socket_id, unsigned int flags)
 {
 	char mz_name[RTE_MEMZONE_NAMESIZE];
 	struct rte_mempool_list *mempool_list;
@@ -915,6 +928,19 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 	return NULL;
 }
 
+struct rte_mempool *
+rte_mempool_create_empty(const char *name, unsigned int n,
+	unsigned int elt_size, unsigned int cache_size,
+	unsigned int private_data_size, int socket_id, unsigned int flags)
+{
+	struct rte_mempool *mp;
+
+	mp = mempool_create_empty(name, n, elt_size, cache_size,
+		private_data_size, socket_id, flags);
+	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_CREATE, mp);
+	return mp;
+}
+
 /* create the mempool */
 struct rte_mempool *
 rte_mempool_create(const char *name, unsigned n, unsigned elt_size,
@@ -926,7 +952,7 @@ rte_mempool_create(const char *name, unsigned n, unsigned elt_size,
 	int ret;
 	struct rte_mempool *mp;
 
-	mp = rte_mempool_create_empty(name, n, elt_size, cache_size,
+	mp = mempool_create_empty(name, n, elt_size, cache_size,
 		private_data_size, socket_id, flags);
 	if (mp == NULL)
 		return NULL;
@@ -958,6 +984,8 @@ rte_mempool_create(const char *name, unsigned n, unsigned elt_size,
 	if (obj_init)
 		rte_mempool_obj_iter(mp, obj_init, obj_init_arg);
 
+	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_CREATE, mp);
+
 	rte_mempool_trace_create(name, n, elt_size, cache_size,
 		private_data_size, mp_init, mp_init_arg, obj_init,
 		obj_init_arg, flags, mp);
@@ -1343,3 +1371,118 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
 
 	rte_mcfg_mempool_read_unlock();
 }
+
+struct mempool_callback {
+	rte_mempool_event_callback *func;
+	void *arg;
+};
+
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te;
+	void *tmp_te;
+
+	rte_mcfg_tailq_read_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback *cb = te->data;
+		rte_mcfg_tailq_read_unlock();
+		cb->func(event, mp, cb->arg);
+		rte_mcfg_tailq_read_lock();
+	}
+	rte_mcfg_tailq_read_unlock();
+}
+
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *arg)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback *cb;
+	void *tmp_te;
+	int ret;
+
+	rte_mcfg_mempool_read_lock();
+	rte_mcfg_tailq_write_lock();
+
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback *cb =
+					(struct mempool_callback *)te->data;
+		if (cb->func == func && cb->arg == arg) {
+			ret = -EEXIST;
+			goto exit;
+		}
+	}
+
+	te = rte_zmalloc("MEMPOOL_TAILQ_ENTRY", sizeof(*te), 0);
+	if (te == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback tailq entry!\n");
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb = rte_malloc("MEMPOOL_EVENT_CALLBACK", sizeof(*cb), 0);
+	if (cb == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback!\n");
+		rte_free(te);
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb->func = func;
+	cb->arg = arg;
+	te->data = cb;
+	TAILQ_INSERT_TAIL(list, te, next);
+	ret = 0;
+
+exit:
+	rte_mcfg_tailq_write_unlock();
+	rte_mcfg_mempool_read_unlock();
+	rte_errno = -ret;
+	return ret;
+}
+
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *arg)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback *cb;
+	int ret;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		rte_errno = EPERM;
+		return -1;
+	}
+
+	rte_mcfg_mempool_read_lock();
+	rte_mcfg_tailq_write_lock();
+	ret = -ENOENT;
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH(te, list, next) {
+		cb = (struct mempool_callback *)te->data;
+		if (cb->func == func && cb->arg == arg)
+			break;
+	}
+	if (te != NULL) {
+		TAILQ_REMOVE(list, te, next);
+		ret = 0;
+	}
+	rte_mcfg_tailq_write_unlock();
+	rte_mcfg_mempool_read_unlock();
+
+	if (ret == 0) {
+		rte_free(te);
+		rte_free(cb);
+	}
+	rte_errno = -ret;
+	return ret;
+}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 4235d6f0bf..1e9b8f0229 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1775,6 +1775,62 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
 int
 rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
 
+/**
+ * Mempool event type.
+ * @internal
+ */
+enum rte_mempool_event {
+	/** Occurs after a successful mempool creation. */
+	RTE_MEMPOOL_EVENT_CREATE = 0,
+	/** Occurs before destruction of a mempool begins. */
+	RTE_MEMPOOL_EVENT_DESTROY = 1,
+};
+
+/**
+ * @internal
+ * Mempool event callback.
+ */
+typedef void (rte_mempool_event_callback)(
+		enum rte_mempool_event event,
+		struct rte_mempool *mp,
+		void *arg);
+
+/**
+ * @internal
+ * Register a callback invoked on mempool life cycle event.
+ * Callbacks will be invoked in the process that creates the mempool.
+ *
+ * @param cb
+ *   Callback function.
+ * @param cb_arg
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *cb,
+				    void *cb_arg);
+
+/**
+ * @internal
+ * Unregister a callback added with rte_mempool_event_callback_register().
+ * @p cb and @p arg must exactly match registration parameters.
+ *
+ * @param cb
+ *   Callback function.
+ * @param cb_arg
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *cb,
+				      void *cb_arg);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/mempool/version.map b/lib/mempool/version.map
index 9f77da6fff..1b7d7c5456 100644
--- a/lib/mempool/version.map
+++ b/lib/mempool/version.map
@@ -64,3 +64,11 @@ EXPERIMENTAL {
 	__rte_mempool_trace_ops_free;
 	__rte_mempool_trace_set_ops_byname;
 };
+
+INTERNAL {
+	global:
+
+	# added in 21.11
+	rte_mempool_event_callback_register;
+	rte_mempool_event_callback_unregister;
+};
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH 2/4] mempool: add non-IO flag
  2021-08-18  9:07 [dpdk-dev] [PATCH 0/4] net/mlx5: implicit mempool registration Dmitry Kozlyuk
  2021-08-18  9:07 ` [dpdk-dev] [PATCH 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-08-18  9:07 ` Dmitry Kozlyuk
  2021-08-18  9:07 ` [dpdk-dev] [PATCH 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-08-18  9:07 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Olivier Matz, Andrew Rybchenko

Mempool is a generic allocator that is not necessarily used for device
IO operations, and its memory is not necessarily used for DMA. Add the
MEMPOOL_F_NON_IO flag to mark mempools that are never used for device IO.
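
For illustration, a minimal sketch of marking a control-plane pool that
never touches a device; the pool name and sizes are arbitrary:

#include <rte_memory.h>
#include <rte_mempool.h>

/* Objects of this pool carry bookkeeping data only and never reach
 * a device, so components honoring the hint can skip DMA setup for it.
 */
static struct rte_mempool *
create_ctrl_pool(void)
{
	return rte_mempool_create("ctrl_pool", 1024, 128,
				  0, 0,		/* no cache, no private data */
				  NULL, NULL,	/* no pool constructor */
				  NULL, NULL,	/* no per-object constructor */
				  SOCKET_ID_ANY,
				  MEMPOOL_F_NON_IO);
}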

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/rel_notes/release_21_11.rst | 3 +++
 lib/mempool/rte_mempool.h              | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index d707a554ef..dc9b98b862 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -84,6 +84,9 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =======================================================
 
+* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
+  that objects from this pool will not be used for device IO (e.g. DMA).
+
 
 ABI Changes
 -----------
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e9b8f0229..7f0657ab16 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -263,6 +263,7 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
+#define MEMPOOL_F_NON_IO         0x0040 /**< Not used for device IO (DMA). */
 
 /**
  * @internal When debug is enabled, store some statistics.
@@ -992,6 +993,9 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, void *);
  *     "single-consumer". Otherwise, it is "multi-consumers".
  *   - MEMPOOL_F_NO_IOVA_CONTIG: If set, allocated objects won't
  *     necessarily be contiguous in IO memory.
+ *   - MEMPOOL_F_NON_IO: If set, the mempool is considered to be
+ *     never used for device IO, i.e. DMA operations,
+ *     which may affect some PMD behavior.
  * @return
  *   The pointer to the new allocated mempool, on success. NULL on error
  *   with rte_errno set appropriately. Possible rte_errno values include:
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH 3/4] common/mlx5: add mempool registration facilities
  2021-08-18  9:07 [dpdk-dev] [PATCH 0/4] net/mlx5: implicit mempool registration Dmitry Kozlyuk
  2021-08-18  9:07 ` [dpdk-dev] [PATCH 1/4] mempool: add event callbacks Dmitry Kozlyuk
  2021-08-18  9:07 ` [dpdk-dev] [PATCH 2/4] mempool: add non-IO flag Dmitry Kozlyuk
@ 2021-08-18  9:07 ` Dmitry Kozlyuk
  2021-08-18  9:07 ` [dpdk-dev] [PATCH 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
  2021-09-29 14:52 ` [dpdk-dev] [PATCH 0/4] net/mlx5: implicit " dkozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-08-18  9:07 UTC (permalink / raw)
  To: dev
  Cc: Matan Azrad, Shahaf Shuler, Viacheslav Ovsiienko, Ray Kinsella,
	Anatoly Burakov

Add an internal API to register mempools, that is, to create memory
regions (MR) for their memory and store them in a separate database.
The implementation handles multi-process, so that class drivers don't
need to. Each protection domain has its own database. Memory regions
can be shared within a database if they represent a single hugepage
covering one or more mempools entirely.

Add an internal API to look up an MR key for an address that belongs
to a known mempool. It is the responsibility of a class driver
to extract the mempool from an mbuf.
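
For illustration, a condensed sketch of the intended call flow
in a class driver; the "example_" wrappers are hypothetical
and error handling is omitted:

#include <stdint.h>

#include <rte_mbuf.h>
#include <rte_mempool.h>

#include "mlx5_common_mp.h"
#include "mlx5_common_mr.h"

/* Control path: create MRs for the mempool memory in the given PD.
 * In a secondary process the request is proxied to the primary
 * through mp_id; in the primary, mp_id may be NULL.
 */
static int
example_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
{
	return mlx5_mr_mempool_register(share_cache, pd, mp, mp_id);
}

/* Data path (slow path): resolve the MR key for a direct mbuf.
 * Extracting the mempool is the class driver's job; for a plain
 * direct mbuf it is simply mb->pool.
 * Returns UINT32_MAX if the address is not from a registered mempool.
 */
static uint32_t
example_mbuf_lkey(struct mlx5_mr_share_cache *share_cache,
		  struct mlx5_mr_ctrl *mr_ctrl, struct rte_mbuf *mb)
{
	return mlx5_mr_mempool2mr_bh(share_cache, mr_ctrl, mb->pool,
				     (uintptr_t)mb->buf_addr);
}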

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 drivers/common/mlx5/mlx5_common_mp.c |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h |  14 +
 drivers/common/mlx5/mlx5_common_mr.c | 564 +++++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h |  17 +
 drivers/common/mlx5/version.map      |   5 +
 5 files changed, 650 insertions(+)

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
index 673a7c31de..6dfc5535e0 100644
--- a/drivers/common/mlx5/mlx5_common_mp.c
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -54,6 +54,56 @@ mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
 	return ret;
 }
 
+/**
+ * @param mp_id
+ *   ID of the MP process.
+ * @param share_cache
+ *   Shared MR cache.
+ * @param pd
+ *   Protection domain.
+ * @param mempool
+ *   Mempool to register or unregister.
+ * @param reg
+ *   True to register the mempool, False to unregister.
+ */
+int
+mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_arg_mempool_reg *arg = &req->args.mempool_reg;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	enum mlx5_mp_req_type type;
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	type = reg ? MLX5_MP_REQ_MEMPOOL_REGISTER :
+		     MLX5_MP_REQ_MEMPOOL_UNREGISTER;
+	mp_init_msg(mp_id, &mp_req, type);
+	arg->share_cache = share_cache;
+	arg->pd = pd;
+	arg->mempool = mempool;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	mlx5_free(mp_rep.msgs);
+	return ret;
+}
+
 /**
  * Request Verbs queue state modification to the primary process.
  *
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
index 6829141fc7..527bf3cad8 100644
--- a/drivers/common/mlx5/mlx5_common_mp.h
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -14,6 +14,8 @@
 enum mlx5_mp_req_type {
 	MLX5_MP_REQ_VERBS_CMD_FD = 1,
 	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_MEMPOOL_REGISTER,
+	MLX5_MP_REQ_MEMPOOL_UNREGISTER,
 	MLX5_MP_REQ_START_RXTX,
 	MLX5_MP_REQ_STOP_RXTX,
 	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
@@ -33,6 +35,12 @@ struct mlx5_mp_arg_queue_id {
 	uint16_t queue_id; /* DPDK queue ID. */
 };
 
+struct mlx5_mp_arg_mempool_reg {
+	struct mlx5_mr_share_cache *share_cache;
+	void *pd; /* NULL for MLX5_MP_REQ_MEMPOOL_UNREGISTER */
+	struct rte_mempool *mempool;
+};
+
 /* Pameters for IPC. */
 struct mlx5_mp_param {
 	enum mlx5_mp_req_type type;
@@ -41,6 +49,8 @@ struct mlx5_mp_param {
 	RTE_STD_C11
 	union {
 		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_mempool_reg mempool_reg;
+		/* MLX5_MP_REQ_MEMPOOL_(UN)REGISTER */
 		struct mlx5_mp_arg_queue_state_modify state_modify;
 		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
 		struct mlx5_mp_arg_queue_id queue_id;
@@ -91,6 +101,10 @@ void mlx5_mp_uninit_secondary(const char *name);
 __rte_internal
 int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
 __rte_internal
+int mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg);
+__rte_internal
 int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
 				   struct mlx5_mp_arg_queue_state_modify *sm);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
index 98fe8698e2..21a83d6e1b 100644
--- a/drivers/common/mlx5/mlx5_common_mr.c
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -2,7 +2,10 @@
  * Copyright 2016 6WIND S.A.
  * Copyright 2020 Mellanox Technologies, Ltd
  */
+#include <stddef.h>
+
 #include <rte_eal_memconfig.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_mempool.h>
 #include <rte_malloc.h>
@@ -21,6 +24,29 @@ struct mr_find_contig_memsegs_data {
 	const struct rte_memseg_list *msl;
 };
 
+/* Virtual memory range. */
+struct mlx5_range {
+	uintptr_t start;
+	uintptr_t end;
+};
+
+/** Memory region for a mempool. */
+struct mlx5_mempool_mr {
+	struct mlx5_pmd_mr pmd_mr;
+	uint32_t refcnt; /**< Number of mempools sharing this MR. */
+};
+
+/* Mempool registration. */
+struct mlx5_mempool_reg {
+	LIST_ENTRY(mlx5_mempool_reg) next;
+	/** Registered mempool, used to designate registrations. */
+	struct rte_mempool *mp;
+	/** Memory regions for the address ranges of the mempool. */
+	struct mlx5_mempool_mr *mrs;
+	/** Number of memory regions. */
+	unsigned int mrs_n;
+};
+
 /**
  * Expand B-tree table to a given size. Can't be called with holding
  * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
@@ -1191,3 +1217,541 @@ mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
 	rte_rwlock_read_unlock(&share_cache->rwlock);
 #endif
 }
+
+static int
+mlx5_range_compare_start(const void *lhs, const void *rhs)
+{
+	const struct mlx5_range *r1 = lhs, *r2 = rhs;
+
+	if (r1->start > r2->start)
+		return 1;
+	else if (r1->start < r2->start)
+		return -1;
+	return 0;
+}
+
+static void
+mlx5_range_from_mempool_chunk(struct rte_mempool *mp, void *opaque,
+			      struct rte_mempool_memhdr *memhdr,
+			      unsigned int idx)
+{
+	struct mlx5_range *ranges = opaque, *range = &ranges[idx];
+	uint64_t page_size = rte_mem_page_size();
+
+	RTE_SET_USED(mp);
+	range->start = RTE_ALIGN_FLOOR((uintptr_t)memhdr->addr, page_size);
+	range->end = RTE_ALIGN_CEIL(range->start + memhdr->len, page_size);
+}
+
+/**
+ * Get VA-contiguous ranges of the mempool memory.
+ * Each range start and end is aligned to the system page size.
+ *
+ * @param[in] mp
+ *   Analyzed mempool.
+ * @param[out] out
+ *   Receives the ranges, caller must release it with free().
+ * @param[out] out_n
+ *   Receives the number of @p out elements.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_get_mempool_ranges(struct rte_mempool *mp, struct mlx5_range **out,
+			unsigned int *out_n)
+{
+	struct mlx5_range *chunks;
+	unsigned int chunks_n = mp->nb_mem_chunks, contig_n, i;
+
+	/* Collect page-aligned memory ranges of the mempool. */
+	chunks = calloc(sizeof(chunks[0]), chunks_n);
+	if (chunks == NULL)
+		return -1;
+	rte_mempool_mem_iter(mp, mlx5_range_from_mempool_chunk, chunks);
+	/* Merge adjacent chunks and place them at the beginning. */
+	qsort(chunks, chunks_n, sizeof(chunks[0]), mlx5_range_compare_start);
+	contig_n = 1;
+	for (i = 1; i < chunks_n; i++)
+		if (chunks[i - 1].end != chunks[i].start) {
+			chunks[contig_n - 1].end = chunks[i - 1].end;
+			chunks[contig_n] = chunks[i];
+			contig_n++;
+		}
+	/* Extend the last contiguous chunk to the end of the mempool. */
+	chunks[contig_n - 1].end = chunks[i - 1].end;
+	*out = chunks;
+	*out_n = contig_n;
+	return 0;
+}
+
+/**
+ * Analyze mempool memory to select memory ranges to register.
+ *
+ * @param[in] mp
+ *   Mempool to analyze.
+ * @param[out] out
+ *   Receives memory ranges to register, aligned to the system page size.
+ *   The caller must release them with free().
+ * @param[out] out_n
+ *   Receives the number of @p out items.
+ * @param[out] share_hugepage
+ *   Receives True if the entire pool resides within a single hugepage.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_mempool_reg_analyze(struct rte_mempool *mp, struct mlx5_range **out,
+			 unsigned int *out_n, bool *share_hugepage)
+{
+	struct mlx5_range *ranges = NULL;
+	unsigned int i, ranges_n = 0;
+	struct rte_memseg_list *msl;
+
+	if (mlx5_get_mempool_ranges(mp, &ranges, &ranges_n) < 0) {
+		DRV_LOG(ERR, "Cannot get address ranges for mempool %s",
+			mp->name);
+		return -1;
+	}
+	/* Check if the hugepage of the pool can be shared. */
+	*share_hugepage = false;
+	msl = rte_mem_virt2memseg_list((void *)ranges[0].start);
+	if (msl != NULL) {
+		uint64_t hugepage_sz = 0;
+
+		/* Check that all ranges are on pages of the same size. */
+		for (i = 0; i < ranges_n; i++) {
+			if (hugepage_sz != 0 && hugepage_sz != msl->page_sz)
+				break;
+			hugepage_sz = msl->page_sz;
+		}
+		if (i == ranges_n) {
+			/*
+			 * If the entire pool is within one hugepage,
+			 * combine all ranges into one of the hugepage size.
+			 */
+			uintptr_t reg_start = ranges[0].start;
+			uintptr_t reg_end = ranges[ranges_n - 1].end;
+			uintptr_t hugepage_start =
+				RTE_ALIGN_FLOOR(reg_start, hugepage_sz);
+			uintptr_t hugepage_end = hugepage_start + hugepage_sz;
+			if (reg_end < hugepage_end) {
+				ranges[0].start = hugepage_start;
+				ranges[0].end = hugepage_end;
+				ranges_n = 1;
+				*share_hugepage = true;
+			}
+		}
+	}
+	*out = ranges;
+	*out_n = ranges_n;
+	return 0;
+}
+
+/** Create a registration object for the mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_create(struct rte_mempool *mp, unsigned int mrs_n)
+{
+	struct mlx5_mempool_reg *mpr = NULL;
+
+	mpr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+			  sizeof(*mpr) + mrs_n * sizeof(mpr->mrs[0]),
+			  RTE_CACHE_LINE_SIZE, SOCKET_ID_ANY);
+	if (mpr == NULL) {
+		DRV_LOG(ERR, "Cannot allocate mempool %s registration object",
+			mp->name);
+		return NULL;
+	}
+	mpr->mp = mp;
+	mpr->mrs = (struct mlx5_mempool_mr *)(mpr + 1);
+	mpr->mrs_n = mrs_n;
+	return mpr;
+}
+
+/**
+ * Destroy a mempool registration object.
+ *
+ * @param standalone
+ *   Whether @p mpr owns its MRs exclusively, i.e. they are not shared.
+ */
+static void
+mlx5_mempool_reg_destroy(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mempool_reg *mpr, bool standalone)
+{
+	if (standalone) {
+		unsigned int i;
+
+		for (i = 0; i < mpr->mrs_n; i++)
+			share_cache->dereg_mr_cb(&mpr->mrs[i].pmd_mr);
+	}
+	mlx5_free(mpr);
+}
+
+/** Find registration object of a mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_lookup(struct mlx5_mr_share_cache *share_cache,
+			struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp)
+			break;
+	return mpr;
+}
+
+/** Increment reference counters of MRs used in the registration. */
+static void
+mlx5_mempool_reg_attach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		__atomic_add_fetch(&mpr->mrs[i].refcnt, 1, __ATOMIC_RELAXED);
+}
+
+/**
+ * Decrement reference counters of MRs used in the registration.
+ *
+ * @return True if no more references to @p mpr MRs exist, False otherwise.
+ */
+static bool
+mlx5_mempool_reg_detach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+	bool ret = false;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		ret |= __atomic_sub_fetch(&mpr->mrs[i].refcnt, 1,
+					  __ATOMIC_RELAXED) == 0;
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache,
+				 void *pd, struct rte_mempool *mp)
+{
+	struct mlx5_range *ranges = NULL;
+	struct mlx5_mempool_reg *mpr, *new_mpr;
+	unsigned int i, ranges_n;
+	bool share_hugepage;
+	int ret = -1;
+
+	if (mlx5_mempool_reg_analyze(mp, &ranges, &ranges_n,
+				     &share_hugepage) < 0) {
+		DRV_LOG(ERR, "Cannot get mempool %s memory ranges", mp->name);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	new_mpr = mlx5_mempool_reg_create(mp, ranges_n);
+	if (new_mpr == NULL) {
+		DRV_LOG(ERR,
+			"Cannot create a registration object for mempool %s in PD %p",
+			mp->name, pd);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	/*
+	 * If the entire mempool fits in a single hugepage, the MR for this
+	 * hugepage can be shared across mempools that also fit in it.
+	 */
+	if (share_hugepage) {
+		rte_rwlock_write_lock(&share_cache->rwlock);
+		LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next) {
+			if (mpr->mrs[0].pmd_mr.addr == (void *)ranges[0].start)
+				break;
+		}
+		if (mpr != NULL) {
+			new_mpr->mrs = mpr->mrs;
+			mlx5_mempool_reg_attach(new_mpr);
+			LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+					 new_mpr, next);
+		}
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		if (mpr != NULL) {
+			DRV_LOG(DEBUG, "Shared MR %#x in PD %p for mempool %s with mempool %s",
+				mpr->mrs[0].pmd_mr.lkey, pd, mp->name,
+				mpr->mp->name);
+			ret = 0;
+			goto exit;
+		}
+	}
+	for (i = 0; i < ranges_n; i++) {
+		struct mlx5_mempool_mr *mr = &new_mpr->mrs[i];
+		const struct mlx5_range *range = &ranges[i];
+		size_t len = range->end - range->start;
+
+		if (share_cache->reg_mr_cb(pd, (void *)range->start, len,
+		    &mr->pmd_mr) < 0) {
+			DRV_LOG(ERR,
+				"Failed to create an MR in PD %p for address range "
+				"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+				pd, range->start, range->end, len, mp->name);
+			break;
+		}
+		DRV_LOG(DEBUG,
+			"Created a new MR %#x in PD %p for address range "
+			"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+			mr->pmd_mr.lkey, pd, range->start, range->end, len,
+			mp->name);
+	}
+	if (i != ranges_n) {
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EINVAL;
+		goto exit;
+	}
+	/* Concurrent registration is not supposed to happen. */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr == NULL) {
+		mlx5_mempool_reg_attach(new_mpr);
+		LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+				 new_mpr, next);
+		ret = 0;
+	}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(ERR, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+exit:
+	free(ranges);
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_secondary(struct mlx5_mr_share_cache *share_cache,
+				   void *pd, struct rte_mempool *mp,
+				   struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, pd, mp, true);
+}
+
+/**
+ * Register the memory of a mempool in the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param pd
+ *   Protection domain object.
+ * @param mp
+ *   Mempool to register.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_register_primary(share_cache, pd, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_register_secondary(share_cache, pd, mp,
+							  mp_id);
+	default:
+		return -1;
+	}
+}
+
+static int
+mlx5_mr_mempool_unregister_primary(struct mlx5_mr_share_cache *share_cache,
+				   struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+	bool standalone = false;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp) {
+			standalone = mlx5_mempool_reg_detach(mpr);
+			LIST_REMOVE(mpr, next);
+			break;
+		}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr == NULL) {
+		rte_errno = ENOENT;
+		return -1;
+	}
+	mlx5_mempool_reg_destroy(share_cache, mpr, standalone);
+	return 0;
+}
+
+static int
+mlx5_mr_mempool_unregister_secondary(struct mlx5_mr_share_cache *share_cache,
+				     struct rte_mempool *mp,
+				     struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, NULL, mp, false);
+}
+
+/**
+ * Unregister the memory of a mempool from the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param mp
+ *   Mempool to unregister.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_unregister_primary(share_cache, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_unregister_secondary(share_cache, mp,
+							    mp_id);
+	default:
+		return -1;
+	}
+}
+
+/**
+ * Look up an MR key by an address in a registered mempool.
+ *
+ * @param mpr
+ *   Mempool registration object.
+ * @param addr
+ *   Address within the mempool.
+ * @param entry
+ *   Bottom-half cache entry to fill.
+ *
+ * @return
+ *   MR key or UINT32_MAX on failure, which can only happen
+ *   if the address is not from within the mempool.
+ */
+static uint32_t
+mlx5_mempool_reg_addr2mr(struct mlx5_mempool_reg *mpr, uintptr_t addr,
+			 struct mr_cache_entry *entry)
+{
+	uint32_t lkey = UINT32_MAX;
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++) {
+		const struct mlx5_pmd_mr *mr = &mpr->mrs[i].pmd_mr;
+		uintptr_t mr_addr = (uintptr_t)mr->addr;
+
+		if (mr_addr <= addr) {
+			lkey = rte_cpu_to_be_32(mr->lkey);
+			entry->start = mr_addr;
+			entry->end = mr_addr + mr->len;
+			entry->lkey = lkey;
+			break;
+		}
+	}
+	return lkey;
+}
+
+/**
+ * Update bottom-half cache from the list of mempool registrations.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param entry
+ *   Pointer to an entry in the bottom-half cache to update
+ *   with the MR lkey looked up.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+static uint32_t
+mlx5_lookup_mempool_regs(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mr_ctrl *mr_ctrl,
+			 struct mr_cache_entry *entry,
+			 struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	struct mlx5_mempool_reg *mpr;
+	uint32_t lkey = UINT32_MAX;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in mempool registrations. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr != NULL)
+		lkey = mlx5_mempool_reg_addr2mr(mpr, addr, entry);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/*
+	 * Update local cache. Even if it fails, return the found entry
+	 * to update top-half cache. Next time, this entry will be found
+	 * in the global cache.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half lookup for the address from the mempool.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+uint32_t
+mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+		      struct mlx5_mr_ctrl *mr_ctrl,
+		      struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		lkey = mlx5_lookup_mempool_regs(share_cache, mr_ctrl, repl,
+						mp, addr);
+		/* Can only fail if the address is not from the mempool. */
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
index 6e465a05e9..685ac98e08 100644
--- a/drivers/common/mlx5/mlx5_common_mr.h
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -13,6 +13,7 @@
 
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_mbuf.h>
 #include <rte_memory.h>
 
 #include "mlx5_glue.h"
@@ -75,6 +76,7 @@ struct mlx5_mr_ctrl {
 } __rte_packed;
 
 LIST_HEAD(mlx5_mr_list, mlx5_mr);
+LIST_HEAD(mlx5_mempool_reg_list, mlx5_mempool_reg);
 
 /* Global per-device MR cache. */
 struct mlx5_mr_share_cache {
@@ -83,6 +85,7 @@ struct mlx5_mr_share_cache {
 	struct mlx5_mr_btree cache; /* Global MR cache table. */
 	struct mlx5_mr_list mr_list; /* Registered MR list. */
 	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+	struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */
 	mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
 	mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 } __rte_packed;
@@ -136,6 +139,10 @@ uint32_t mlx5_mr_addr2mr_bh(void *pd, struct mlx5_mp_id *mp_id,
 			    struct mlx5_mr_ctrl *mr_ctrl,
 			    uintptr_t addr, unsigned int mr_ext_memseg_en);
 __rte_internal
+uint32_t mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+			       struct mlx5_mr_ctrl *mr_ctrl,
+			       struct rte_mempool *mp, uintptr_t addr);
+__rte_internal
 void mlx5_mr_release_cache(struct mlx5_mr_share_cache *mr_cache);
 __rte_internal
 void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
@@ -179,4 +186,14 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr);
 __rte_internal
 void
 mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb);
+
+__rte_internal
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+__rte_internal
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+
 #endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index e5cb6b7060..d5e9635a14 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -149,4 +149,9 @@ INTERNAL {
 	mlx5_realloc;
 
 	mlx5_translate_port_name; # WINDOWS_NO_EXPORT
+
+	mlx5_mr_mempool_register;
+	mlx5_mr_mempool_unregister;
+	mlx5_mp_req_mempool_reg;
+	mlx5_mr_mempool2mr_bh;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH 4/4] net/mlx5: support mempool registration
  2021-08-18  9:07 [dpdk-dev] [PATCH 0/4] net/mlx5: implicit mempool registration Dmitry Kozlyuk
                   ` (2 preceding siblings ...)
  2021-08-18  9:07 ` [dpdk-dev] [PATCH 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
@ 2021-08-18  9:07 ` Dmitry Kozlyuk
  2021-09-29 14:52 ` [dpdk-dev] [PATCH 0/4] net/mlx5: implicit " dkozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-08-18  9:07 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Shahaf Shuler, Viacheslav Ovsiienko

When the first port in a given protection domain (PD) starts,
install a mempool event callback for this PD and register all existing
mempools for it. When the last port in a PD closes,
remove the callback and unregister all mempools for this PD.

On the Tx slow path, i.e. when an MR key for the address of the buffer
to send is not in the local cache, first try to retrieve it from
the database of registered mempools. Direct and indirect mbufs are
supported, as well as externally attached mbufs from the MLX5 MPRQ
feature. Lookup in the database of non-mempool memory is used as
the last resort.
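
For illustration, a condensed sketch of this lookup order, simplified
from the mlx5_tx_mb2mr_bh() change below; the MPRQ external-buffer case
is dropped and the internal mlx5 headers are assumed:

/* Assumes the internal mlx5 headers (mlx5.h, mlx5_tx.h, mlx5_mr.h). */
static uint32_t
example_tx_slow_path_lkey(struct mlx5_txq_data *txq, struct mlx5_priv *priv,
			  struct mlx5_mr_ctrl *mr_ctrl, struct rte_mbuf *mb)
{
	uintptr_t addr = (uintptr_t)mb->buf_addr;

	if (priv->config.mr_mempool_reg_en && !RTE_MBUF_HAS_EXTBUF(mb)) {
		/* 1. Try the per-PD database of registered mempools. */
		uint32_t lkey = mlx5_mr_mempool2mr_bh(&priv->sh->share_cache,
						      mr_ctrl, mlx5_mb2mp(mb),
						      addr);

		if (lkey != UINT32_MAX)
			return lkey;
	}
	/* 2. Last resort: generic lookup covering non-mempool memory. */
	return mlx5_tx_addr2mr_bh(txq, addr);
}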

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  11 +++
 doc/guides/rel_notes/release_21_11.rst |   6 ++
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 +++++++++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/linux/mlx5_os.h       |   2 +
 drivers/net/mlx5/mlx5.c                | 128 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h                |  13 +++
 drivers/net/mlx5/mlx5_mr.c             |  27 ++++++
 drivers/net/mlx5/mlx5_trigger.c        |  10 +-
 9 files changed, 241 insertions(+), 4 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index bae73f42d8..58d1c5b65c 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1001,6 +1001,17 @@ Driver options
 
   Enabled by default.
 
+- ``mr_mempool_reg_en`` parameter [int]
+
+  A nonzero value enables implicit registration of DMA memory of all mempools
+  except those having ``MEMPOOL_F_NON_IO``. The effect is that when a packet
+  from a mempool is transmitted, its memory is already registered for DMA
+  in the PMD and no registration will happen on the data path. The tradeoff is
+  extra work on the creation of each mempool and increased HW resource use
+  if some mempools are not used with MLX5 devices.
+
+  Enabled by default.
+
 - ``representor`` parameter [list]
 
   This parameter can be used to instantiate DPDK Ethernet devices from
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index dc9b98b862..0a2f80aa1b 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -55,6 +55,12 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Updated Mellanox mlx5 driver.**
+
+  Updated the Mellanox mlx5 driver with new features and improvements, including:
+
+  * Added implicit mempool registration to avoid data path hiccups (opt-out).
+
 
 Removed Items
 -------------
diff --git a/drivers/net/mlx5/linux/mlx5_mp_os.c b/drivers/net/mlx5/linux/mlx5_mp_os.c
index 3a4aa766f8..d2ac375a47 100644
--- a/drivers/net/mlx5/linux/mlx5_mp_os.c
+++ b/drivers/net/mlx5/linux/mlx5_mp_os.c
@@ -20,6 +20,45 @@
 #include "mlx5_tx.h"
 #include "mlx5_utils.h"
 
+/**
+ * Handle a port-agnostic message.
+ *
+ * @return
+ *   0 on success, 1 when message is not port-agnostic, (-1) on error.
+ */
+static int
+mlx5_mp_os_handle_port_agnostic(const struct rte_mp_msg *mp_msg,
+				const void *peer)
+{
+	struct rte_mp_msg mp_res;
+	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
+	const struct mlx5_mp_param *param =
+		(const struct mlx5_mp_param *)mp_msg->param;
+	const struct mlx5_mp_arg_mempool_reg *mpr;
+	struct mlx5_mp_id mp_id;
+
+	switch (param->type) {
+	case MLX5_MP_REQ_MEMPOOL_REGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_register(mpr->share_cache,
+						       mpr->pd, mpr->mempool,
+						       NULL);
+		return rte_mp_reply(&mp_res, peer);
+	case MLX5_MP_REQ_MEMPOOL_UNREGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_unregister(mpr->share_cache,
+							 mpr->mempool, NULL);
+		return rte_mp_reply(&mp_res, peer);
+	default:
+		return 1;
+	}
+	return -1;
+}
+
 int
 mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
@@ -34,6 +73,11 @@ mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/* Port-agnostic messages. */
+	ret = mlx5_mp_os_handle_port_agnostic(mp_msg, peer);
+	if (ret <= 0)
+		return ret;
+	/* Port-specific messages. */
 	if (!rte_eth_dev_is_valid_port(param->port_id)) {
 		rte_errno = ENODEV;
 		DRV_LOG(ERR, "port %u invalid port ID", param->port_id);
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 5f8766aa48..7dceadb6cc 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1034,8 +1034,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
-		mp_id.port_id = eth_dev->data->port_id;
-		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+		mlx5_mp_id_init(&mp_id, eth_dev->data->port_id);
 		/* Receive command fd from primary process */
 		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
@@ -2136,6 +2135,7 @@ mlx5_os_config_default(struct mlx5_dev_config *config)
 	config->txqs_inline = MLX5_ARG_UNSET;
 	config->vf_nl_en = 1;
 	config->mr_ext_memseg_en = 1;
+	config->mr_mempool_reg_en = 1;
 	config->mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	config->mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	config->dv_esw_en = 1;
diff --git a/drivers/net/mlx5/linux/mlx5_os.h b/drivers/net/mlx5/linux/mlx5_os.h
index 2991d37df2..eb7e1dd3c6 100644
--- a/drivers/net/mlx5/linux/mlx5_os.h
+++ b/drivers/net/mlx5/linux/mlx5_os.h
@@ -20,5 +20,7 @@ enum {
 #define MLX5_NAMESIZE IF_NAMESIZE
 
 int mlx5_auxiliary_get_ifindex(const char *sf_name);
+void mlx5_mempool_event_cb(enum rte_mempool_event event,
+			   struct rte_mempool *mp, void *arg);
 
 #endif /* RTE_PMD_MLX5_OS_H_ */
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f84e061fe7..d0bc7c7007 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -178,6 +178,9 @@
 /* Device parameter to configure allow or prevent duplicate rules pattern. */
 #define MLX5_ALLOW_DUPLICATE_PATTERN "allow_duplicate_pattern"
 
+/* Device parameter to configure implicit registration of mempool memory. */
+#define MLX5_MR_MEMPOOL_REG_EN "mr_mempool_reg_en"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1085,6 +1088,120 @@ mlx5_alloc_rxtx_uars(struct mlx5_dev_ctx_shared *sh,
 	return err;
 }
 
+/**
+ * Register the mempool for the protection domain.
+ *
+ * @param sh
+ *   Pointer to the device shared context.
+ * @param mp
+ *   Mempool being registered.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_register(struct mlx5_dev_ctx_shared *sh,
+				     struct rte_mempool *mp)
+{
+	struct mlx5_mp_id mp_id;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	if (mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp, &mp_id) < 0)
+		DRV_LOG(ERR, "Failed to register mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * Unregister the mempool from the protection domain.
+ *
+ * @param sh
+ *   Pointer to the device shared context.
+ * @param mp
+ *   Mempool being unregistered.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister(struct mlx5_dev_ctx_shared *sh,
+				       struct rte_mempool *mp)
+{
+	struct mlx5_mp_id mp_id;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	if (mlx5_mr_mempool_unregister(&sh->share_cache, mp, &mp_id) < 0)
+		DRV_LOG(WARNING, "Failed to unregister mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to register mempools
+ * for the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_register_cb(struct rte_mempool *mp, void *arg)
+{
+	mlx5_dev_ctx_shared_mempool_register
+				((struct mlx5_dev_ctx_shared *)arg, mp);
+}
+
+/**
+ * rte_mempool_walk() callback to unregister mempools
+ * from the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister_cb(struct rte_mempool *mp, void *arg)
+{
+	mlx5_dev_ctx_shared_mempool_unregister
+				((struct mlx5_dev_ctx_shared *)arg, mp);
+}
+
+/**
+ * Mempool life cycle callback for Ethernet devices.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   Associated mempool.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_event_cb(enum rte_mempool_event event,
+				     struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+
+	switch (event) {
+	case RTE_MEMPOOL_EVENT_CREATE:
+		mlx5_dev_ctx_shared_mempool_register(sh, mp);
+		break;
+	case RTE_MEMPOOL_EVENT_DESTROY:
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+		break;
+	}
+}
+
+int
+mlx5_dev_ctx_shared_mempool_subscribe(struct mlx5_dev_ctx_shared *sh)
+{
+	int ret;
+
+	/* Callback for this shared context may be already registered. */
+	ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret != 0 && rte_errno != EEXIST)
+		return ret;
+	/* Register mempools only once for this shared context. */
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_register_cb, sh);
+	return 0;
+}
+
 /**
  * Allocate shared device context. If there is multiport device the
  * master and representors will share this context, if there is single
@@ -1282,6 +1399,8 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 void
 mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 {
+	int ret;
+
 	pthread_mutex_lock(&mlx5_dev_ctx_list_mutex);
 #ifdef RTE_LIBRTE_MLX5_DEBUG
 	/* Check the object presence in the list. */
@@ -1302,6 +1421,12 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	if (--sh->refcnt)
 		goto exit;
+	/* Stop watching for mempool events and unregister all mempools. */
+	ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret == 0 || rte_errno != ENOENT)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_unregister_cb,
+				 sh);
 	/* Remove from memory callback device list. */
 	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
 	LIST_REMOVE(sh, mem_event_cb);
@@ -1991,6 +2116,8 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
 		config->decap_en = !!tmp;
 	} else if (strcmp(MLX5_ALLOW_DUPLICATE_PATTERN, key) == 0) {
 		config->allow_duplicate_pattern = !!tmp;
+	} else if (strcmp(MLX5_MR_MEMPOOL_REG_EN, key) == 0) {
+		config->mr_mempool_reg_en = !!tmp;
 	} else {
 		DRV_LOG(WARNING, "%s: unknown parameter", key);
 		rte_errno = EINVAL;
@@ -2051,6 +2178,7 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
 		MLX5_SYS_MEM_EN,
 		MLX5_DECAP_EN,
 		MLX5_ALLOW_DUPLICATE_PATTERN,
+		MLX5_MR_MEMPOOL_REG_EN,
 		NULL,
 	};
 	struct rte_kvargs *kvlist;
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index e02714e231..1f6944ba9a 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -155,6 +155,13 @@ struct mlx5_flow_dump_ack {
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
+/** Initialize a multi-process ID. */
+static inline void
+mlx5_mp_id_init(struct mlx5_mp_id *mp_id, uint16_t port_id)
+{
+	mp_id->port_id = port_id;
+	strlcpy(mp_id->name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+}
 
 LIST_HEAD(mlx5_dev_list, mlx5_dev_ctx_shared);
 
@@ -175,6 +182,9 @@ struct mlx5_local_data {
 
 extern struct mlx5_shared_data *mlx5_shared_data;
 
+/* Exposed to copy into the shared data in OS-specific module. */
+extern int mlx5_net_mempool_slot;
+
 /* Dev ops structs */
 extern const struct eth_dev_ops mlx5_dev_ops;
 extern const struct eth_dev_ops mlx5_dev_sec_ops;
@@ -270,6 +280,8 @@ struct mlx5_dev_config {
 	unsigned int dv_miss_info:1; /* restore packet after partial hw miss */
 	unsigned int allow_duplicate_pattern:1;
 	/* Allow/Prevent the duplicate rules pattern. */
+	unsigned int mr_mempool_reg_en:1;
+	/* Allow/prevent implicit mempool memory registration. */
 	struct {
 		unsigned int enabled:1; /* Whether MPRQ is enabled. */
 		unsigned int stride_num_n; /* Number of strides. */
@@ -1497,6 +1509,7 @@ struct mlx5_dev_ctx_shared *
 mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 			   const struct mlx5_dev_config *config);
 void mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh);
+int mlx5_dev_ctx_shared_mempool_subscribe(struct mlx5_dev_ctx_shared *sh);
 void mlx5_free_table_hash_list(struct mlx5_priv *priv);
 int mlx5_alloc_table_hash_list(struct mlx5_priv *priv);
 void mlx5_set_min_inline(struct mlx5_dev_spawn_data *spawn,
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 44afda731f..1cd7d4ced0 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -128,9 +128,36 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 uint32_t
 mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 {
+	struct mlx5_txq_ctrl *txq_ctrl =
+		container_of(txq, struct mlx5_txq_ctrl, txq);
+	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
+	struct mlx5_priv *priv = txq_ctrl->priv;
 	uintptr_t addr = (uintptr_t)mb->buf_addr;
 	uint32_t lkey;
 
+	if (priv->config.mr_mempool_reg_en) {
+		struct rte_mempool *mp = NULL;
+		struct mlx5_mprq_buf *buf;
+
+		if (!RTE_MBUF_HAS_EXTBUF(mb)) {
+			mp = mlx5_mb2mp(mb);
+		} else if (mb->shinfo->free_cb == mlx5_mprq_buf_free_cb) {
+			/* Recover MPRQ mempool. */
+			buf = mb->shinfo->fcb_opaque;
+			mp = buf->mp;
+		}
+		if (mp != NULL) {
+			lkey = mlx5_mr_mempool2mr_bh(&priv->sh->share_cache,
+						     mr_ctrl, mp, addr);
+			/*
+			 * Lookup can only fail on invalid input, e.g. "addr"
+			 * is not from "mp" or "mp" has MEMPOOL_F_NON_IO set.
+			 */
+			if (lkey != UINT32_MAX)
+				return lkey;
+		}
+		/* Fallback for generic mechanism in corner cases. */
+	}
 	lkey = mlx5_tx_addr2mr_bh(txq, addr);
 	if (lkey == UINT32_MAX && rte_errno == ENXIO) {
 		/* Mempool may have externally allocated memory. */
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 54173bfacb..6a027f87bf 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -1124,6 +1124,13 @@ mlx5_dev_start(struct rte_eth_dev *dev)
 			dev->data->port_id, strerror(rte_errno));
 		goto error;
 	}
+	if (priv->config.mr_mempool_reg_en) {
+		if (mlx5_dev_ctx_shared_mempool_subscribe(priv->sh) != 0) {
+			DRV_LOG(ERR, "port %u failed to subscribe for mempool life cycle: %s",
+				dev->data->port_id, rte_strerror(rte_errno));
+			goto error;
+		}
+	}
 	rte_wmb();
 	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
 	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
@@ -1193,11 +1200,10 @@ mlx5_dev_stop(struct rte_eth_dev *dev)
 	if (priv->obj_ops.lb_dummy_queue_release)
 		priv->obj_ops.lb_dummy_queue_release(dev);
 	mlx5_txpp_stop(dev);
-
 	return 0;
 }
 
-/**
+/*
  * Enable traffic flows configured by control plane
  *
  * @param dev
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH 0/4] net/mlx5: implicit mempool registration
  2021-08-18  9:07 [dpdk-dev] [PATCH 0/4] net/mlx5: implicit mempool registration Dmitry Kozlyuk
                   ` (3 preceding siblings ...)
  2021-08-18  9:07 ` [dpdk-dev] [PATCH 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
@ 2021-09-29 14:52 ` dkozlyuk
  2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 1/4] mempool: add event callbacks dkozlyuk
                     ` (4 more replies)
  4 siblings, 5 replies; 82+ messages in thread
From: dkozlyuk @ 2021-09-29 14:52 UTC (permalink / raw)
  To: dev; +Cc: Dmitry Kozlyuk

From: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>

MLX5 hardware has an internal IOMMU where the PMD registers the memory.
On the data path, the PMD translates a VA into a key consumed by the device
IOMMU. It is impractical for the PMD to register all allocated memory
because of the increased lookup cost both in HW and SW. Most often mbuf
memory comes from mempools, so if the PMD tracks them, it can almost always
have mbuf memory registered before an mbuf hits the PMD. This patchset
adds such tracking in the PMD and an internal API to support it.

Please see [1] for a more thorough explanation of the patch 2/4
and how it can be useful outside of the MLX5 PMD.

[1]: http://inbox.dpdk.org/dev/CH0PR12MB509112FADB778AB28AF3771DB9F99@CH0PR12MB5091.namprd12.prod.outlook.com/

v2 (internal review and testing):
    1. Change tracked mempool event from being created (CREATE) to being
       fully populated (READY), which is the state PMD is interested in.
    2. Unit test the new mempool callback API.
    3. Remove bogus "error" messages in normal conditions.
    4. Fixes in PMD.

Dmitry Kozlyuk (4):
  mempool: add event callbacks
  mempool: add non-IO flag
  common/mlx5: add mempool registration facilities
  net/mlx5: support mempool registration

 app/test/test_mempool.c                |  75 ++++
 doc/guides/nics/mlx5.rst               |  11 +
 doc/guides/rel_notes/release_21_11.rst |   9 +
 drivers/common/mlx5/mlx5_common_mp.c   |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h   |  14 +
 drivers/common/mlx5/mlx5_common_mr.c   | 580 +++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h   |  17 +
 drivers/common/mlx5/version.map        |   5 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 ++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++
 drivers/net/mlx5/mlx5.h                |  10 +
 drivers/net/mlx5/mlx5_mr.c             | 120 ++---
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 +-
 drivers/net/mlx5/mlx5_rxq.c            |  13 +
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++-
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 lib/mempool/rte_mempool.c              | 143 +++++-
 lib/mempool/rte_mempool.h              |  60 +++
 lib/mempool/version.map                |   8 +
 21 files changed, 1297 insertions(+), 119 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v2 1/4] mempool: add event callbacks
  2021-09-29 14:52 ` [dpdk-dev] [PATCH 0/4] net/mlx5: implicit " dkozlyuk
@ 2021-09-29 14:52   ` dkozlyuk
  2021-10-05 16:34     ` Thomas Monjalon
  2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 2/4] mempool: add non-IO flag dkozlyuk
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 82+ messages in thread
From: dkozlyuk @ 2021-09-29 14:52 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Kozlyuk, Matan Azrad, Olivier Matz, Andrew Rybchenko,
	Ray Kinsella, Anatoly Burakov

From: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>

Performance of MLX5 PMDs of different classes can benefit if a PMD knows
which memory it will need to handle in advance, before the first mbuf
is sent to the PMD. It is impractical, however, to consider
all allocated memory for this purpose. Most often mbuf memory comes
from mempools that can come and go. A PMD can enumerate existing mempools
on device start, but it also needs to track creation and destruction
of mempools after forwarding starts but before an mbuf from the new
mempool is sent to the device.

Add an internal API to register a callback for mempool life cycle events,
currently RTE_MEMPOOL_EVENT_READY (after populating)
and RTE_MEMPOOL_EVENT_DESTROY (before freeing):
* rte_mempool_event_callback_register()
* rte_mempool_event_callback_unregister()
Provide a unit test for the new API.
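
For illustration, a minimal sketch of a driver-side consumer of this
internal API (the my_drv_* names and the logging are assumptions made
for the example, not part of this patch; as an internal API it is only
available to in-tree drivers):

    #include <stdint.h>
    #include <stdio.h>
    #include <rte_errno.h>
    #include <rte_mempool.h>

    /* Hypothetical driver context, for illustration only. */
    struct my_drv_ctx { uint16_t port_id; };

    static void
    my_drv_mempool_event(enum rte_mempool_event event,
                         struct rte_mempool *mp, void *arg)
    {
        struct my_drv_ctx *ctx = arg;

        if (event == RTE_MEMPOOL_EVENT_READY)
            printf("port %u: mempool %s fully populated\n",
                   ctx->port_id, mp->name);
        else /* RTE_MEMPOOL_EVENT_DESTROY */
            printf("port %u: mempool %s about to be freed\n",
                   ctx->port_id, mp->name);
    }

    static int
    my_drv_subscribe(struct my_drv_ctx *ctx)
    {
        /* EEXIST means this (callback, arg) pair is already registered. */
        if (rte_mempool_event_callback_register(my_drv_mempool_event,
                                                ctx) != 0 &&
            rte_errno != EEXIST)
            return -rte_errno;
        return 0;
    }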

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 app/test/test_mempool.c   |  75 ++++++++++++++++++++
 lib/mempool/rte_mempool.c | 143 +++++++++++++++++++++++++++++++++++++-
 lib/mempool/rte_mempool.h |  56 +++++++++++++++
 lib/mempool/version.map   |   8 +++
 4 files changed, 279 insertions(+), 3 deletions(-)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index 7675a3e605..0c4ed7c60b 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -14,6 +14,7 @@
 #include <rte_common.h>
 #include <rte_log.h>
 #include <rte_debug.h>
+#include <rte_errno.h>
 #include <rte_memory.h>
 #include <rte_launch.h>
 #include <rte_cycles.h>
@@ -471,6 +472,74 @@ test_mp_mem_init(struct rte_mempool *mp,
 	data->ret = 0;
 }
 
+struct test_mempool_events_data {
+	struct rte_mempool *mp;
+	enum rte_mempool_event event;
+	bool invoked;
+};
+
+static void
+test_mempool_events_cb(enum rte_mempool_event event,
+		       struct rte_mempool *mp, void *arg)
+{
+	struct test_mempool_events_data *data = arg;
+
+	data->mp = mp;
+	data->event = event;
+	data->invoked = true;
+}
+
+static int
+test_mempool_events(int (*populate)(struct rte_mempool *mp))
+{
+	struct test_mempool_events_data data;
+	struct rte_mempool *mp;
+	int ret;
+
+	ret = rte_mempool_event_callback_register(NULL, &data);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Registered a NULL callback");
+
+	memset(&data, 0, sizeof(data));
+	ret = rte_mempool_event_callback_register(test_mempool_events_cb,
+						  &data);
+	RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to register the callback: %s",
+			      rte_strerror(rte_errno));
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create an empty mempool: %s",
+				 rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data.invoked, false,
+			      "Callback invoked on an empty mempool creation");
+
+	rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate the mempool: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data.invoked, true,
+			      "Callback not invoked on an empty mempool population");
+	RTE_TEST_ASSERT_EQUAL(data.event, RTE_MEMPOOL_EVENT_READY,
+			      "Wrong callback invoked, expected READY");
+	RTE_TEST_ASSERT_EQUAL(data.mp, mp,
+			      "Callback invoked for a wrong mempool");
+
+	memset(&data, 0, sizeof(data));
+	rte_mempool_free(mp);
+	RTE_TEST_ASSERT_EQUAL(data.invoked, true,
+			      "Callback not invoked on mempool destruction");
+	RTE_TEST_ASSERT_EQUAL(data.event, RTE_MEMPOOL_EVENT_DESTROY,
+			      "Wrong callback invoked, expected DESTROY");
+	RTE_TEST_ASSERT_EQUAL(data.mp, mp,
+			      "Callback invoked for a wrong mempool");
+
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    &data);
+	RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister the callback: %s",
+			      rte_strerror(rte_errno));
+	return 0;
+}
+
 static int
 test_mempool(void)
 {
@@ -645,6 +714,12 @@ test_mempool(void)
 	if (test_mempool_basic(default_pool, 1) < 0)
 		GOTO_ERR(ret, err);
 
+	/* test mempool event callbacks */
+	if (test_mempool_events(rte_mempool_populate_default) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events(rte_mempool_populate_anon) < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 59a588425b..c6cb99ba48 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -42,6 +42,18 @@ static struct rte_tailq_elem rte_mempool_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_mempool_tailq)
 
+TAILQ_HEAD(mempool_callback_list, rte_tailq_entry);
+
+static struct rte_tailq_elem callback_tailq = {
+	.name = "RTE_MEMPOOL_CALLBACK",
+};
+EAL_REGISTER_TAILQ(callback_tailq)
+
+/* Invoke all registered mempool event callbacks. */
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp);
+
 #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
 #define CALC_CACHE_FLUSHTHRESH(c)	\
 	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
@@ -360,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* Report the mempool as ready only when fully populated. */
+	if (mp->populated_size >= mp->size)
+		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
+
 	rte_mempool_trace_populate_iova(mp, vaddr, iova, len, free_cb, opaque);
 	return i;
 
@@ -722,6 +738,7 @@ rte_mempool_free(struct rte_mempool *mp)
 	}
 	rte_mcfg_tailq_write_unlock();
 
+	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
 	rte_mempool_trace_free(mp);
 	rte_mempool_free_memchunks(mp);
 	rte_mempool_ops_free(mp);
@@ -779,9 +796,9 @@ rte_mempool_cache_free(struct rte_mempool_cache *cache)
 
 /* create an empty mempool */
 struct rte_mempool *
-rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
-	unsigned cache_size, unsigned private_data_size,
-	int socket_id, unsigned flags)
+rte_mempool_create_empty(const char *name, unsigned int n,
+	unsigned int elt_size, unsigned int cache_size,
+	unsigned int private_data_size, int socket_id, unsigned int flags)
 {
 	char mz_name[RTE_MEMZONE_NAMESIZE];
 	struct rte_mempool_list *mempool_list;
@@ -1343,3 +1360,123 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
 
 	rte_mcfg_mempool_read_unlock();
 }
+
+struct mempool_callback {
+	rte_mempool_event_callback *func;
+	void *arg;
+};
+
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te;
+	void *tmp_te;
+
+	rte_mcfg_tailq_read_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback *cb = te->data;
+		rte_mcfg_tailq_read_unlock();
+		cb->func(event, mp, cb->arg);
+		rte_mcfg_tailq_read_lock();
+	}
+	rte_mcfg_tailq_read_unlock();
+}
+
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *arg)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback *cb;
+	void *tmp_te;
+	int ret;
+
+	if (func == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	rte_mcfg_mempool_read_lock();
+	rte_mcfg_tailq_write_lock();
+
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback *cb =
+					(struct mempool_callback *)te->data;
+		if (cb->func == func && cb->arg == arg) {
+			ret = -EEXIST;
+			goto exit;
+		}
+	}
+
+	te = rte_zmalloc("MEMPOOL_TAILQ_ENTRY", sizeof(*te), 0);
+	if (te == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback tailq entry!\n");
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb = rte_malloc("MEMPOOL_EVENT_CALLBACK", sizeof(*cb), 0);
+	if (cb == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback!\n");
+		rte_free(te);
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb->func = func;
+	cb->arg = arg;
+	te->data = cb;
+	TAILQ_INSERT_TAIL(list, te, next);
+	ret = 0;
+
+exit:
+	rte_mcfg_tailq_write_unlock();
+	rte_mcfg_mempool_read_unlock();
+	rte_errno = -ret;
+	return ret;
+}
+
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *arg)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback *cb;
+	int ret;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		rte_errno = EPERM;
+		return -1;
+	}
+
+	rte_mcfg_mempool_read_lock();
+	rte_mcfg_tailq_write_lock();
+	ret = -ENOENT;
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH(te, list, next) {
+		cb = (struct mempool_callback *)te->data;
+		if (cb->func == func && cb->arg == arg)
+			break;
+	}
+	if (te != NULL) {
+		TAILQ_REMOVE(list, te, next);
+		ret = 0;
+	}
+	rte_mcfg_tailq_write_unlock();
+	rte_mcfg_mempool_read_unlock();
+
+	if (ret == 0) {
+		rte_free(te);
+		rte_free(cb);
+	}
+	rte_errno = -ret;
+	return ret;
+}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 4235d6f0bf..c81e488851 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1775,6 +1775,62 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
 int
 rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
 
+/**
+ * Mempool event type.
+ * @internal
+ */
+enum rte_mempool_event {
+	/** Occurs after a mempool is successfully populated. */
+	RTE_MEMPOOL_EVENT_READY = 0,
+	/** Occurs before destruction of a mempool begins. */
+	RTE_MEMPOOL_EVENT_DESTROY = 1,
+};
+
+/**
+ * @internal
+ * Mempool event callback.
+ */
+typedef void (rte_mempool_event_callback)(
+		enum rte_mempool_event event,
+		struct rte_mempool *mp,
+		void *arg);
+
+/**
+ * @internal
+ * Register a callback invoked on mempool life cycle event.
+ * Callbacks will be invoked in the process that creates the mempool.
+ *
+ * @param cb
+ *   Callback function.
+ * @param cb_arg
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *cb,
+				    void *cb_arg);
+
+/**
+ * @internal
+ * Unregister a callback added with rte_mempool_event_callback_register().
+ * @p cb and @p cb_arg must exactly match registration parameters.
+ *
+ * @param cb
+ *   Callback function.
+ * @param cb_arg
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *cb,
+				      void *cb_arg);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/mempool/version.map b/lib/mempool/version.map
index 9f77da6fff..1b7d7c5456 100644
--- a/lib/mempool/version.map
+++ b/lib/mempool/version.map
@@ -64,3 +64,11 @@ EXPERIMENTAL {
 	__rte_mempool_trace_ops_free;
 	__rte_mempool_trace_set_ops_byname;
 };
+
+INTERNAL {
+	global:
+
+	# added in 21.11
+	rte_mempool_event_callback_register;
+	rte_mempool_event_callback_unregister;
+};
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v2 2/4] mempool: add non-IO flag
  2021-09-29 14:52 ` [dpdk-dev] [PATCH 0/4] net/mlx5: implicit " dkozlyuk
  2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 1/4] mempool: add event callbacks dkozlyuk
@ 2021-09-29 14:52   ` dkozlyuk
  2021-10-05 16:39     ` Thomas Monjalon
  2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 3/4] common/mlx5: add mempool registration facilities dkozlyuk
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 82+ messages in thread
From: dkozlyuk @ 2021-09-29 14:52 UTC (permalink / raw)
  To: dev; +Cc: Dmitry Kozlyuk, Matan Azrad, Olivier Matz, Andrew Rybchenko

From: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>

Mempool is a generic allocator that is not necessarily used for device
IO operations, so its memory is not necessarily used for DMA.
Add the MEMPOOL_F_NON_IO flag to mark such mempools.
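
As an illustration, a pool of control-plane objects that never reach
a device could be created with this flag; the pool name and sizes below
are arbitrary examples, not part of this patch:

    #include <rte_mempool.h>

    /* Sketch: objects from this pool are never used for DMA. */
    static struct rte_mempool *
    create_non_io_pool(void)
    {
        return rte_mempool_create("ctrl_objs", 1024, 128,
                                  0, 0,       /* no cache, no private data */
                                  NULL, NULL, /* no pool constructor */
                                  NULL, NULL, /* no object initializer */
                                  SOCKET_ID_ANY, MEMPOOL_F_NON_IO);
    }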

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/rel_notes/release_21_11.rst | 3 +++
 lib/mempool/rte_mempool.h              | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index f85dc99c8b..873beda633 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -155,6 +155,9 @@ API Changes
   the crypto/security operation. This field will be used to communicate
   events such as soft expiry with IPsec in lookaside mode.
 
+* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
+  that objects from this pool will not be used for device IO (e.g. DMA).
+
 
 ABI Changes
 -----------
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index c81e488851..4d18957d6d 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -263,6 +263,7 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
+#define MEMPOOL_F_NON_IO         0x0040 /**< Not used for device IO (DMA). */
 
 /**
  * @internal When debug is enabled, store some statistics.
@@ -992,6 +993,9 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, void *);
  *     "single-consumer". Otherwise, it is "multi-consumers".
  *   - MEMPOOL_F_NO_IOVA_CONTIG: If set, allocated objects won't
  *     necessarily be contiguous in IO memory.
+ *   - MEMPOOL_F_NON_IO: If set, the mempool is considered to be
+ *     never used for device IO, i.e. DMA operations,
+ *     which may affect some PMD behavior.
  * @return
  *   The pointer to the new allocated mempool, on success. NULL on error
  *   with rte_errno set appropriately. Possible rte_errno values include:
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v2 3/4] common/mlx5: add mempool registration facilities
  2021-09-29 14:52 ` [dpdk-dev] [PATCH 0/4] net/mlx5: implicit " dkozlyuk
  2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 1/4] mempool: add event callbacks dkozlyuk
  2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 2/4] mempool: add non-IO flag dkozlyuk
@ 2021-09-29 14:52   ` dkozlyuk
  2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 4/4] net/mlx5: support mempool registration dkozlyuk
  2021-10-12  0:04   ` [dpdk-dev] [PATCH v3 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: dkozlyuk @ 2021-09-29 14:52 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Kozlyuk, Matan Azrad, Viacheslav Ovsiienko, Ray Kinsella,
	Anatoly Burakov

From: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>

Add an internal API to register mempools, that is, to create memory
regions (MR) for their memory and store them in a separate database.
The implementation handles multi-process coordination, so that class
drivers don't need to. Each protection domain has its own database.
Memory regions can be shared within a database if they represent
a single hugepage covering one or more mempools entirely.

Add an internal API to look up an MR key for an address that belongs
to a known mempool. It is the responsibility of a class driver
to extract the mempool from an mbuf.
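
For illustration, a rough sketch of how a class driver could use these
facilities. The "drv_ctx" structure and its fields stand in for whatever
per-device state such a driver keeps (shared MR cache, PD) and are
assumptions made for the example, as is the direct use of mbuf->pool,
which covers only plain direct mbufs:

    #include <rte_mbuf.h>
    #include <mlx5_common_mr.h>

    /* Hypothetical per-device context of a class driver. */
    struct drv_ctx {
        struct mlx5_mr_share_cache share_cache;
        void *pd;
    };

    /* Register a mempool; in the primary process mp_id may be NULL. */
    static int
    drv_register_pool(struct drv_ctx *ctx, struct rte_mempool *mp)
    {
        return mlx5_mr_mempool_register(&ctx->share_cache, ctx->pd,
                                        mp, NULL);
    }

    /* Slow-path lookup of an MR key for a buffer from a known mempool. */
    static uint32_t
    drv_mb2lkey(struct drv_ctx *ctx, struct mlx5_mr_ctrl *mr_ctrl,
                struct rte_mbuf *mb)
    {
        return mlx5_mr_mempool2mr_bh(&ctx->share_cache, mr_ctrl,
                                     mb->pool, (uintptr_t)mb->buf_addr);
    }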

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 drivers/common/mlx5/mlx5_common_mp.c |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h |  14 +
 drivers/common/mlx5/mlx5_common_mr.c | 580 +++++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h |  17 +
 drivers/common/mlx5/version.map      |   5 +
 5 files changed, 666 insertions(+)

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
index 673a7c31de..6dfc5535e0 100644
--- a/drivers/common/mlx5/mlx5_common_mp.c
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -54,6 +54,56 @@ mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
 	return ret;
 }
 
+/**
+ * @param mp_id
+ *   ID of the MP process.
+ * @param share_cache
+ *   Shared MR cache.
+ * @param pd
+ *   Protection domain.
+ * @param mempool
+ *   Mempool to register or unregister.
+ * @param reg
+ *   True to register the mempool, False to unregister.
+ */
+int
+mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_arg_mempool_reg *arg = &req->args.mempool_reg;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	enum mlx5_mp_req_type type;
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	type = reg ? MLX5_MP_REQ_MEMPOOL_REGISTER :
+		     MLX5_MP_REQ_MEMPOOL_UNREGISTER;
+	mp_init_msg(mp_id, &mp_req, type);
+	arg->share_cache = share_cache;
+	arg->pd = pd;
+	arg->mempool = mempool;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	mlx5_free(mp_rep.msgs);
+	return ret;
+}
+
 /**
  * Request Verbs queue state modification to the primary process.
  *
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
index 6829141fc7..527bf3cad8 100644
--- a/drivers/common/mlx5/mlx5_common_mp.h
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -14,6 +14,8 @@
 enum mlx5_mp_req_type {
 	MLX5_MP_REQ_VERBS_CMD_FD = 1,
 	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_MEMPOOL_REGISTER,
+	MLX5_MP_REQ_MEMPOOL_UNREGISTER,
 	MLX5_MP_REQ_START_RXTX,
 	MLX5_MP_REQ_STOP_RXTX,
 	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
@@ -33,6 +35,12 @@ struct mlx5_mp_arg_queue_id {
 	uint16_t queue_id; /* DPDK queue ID. */
 };
 
+struct mlx5_mp_arg_mempool_reg {
+	struct mlx5_mr_share_cache *share_cache;
+	void *pd; /* NULL for MLX5_MP_REQ_MEMPOOL_UNREGISTER */
+	struct rte_mempool *mempool;
+};
+
 /* Pameters for IPC. */
 struct mlx5_mp_param {
 	enum mlx5_mp_req_type type;
@@ -41,6 +49,8 @@ struct mlx5_mp_param {
 	RTE_STD_C11
 	union {
 		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_mempool_reg mempool_reg;
+		/* MLX5_MP_REQ_MEMPOOL_(UN)REGISTER */
 		struct mlx5_mp_arg_queue_state_modify state_modify;
 		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
 		struct mlx5_mp_arg_queue_id queue_id;
@@ -91,6 +101,10 @@ void mlx5_mp_uninit_secondary(const char *name);
 __rte_internal
 int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
 __rte_internal
+int mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg);
+__rte_internal
 int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
 				   struct mlx5_mp_arg_queue_state_modify *sm);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
index 98fe8698e2..2e039a4e70 100644
--- a/drivers/common/mlx5/mlx5_common_mr.c
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -2,7 +2,10 @@
  * Copyright 2016 6WIND S.A.
  * Copyright 2020 Mellanox Technologies, Ltd
  */
+#include <stddef.h>
+
 #include <rte_eal_memconfig.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_mempool.h>
 #include <rte_malloc.h>
@@ -21,6 +24,29 @@ struct mr_find_contig_memsegs_data {
 	const struct rte_memseg_list *msl;
 };
 
+/* Virtual memory range. */
+struct mlx5_range {
+	uintptr_t start;
+	uintptr_t end;
+};
+
+/** Memory region for a mempool. */
+struct mlx5_mempool_mr {
+	struct mlx5_pmd_mr pmd_mr;
+	uint32_t refcnt; /**< Number of mempools sharing this MR. */
+};
+
+/* Mempool registration. */
+struct mlx5_mempool_reg {
+	LIST_ENTRY(mlx5_mempool_reg) next;
+	/** Registered mempool, used to designate registrations. */
+	struct rte_mempool *mp;
+	/** Memory regions for the address ranges of the mempool. */
+	struct mlx5_mempool_mr *mrs;
+	/** Number of memory regions. */
+	unsigned int mrs_n;
+};
+
 /**
  * Expand B-tree table to a given size. Can't be called with holding
  * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
@@ -1191,3 +1217,557 @@ mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
 	rte_rwlock_read_unlock(&share_cache->rwlock);
 #endif
 }
+
+static int
+mlx5_range_compare_start(const void *lhs, const void *rhs)
+{
+	const struct mlx5_range *r1 = lhs, *r2 = rhs;
+
+	if (r1->start > r2->start)
+		return 1;
+	else if (r1->start < r2->start)
+		return -1;
+	return 0;
+}
+
+static void
+mlx5_range_from_mempool_chunk(struct rte_mempool *mp, void *opaque,
+			      struct rte_mempool_memhdr *memhdr,
+			      unsigned int idx)
+{
+	struct mlx5_range *ranges = opaque, *range = &ranges[idx];
+	uint64_t page_size = rte_mem_page_size();
+
+	RTE_SET_USED(mp);
+	range->start = RTE_ALIGN_FLOOR((uintptr_t)memhdr->addr, page_size);
+	range->end = RTE_ALIGN_CEIL(range->start + memhdr->len, page_size);
+}
+
+/**
+ * Get VA-contiguous ranges of the mempool memory.
+ * Each range start and end is aligned to the system page size.
+ *
+ * @param[in] mp
+ *   Analyzed mempool.
+ * @param[out] out
+ *   Receives the ranges, caller must release it with free().
+ * @param[out] out_n
+ *   Receives the number of @p out elements.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_get_mempool_ranges(struct rte_mempool *mp, struct mlx5_range **out,
+			unsigned int *out_n)
+{
+	struct mlx5_range *chunks;
+	unsigned int chunks_n = mp->nb_mem_chunks, contig_n, i;
+
+	/* Collect page-aligned memory ranges of the mempool. */
+	chunks = calloc(sizeof(chunks[0]), chunks_n);
+	if (chunks == NULL)
+		return -1;
+	rte_mempool_mem_iter(mp, mlx5_range_from_mempool_chunk, chunks);
+	/* Merge adjacent chunks and place them at the beginning. */
+	qsort(chunks, chunks_n, sizeof(chunks[0]), mlx5_range_compare_start);
+	contig_n = 1;
+	for (i = 1; i < chunks_n; i++)
+		if (chunks[i - 1].end != chunks[i].start) {
+			chunks[contig_n - 1].end = chunks[i - 1].end;
+			chunks[contig_n] = chunks[i];
+			contig_n++;
+		}
+	/* Extend the last contiguous chunk to the end of the mempool. */
+	chunks[contig_n - 1].end = chunks[i - 1].end;
+	*out = chunks;
+	*out_n = contig_n;
+	return 0;
+}
+
+/**
+ * Analyze mempool memory to select memory ranges to register.
+ *
+ * @param[in] mp
+ *   Mempool to analyze.
+ * @param[out] out
+ *   Receives memory ranges to register, aligned to the system page size.
+ *   The caller must release them with free().
+ * @param[out] out_n
+ *   Receives the number of @p out items.
+ * @param[out] share_hugepage
+ *   Receives True if the entire pool resides within a single hugepage.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_mempool_reg_analyze(struct rte_mempool *mp, struct mlx5_range **out,
+			 unsigned int *out_n, bool *share_hugepage)
+{
+	struct mlx5_range *ranges = NULL;
+	unsigned int i, ranges_n = 0;
+	struct rte_memseg_list *msl;
+
+	if (mlx5_get_mempool_ranges(mp, &ranges, &ranges_n) < 0) {
+		DRV_LOG(ERR, "Cannot get address ranges for mempool %s",
+			mp->name);
+		return -1;
+	}
+	/* Check if the hugepage of the pool can be shared. */
+	*share_hugepage = false;
+	msl = rte_mem_virt2memseg_list((void *)ranges[0].start);
+	if (msl != NULL) {
+		uint64_t hugepage_sz = 0;
+
+		/* Check that all ranges are on pages of the same size. */
+		for (i = 0; i < ranges_n; i++) {
+			if (hugepage_sz != 0 && hugepage_sz != msl->page_sz)
+				break;
+			hugepage_sz = msl->page_sz;
+		}
+		if (i == ranges_n) {
+			/*
+			 * If the entire pool is within one hugepage,
+			 * combine all ranges into one of the hugepage size.
+			 */
+			uintptr_t reg_start = ranges[0].start;
+			uintptr_t reg_end = ranges[ranges_n - 1].end;
+			uintptr_t hugepage_start =
+				RTE_ALIGN_FLOOR(reg_start, hugepage_sz);
+			uintptr_t hugepage_end = hugepage_start + hugepage_sz;
+			if (reg_end < hugepage_end) {
+				ranges[0].start = hugepage_start;
+				ranges[0].end = hugepage_end;
+				ranges_n = 1;
+				*share_hugepage = true;
+			}
+		}
+	}
+	*out = ranges;
+	*out_n = ranges_n;
+	return 0;
+}
+
+/** Create a registration object for the mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_create(struct rte_mempool *mp, unsigned int mrs_n)
+{
+	struct mlx5_mempool_reg *mpr = NULL;
+
+	mpr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+			  sizeof(*mpr) + mrs_n * sizeof(mpr->mrs[0]),
+			  RTE_CACHE_LINE_SIZE, SOCKET_ID_ANY);
+	if (mpr == NULL) {
+		DRV_LOG(ERR, "Cannot allocate mempool %s registration object",
+			mp->name);
+		return NULL;
+	}
+	mpr->mp = mp;
+	mpr->mrs = (struct mlx5_mempool_mr *)(mpr + 1);
+	mpr->mrs_n = mrs_n;
+	return mpr;
+}
+
+/**
+ * Destroy a mempool registration object.
+ *
+ * @param standalone
+ *   Whether @p mpr owns its MRs exclusively, i.e. they are not shared.
+ */
+static void
+mlx5_mempool_reg_destroy(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mempool_reg *mpr, bool standalone)
+{
+	if (standalone) {
+		unsigned int i;
+
+		for (i = 0; i < mpr->mrs_n; i++)
+			share_cache->dereg_mr_cb(&mpr->mrs[i].pmd_mr);
+	}
+	mlx5_free(mpr);
+}
+
+/** Find registration object of a mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_lookup(struct mlx5_mr_share_cache *share_cache,
+			struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp)
+			break;
+	return mpr;
+}
+
+/** Increment reference counters of MRs used in the registration. */
+static void
+mlx5_mempool_reg_attach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		__atomic_add_fetch(&mpr->mrs[i].refcnt, 1, __ATOMIC_RELAXED);
+}
+
+/**
+ * Decrement reference counters of MRs used in the registration.
+ *
+ * @return True if no more references to @p mpr MRs exist, False otherwise.
+ */
+static bool
+mlx5_mempool_reg_detach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+	bool ret = false;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		ret |= __atomic_sub_fetch(&mpr->mrs[i].refcnt, 1,
+					  __ATOMIC_RELAXED) == 0;
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache,
+				 void *pd, struct rte_mempool *mp)
+{
+	struct mlx5_range *ranges = NULL;
+	struct mlx5_mempool_reg *mpr, *new_mpr;
+	unsigned int i, ranges_n;
+	bool share_hugepage;
+	int ret = -1;
+
+	/* Early check to avoid unnecessary creation of MRs. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+	if (mlx5_mempool_reg_analyze(mp, &ranges, &ranges_n,
+				     &share_hugepage) < 0) {
+		DRV_LOG(ERR, "Cannot get mempool %s memory ranges", mp->name);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	new_mpr = mlx5_mempool_reg_create(mp, ranges_n);
+	if (new_mpr == NULL) {
+		DRV_LOG(ERR,
+			"Cannot create a registration object for mempool %s in PD %p",
+			mp->name, pd);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	/*
+	 * If the entire mempool fits in a single hugepage, the MR for this
+	 * hugepage can be shared across mempools that also fit in it.
+	 */
+	if (share_hugepage) {
+		rte_rwlock_write_lock(&share_cache->rwlock);
+		LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next) {
+			if (mpr->mrs[0].pmd_mr.addr == (void *)ranges[0].start)
+				break;
+		}
+		if (mpr != NULL) {
+			new_mpr->mrs = mpr->mrs;
+			mlx5_mempool_reg_attach(new_mpr);
+			LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+					 new_mpr, next);
+		}
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		if (mpr != NULL) {
+			DRV_LOG(DEBUG, "Shared MR %#x in PD %p for mempool %s with mempool %s",
+				mpr->mrs[0].pmd_mr.lkey, pd, mp->name,
+				mpr->mp->name);
+			ret = 0;
+			goto exit;
+		}
+	}
+	for (i = 0; i < ranges_n; i++) {
+		struct mlx5_mempool_mr *mr = &new_mpr->mrs[i];
+		const struct mlx5_range *range = &ranges[i];
+		size_t len = range->end - range->start;
+
+		if (share_cache->reg_mr_cb(pd, (void *)range->start, len,
+		    &mr->pmd_mr) < 0) {
+			DRV_LOG(ERR,
+				"Failed to create an MR in PD %p for address range "
+				"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+				pd, range->start, range->end, len, mp->name);
+			break;
+		}
+		DRV_LOG(DEBUG,
+			"Created a new MR %#x in PD %p for address range "
+			"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+			mr->pmd_mr.lkey, pd, range->start, range->end, len,
+			mp->name);
+	}
+	if (i != ranges_n) {
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EINVAL;
+		goto exit;
+	}
+	/* Concurrent registration is not supposed to happen. */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr == NULL) {
+		mlx5_mempool_reg_attach(new_mpr);
+		LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+				 new_mpr, next);
+		ret = 0;
+	}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+exit:
+	free(ranges);
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_secondary(struct mlx5_mr_share_cache *share_cache,
+				   void *pd, struct rte_mempool *mp,
+				   struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, pd, mp, true);
+}
+
+/**
+ * Register the memory of a mempool in the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param pd
+ *   Protection domain object.
+ * @param mp
+ *   Mempool to register.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_register_primary(share_cache, pd, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_register_secondary(share_cache, pd, mp,
+							  mp_id);
+	default:
+		return -1;
+	}
+}
+
+static int
+mlx5_mr_mempool_unregister_primary(struct mlx5_mr_share_cache *share_cache,
+				   struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+	bool standalone = false;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp) {
+			LIST_REMOVE(mpr, next);
+			standalone = mlx5_mempool_reg_detach(mpr);
+			if (standalone)
+				/*
+				 * The unlock operation below provides a memory
+				 * barrier due to its store-release semantics.
+				 */
+				++share_cache->dev_gen;
+			break;
+		}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr == NULL) {
+		rte_errno = ENOENT;
+		return -1;
+	}
+	mlx5_mempool_reg_destroy(share_cache, mpr, standalone);
+	return 0;
+}
+
+static int
+mlx5_mr_mempool_unregister_secondary(struct mlx5_mr_share_cache *share_cache,
+				     struct rte_mempool *mp,
+				     struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, NULL, mp, false);
+}
+
+/**
+ * Unregister the memory of a mempool from the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param mp
+ *   Mempool to unregister.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_unregister_primary(share_cache, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_unregister_secondary(share_cache, mp,
+							    mp_id);
+	default:
+		return -1;
+	}
+}
+
+/**
+ * Look up an MR key for an address in a registered mempool.
+ *
+ * @param mpr
+ *   Mempool registration object.
+ * @param addr
+ *   Address within the mempool.
+ * @param entry
+ *   Bottom-half cache entry to fill.
+ *
+ * @return
+ *   MR key or UINT32_MAX on failure, which can only happen
+ *   if the address is not from within the mempool.
+ */
+static uint32_t
+mlx5_mempool_reg_addr2mr(struct mlx5_mempool_reg *mpr, uintptr_t addr,
+			 struct mr_cache_entry *entry)
+{
+	uint32_t lkey = UINT32_MAX;
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++) {
+		const struct mlx5_pmd_mr *mr = &mpr->mrs[i].pmd_mr;
+		uintptr_t mr_addr = (uintptr_t)mr->addr;
+
+		if (mr_addr <= addr) {
+			lkey = rte_cpu_to_be_32(mr->lkey);
+			entry->start = mr_addr;
+			entry->end = mr_addr + mr->len;
+			entry->lkey = lkey;
+			break;
+		}
+	}
+	return lkey;
+}
+
+/**
+ * Update bottom-half cache from the list of mempool registrations.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param entry
+ *   Pointer to an entry in the bottom-half cache to update
+ *   with the MR lkey looked up.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+static uint32_t
+mlx5_lookup_mempool_regs(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mr_ctrl *mr_ctrl,
+			 struct mr_cache_entry *entry,
+			 struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	struct mlx5_mempool_reg *mpr;
+	uint32_t lkey = UINT32_MAX;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in mempool registrations. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr != NULL)
+		lkey = mlx5_mempool_reg_addr2mr(mpr, addr, entry);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/*
+	 * Update local cache. Even if it fails, return the found entry
+	 * to update top-half cache. Next time, this entry will be found
+	 * in the global cache.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half lookup for the address from the mempool.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+uint32_t
+mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+		      struct mlx5_mr_ctrl *mr_ctrl,
+		      struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		lkey = mlx5_lookup_mempool_regs(share_cache, mr_ctrl, repl,
+						mp, addr);
+		/* Can only fail if the address is not from the mempool. */
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
index 6e465a05e9..685ac98e08 100644
--- a/drivers/common/mlx5/mlx5_common_mr.h
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -13,6 +13,7 @@
 
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_mbuf.h>
 #include <rte_memory.h>
 
 #include "mlx5_glue.h"
@@ -75,6 +76,7 @@ struct mlx5_mr_ctrl {
 } __rte_packed;
 
 LIST_HEAD(mlx5_mr_list, mlx5_mr);
+LIST_HEAD(mlx5_mempool_reg_list, mlx5_mempool_reg);
 
 /* Global per-device MR cache. */
 struct mlx5_mr_share_cache {
@@ -83,6 +85,7 @@ struct mlx5_mr_share_cache {
 	struct mlx5_mr_btree cache; /* Global MR cache table. */
 	struct mlx5_mr_list mr_list; /* Registered MR list. */
 	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+	struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */
 	mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
 	mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 } __rte_packed;
@@ -136,6 +139,10 @@ uint32_t mlx5_mr_addr2mr_bh(void *pd, struct mlx5_mp_id *mp_id,
 			    struct mlx5_mr_ctrl *mr_ctrl,
 			    uintptr_t addr, unsigned int mr_ext_memseg_en);
 __rte_internal
+uint32_t mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+			       struct mlx5_mr_ctrl *mr_ctrl,
+			       struct rte_mempool *mp, uintptr_t addr);
+__rte_internal
 void mlx5_mr_release_cache(struct mlx5_mr_share_cache *mr_cache);
 __rte_internal
 void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
@@ -179,4 +186,14 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr);
 __rte_internal
 void
 mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb);
+
+__rte_internal
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+__rte_internal
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+
 #endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index e5cb6b7060..d5e9635a14 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -149,4 +149,9 @@ INTERNAL {
 	mlx5_realloc;
 
 	mlx5_translate_port_name; # WINDOWS_NO_EXPORT
+
+	mlx5_mr_mempool_register;
+	mlx5_mr_mempool_unregister;
+	mlx5_mp_req_mempool_reg;
+	mlx5_mr_mempool2mr_bh;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v2 4/4] net/mlx5: support mempool registration
  2021-09-29 14:52 ` [dpdk-dev] [PATCH 0/4] net/mlx5: implicit " dkozlyuk
                     ` (2 preceding siblings ...)
  2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 3/4] common/mlx5: add mempool registration facilities dkozlyuk
@ 2021-09-29 14:52   ` dkozlyuk
  2021-10-12  0:04   ` [dpdk-dev] [PATCH v3 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: dkozlyuk @ 2021-09-29 14:52 UTC (permalink / raw)
  To: dev; +Cc: Dmitry Kozlyuk, Matan Azrad, Viacheslav Ovsiienko

From: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>

When the first port in a given protection domain (PD) starts,
install a mempool event callback for this PD and register all existing
mempools for it, creating the needed memory regions (MR).
When the last port in a PD closes, remove the callback
and unregister all mempools for this PD.
This behavior can be switched off with a new devarg: mr_mempool_reg_en.

On the Tx slow path, i.e. when an MR key for the address of the buffer
to send is not in the local cache, first try to retrieve it from
the database of registered mempools. Direct and indirect mbufs
are supported, as well as externally-attached ones from the MLX5 MPRQ
feature. Lookup in the database of non-mempool memory is used
as the last resort.

Rx mempools are registered regardless of the devarg value.
On the Rx data path only the local cache and the mempool database
are used. If implicit mempool registration is disabled, these mempools
are unregistered at port stop, releasing the MRs.
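
As an illustration, implicit registration could be disabled for a given
device through its devargs at EAL initialization; the PCI address below
is a placeholder, not part of this patch:

    #include <rte_common.h>
    #include <rte_eal.h>

    /* Sketch: pass mr_mempool_reg_en=0 in the device devargs. */
    static char *eal_args[] = {
        "app", "-a", "0000:08:00.0,mr_mempool_reg_en=0"
    };

    int
    main(void)
    {
        if (rte_eal_init(RTE_DIM(eal_args), eal_args) < 0)
            return -1;
        /* ... */
        return 0;
    }

The same devargs string can be passed to any DPDK application through
its -a/--allow option, e.g. testpmd.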

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  11 ++
 doc/guides/rel_notes/release_21_11.rst |   6 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 +++++++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h                |  10 ++
 drivers/net/mlx5/mlx5_mr.c             | 120 +++++--------------
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 ++--
 drivers/net/mlx5/mlx5_rxq.c            |  13 +++
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++++++++++--
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 12 files changed, 345 insertions(+), 116 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index bae73f42d8..58d1c5b65c 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1001,6 +1001,17 @@ Driver options
 
   Enabled by default.
 
+- ``mr_mempool_reg_en`` parameter [int]
+
+  A nonzero value enables implicit registration of DMA memory of all mempools
+  except those having ``MEMPOOL_F_NON_IO``. The effect is that when a packet
+  from a mempool is transmitted, its memory is already registered for DMA
+  in the PMD and no registration will happen on the data path. The tradeoff is
+  extra work on the creation of each mempool and increased HW resource use
+  if some mempools are not used with MLX5 devices.
+
+  Enabled by default.
+
 - ``representor`` parameter [list]
 
   This parameter can be used to instantiate DPDK Ethernet devices from
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 873beda633..1fc09faf96 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -106,6 +106,12 @@ New Features
   * Added tests to validate packets hard expiry.
   * Added tests to verify tunnel header verification in IPsec inbound.
 
+* **Updated Mellanox mlx5 driver.**
+
+  Updated the Mellanox mlx5 driver with new features and improvements, including:
+
+  * Added implicit mempool registration to avoid data path hiccups (opt-out).
+
 
 Removed Items
 -------------
diff --git a/drivers/net/mlx5/linux/mlx5_mp_os.c b/drivers/net/mlx5/linux/mlx5_mp_os.c
index 3a4aa766f8..d2ac375a47 100644
--- a/drivers/net/mlx5/linux/mlx5_mp_os.c
+++ b/drivers/net/mlx5/linux/mlx5_mp_os.c
@@ -20,6 +20,45 @@
 #include "mlx5_tx.h"
 #include "mlx5_utils.h"
 
+/**
+ * Handle a port-agnostic message.
+ *
+ * @return
+ *   0 on success, 1 when message is not port-agnostic, (-1) on error.
+ */
+static int
+mlx5_mp_os_handle_port_agnostic(const struct rte_mp_msg *mp_msg,
+				const void *peer)
+{
+	struct rte_mp_msg mp_res;
+	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
+	const struct mlx5_mp_param *param =
+		(const struct mlx5_mp_param *)mp_msg->param;
+	const struct mlx5_mp_arg_mempool_reg *mpr;
+	struct mlx5_mp_id mp_id;
+
+	switch (param->type) {
+	case MLX5_MP_REQ_MEMPOOL_REGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_register(mpr->share_cache,
+						       mpr->pd, mpr->mempool,
+						       NULL);
+		return rte_mp_reply(&mp_res, peer);
+	case MLX5_MP_REQ_MEMPOOL_UNREGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_unregister(mpr->share_cache,
+							 mpr->mempool, NULL);
+		return rte_mp_reply(&mp_res, peer);
+	default:
+		return 1;
+	}
+	return -1;
+}
+
 int
 mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
@@ -34,6 +73,11 @@ mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/* Port-agnostic messages. */
+	ret = mlx5_mp_os_handle_port_agnostic(mp_msg, peer);
+	if (ret <= 0)
+		return ret;
+	/* Port-specific messages. */
 	if (!rte_eth_dev_is_valid_port(param->port_id)) {
 		rte_errno = ENODEV;
 		DRV_LOG(ERR, "port %u invalid port ID", param->port_id);
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 470b16cb9a..78ed588746 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1034,8 +1034,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
-		mp_id.port_id = eth_dev->data->port_id;
-		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+		mlx5_mp_id_init(&mp_id, eth_dev->data->port_id);
 		/* Receive command fd from primary process */
 		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
@@ -2133,6 +2132,7 @@ mlx5_os_config_default(struct mlx5_dev_config *config)
 	config->txqs_inline = MLX5_ARG_UNSET;
 	config->vf_nl_en = 1;
 	config->mr_ext_memseg_en = 1;
+	config->mr_mempool_reg_en = 1;
 	config->mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	config->mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	config->dv_esw_en = 1;
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f84e061fe7..a16be05f7c 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -178,6 +178,9 @@
 /* Device parameter to configure allow or prevent duplicate rules pattern. */
 #define MLX5_ALLOW_DUPLICATE_PATTERN "allow_duplicate_pattern"
 
+/* Device parameter to configure implicit registration of mempool memory. */
+#define MLX5_MR_MEMPOOL_REG_EN "mr_mempool_reg_en"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1085,6 +1088,141 @@ mlx5_alloc_rxtx_uars(struct mlx5_dev_ctx_shared *sh,
 	return err;
 }
 
+/**
+ * Unregister the mempool from the protection domain.
+ *
+ * @param sh
+ *   Pointer to the device shared context.
+ * @param mp
+ *   Mempool being unregistered.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister(struct mlx5_dev_ctx_shared *sh,
+				       struct rte_mempool *mp)
+{
+	struct mlx5_mp_id mp_id;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	if (mlx5_mr_mempool_unregister(&sh->share_cache, mp, &mp_id) < 0)
+		DRV_LOG(WARNING, "Failed to unregister mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to register mempools
+ * for the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_register_cb(struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+	int ret;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	ret = mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp, &mp_id);
+	if (ret < 0 && rte_errno != EEXIST)
+		DRV_LOG(ERR, "Failed to register existing mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to unregister mempools
+ * from the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister_cb(struct rte_mempool *mp, void *arg)
+{
+	mlx5_dev_ctx_shared_mempool_unregister
+				((struct mlx5_dev_ctx_shared *)arg, mp);
+}
+
+/**
+ * Mempool life cycle callback for Ethernet devices.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   Associated mempool.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_event_cb(enum rte_mempool_event event,
+				     struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+
+	switch (event) {
+	case RTE_MEMPOOL_EVENT_READY:
+		mlx5_mp_id_init(&mp_id, 0);
+		if (mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp,
+					     &mp_id) < 0)
+			DRV_LOG(ERR, "Failed to register new mempool %s for PD %p: %s",
+				mp->name, sh->pd, rte_strerror(rte_errno));
+		break;
+	case RTE_MEMPOOL_EVENT_DESTROY:
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+		break;
+	}
+}
+
+/**
+ * Callback used when implicit mempool registration is disabled
+ * in order to track Rx mempool destruction.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   An Rx mempool registered explicitly when the port is started.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_rx_mempool_event_cb(enum rte_mempool_event event,
+					struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+
+	if (event == RTE_MEMPOOL_EVENT_DESTROY)
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+}
+
+int
+mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = priv->sh;
+	int ret;
+
+	/* Check if we only need to track Rx mempool destruction. */
+	if (!priv->config.mr_mempool_reg_en) {
+		ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+		return ret == 0 || rte_errno == EEXIST ? 0 : ret;
+	}
+	/* Callback for this shared context may be already registered. */
+	ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret != 0 && rte_errno != EEXIST)
+		return ret;
+	/* Register mempools only once for this shared context. */
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_register_cb, sh);
+	return 0;
+}
+
 /**
  * Allocate shared device context. If there is multiport device the
  * master and representors will share this context, if there is single
@@ -1282,6 +1420,8 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 void
 mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 {
+	int ret;
+
 	pthread_mutex_lock(&mlx5_dev_ctx_list_mutex);
 #ifdef RTE_LIBRTE_MLX5_DEBUG
 	/* Check the object presence in the list. */
@@ -1302,6 +1442,15 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	if (--sh->refcnt)
 		goto exit;
+	/* Stop watching for mempool events and unregister all mempools. */
+	ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret < 0 && rte_errno == ENOENT)
+		ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_unregister_cb,
+				 sh);
 	/* Remove from memory callback device list. */
 	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
 	LIST_REMOVE(sh, mem_event_cb);
@@ -1991,6 +2140,8 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
 		config->decap_en = !!tmp;
 	} else if (strcmp(MLX5_ALLOW_DUPLICATE_PATTERN, key) == 0) {
 		config->allow_duplicate_pattern = !!tmp;
+	} else if (strcmp(MLX5_MR_MEMPOOL_REG_EN, key) == 0) {
+		config->mr_mempool_reg_en = !!tmp;
 	} else {
 		DRV_LOG(WARNING, "%s: unknown parameter", key);
 		rte_errno = EINVAL;
@@ -2051,6 +2202,7 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
 		MLX5_SYS_MEM_EN,
 		MLX5_DECAP_EN,
 		MLX5_ALLOW_DUPLICATE_PATTERN,
+		MLX5_MR_MEMPOOL_REG_EN,
 		NULL,
 	};
 	struct rte_kvargs *kvlist;
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index e02714e231..89630d569b 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -155,6 +155,13 @@ struct mlx5_flow_dump_ack {
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
+/** Initialize a multi-process ID. */
+static inline void
+mlx5_mp_id_init(struct mlx5_mp_id *mp_id, uint16_t port_id)
+{
+	mp_id->port_id = port_id;
+	strlcpy(mp_id->name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+}
 
 LIST_HEAD(mlx5_dev_list, mlx5_dev_ctx_shared);
 
@@ -270,6 +277,8 @@ struct mlx5_dev_config {
 	unsigned int dv_miss_info:1; /* restore packet after partial hw miss */
 	unsigned int allow_duplicate_pattern:1;
 	/* Allow/Prevent the duplicate rules pattern. */
+	unsigned int mr_mempool_reg_en:1;
+	/* Allow/prevent implicit mempool memory registration. */
 	struct {
 		unsigned int enabled:1; /* Whether MPRQ is enabled. */
 		unsigned int stride_num_n; /* Number of strides. */
@@ -1497,6 +1506,7 @@ struct mlx5_dev_ctx_shared *
 mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 			   const struct mlx5_dev_config *config);
 void mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh);
+int mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev);
 void mlx5_free_table_hash_list(struct mlx5_priv *priv);
 int mlx5_alloc_table_hash_list(struct mlx5_priv *priv);
 void mlx5_set_min_inline(struct mlx5_dev_spawn_data *spawn,
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 44afda731f..55d27b50b9 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -65,30 +65,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Bottom-half of LKey search on Rx.
- *
- * @param rxq
- *   Pointer to Rx queue structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-uint32_t
-mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
-{
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
-	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
-	struct mlx5_priv *priv = rxq_ctrl->priv;
-
-	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, mr_ctrl, addr,
-				  priv->config.mr_ext_memseg_en);
-}
-
 /**
  * Bottom-half of LKey search on Tx.
  *
@@ -128,9 +104,36 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 uint32_t
 mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 {
+	struct mlx5_txq_ctrl *txq_ctrl =
+		container_of(txq, struct mlx5_txq_ctrl, txq);
+	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
+	struct mlx5_priv *priv = txq_ctrl->priv;
 	uintptr_t addr = (uintptr_t)mb->buf_addr;
 	uint32_t lkey;
 
+	if (priv->config.mr_mempool_reg_en) {
+		struct rte_mempool *mp = NULL;
+		struct mlx5_mprq_buf *buf;
+
+		if (!RTE_MBUF_HAS_EXTBUF(mb)) {
+			mp = mlx5_mb2mp(mb);
+		} else if (mb->shinfo->free_cb == mlx5_mprq_buf_free_cb) {
+			/* Recover MPRQ mempool. */
+			buf = mb->shinfo->fcb_opaque;
+			mp = buf->mp;
+		}
+		if (mp != NULL) {
+			lkey = mlx5_mr_mempool2mr_bh(&priv->sh->share_cache,
+						     mr_ctrl, mp, addr);
+			/*
+			 * Lookup can only fail on invalid input, e.g. "addr"
+			 * is not from "mp" or "mp" has MEMPOOL_F_NON_IO set.
+			 */
+			if (lkey != UINT32_MAX)
+				return lkey;
+		}
+		/* Fallback for generic mechanism in corner cases. */
+	}
 	lkey = mlx5_tx_addr2mr_bh(txq, addr);
 	if (lkey == UINT32_MAX && rte_errno == ENXIO) {
 		/* Mempool may have externally allocated memory. */
@@ -392,72 +395,3 @@ mlx5_tx_update_ext_mp(struct mlx5_txq_data *txq, uintptr_t addr,
 	mlx5_mr_update_ext_mp(ETH_DEV(priv), mr_ctrl, mp);
 	return mlx5_tx_addr2mr_bh(txq, addr);
 }
-
-/* Called during rte_mempool_mem_iter() by mlx5_mr_update_mp(). */
-static void
-mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
-		     struct rte_mempool_memhdr *memhdr,
-		     unsigned mem_idx __rte_unused)
-{
-	struct mr_update_mp_data *data = opaque;
-	struct rte_eth_dev *dev = data->dev;
-	struct mlx5_priv *priv = dev->data->dev_private;
-
-	uint32_t lkey;
-
-	/* Stop iteration if failed in the previous walk. */
-	if (data->ret < 0)
-		return;
-	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr,
-				  priv->config.mr_ext_memseg_en);
-	if (lkey == UINT32_MAX)
-		data->ret = -1;
-}
-
-/**
- * Register entire memory chunks in a Mempool.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param mp
- *   Pointer to registering Mempool.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-int
-mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		  struct rte_mempool *mp)
-{
-	struct mr_update_mp_data data = {
-		.dev = dev,
-		.mr_ctrl = mr_ctrl,
-		.ret = 0,
-	};
-	uint32_t flags = rte_pktmbuf_priv_flags(mp);
-
-	if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
-		/*
-		 * The pinned external buffer should be registered for DMA
-		 * operations by application. The mem_list of the pool contains
-		 * the list of chunks with mbuf structures w/o built-in data
-		 * buffers and DMA actually does not happen there, no need
-		 * to create MR for these chunks.
-		 */
-		return 0;
-	}
-	DRV_LOG(DEBUG, "Port %u Rx queue registering mp %s "
-		       "having %u chunks.", dev->data->port_id,
-		       mp->name, mp->nb_mem_chunks);
-	rte_mempool_mem_iter(mp, mlx5_mr_update_mp_cb, &data);
-	if (data.ret < 0 && rte_errno == ENXIO) {
-		/* Mempool may have externally allocated memory. */
-		return mlx5_mr_update_ext_mp(dev, mr_ctrl, mp);
-	}
-	return data.ret;
-}
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 4a7fab6df2..c984e777b5 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -22,7 +22,5 @@
 
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
-int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		      struct rte_mempool *mp);
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 3f2b99fb65..3952e64d50 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -275,13 +275,11 @@ uint16_t mlx5_rx_burst_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 				uint16_t pkts_n);
 
-/* mlx5_mr.c */
-
-uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
+static int mlx5_rxq_mprq_enabled(struct mlx5_rxq_data *rxq);
 
 /**
- * Query LKey from a packet buffer for Rx. No need to flush local caches for Rx
- * as mempool is pre-configured and static.
+ * Query LKey from a packet buffer for Rx. No need to flush local caches
+ * as the Rx mempool database entries are valid for the lifetime of the queue.
  *
  * @param rxq
  *   Pointer to Rx queue structure.
@@ -290,11 +288,14 @@ uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
  *
  * @return
  *   Searched LKey on success, UINT32_MAX on no match.
+ *   This function always succeeds on valid input.
  */
 static __rte_always_inline uint32_t
 mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 {
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
+	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct rte_mempool *mp;
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
@@ -302,8 +303,14 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
-	/* Take slower bottom-half (Binary Search) on miss. */
-	return mlx5_rx_addr2mr_bh(rxq, addr);
+	/*
+	 * Slower search in the mempool database on miss.
+	 * During queue creation rxq->sh is not yet set, so we use rxq_ctrl.
+	 */
+	rxq_ctrl = container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	mp = mlx5_rxq_mprq_enabled(rxq) ? rxq->mprq_mp : rxq->mp;
+	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->priv->sh->share_cache,
+				     mr_ctrl, mp, addr);
 }
 
 #define mlx5_rx_mb2mr(rxq, mb) mlx5_rx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index abd8ce7989..4696f0b851 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1165,6 +1165,7 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 	unsigned int strd_sz_n = 0;
 	unsigned int i;
 	unsigned int n_ibv = 0;
+	int ret;
 
 	if (!mlx5_mprq_enabled(dev))
 		return 0;
@@ -1244,6 +1245,16 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
+	ret = mlx5_mr_mempool_register(&priv->sh->share_cache, priv->sh->pd,
+				       mp, &priv->mp_id);
+	if (ret < 0 && rte_errno != EEXIST) {
+		ret = rte_errno;
+		DRV_LOG(ERR, "port %u failed to register a mempool for Multi-Packet RQ",
+			dev->data->port_id);
+		rte_mempool_free(mp);
+		rte_errno = ret;
+		return -rte_errno;
+	}
 	priv->mprq_mp = mp;
 exit:
 	/* Set mempool for each Rx queue. */
@@ -1446,6 +1457,8 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		/* rte_errno is already set. */
 		goto error;
 	}
+	/* Rx queues don't use this pointer, but we want a valid structure. */
+	tmpl->rxq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	tmpl->socket = socket;
 	if (dev->data->dev_conf.intr_conf.rxq)
 		tmpl->irq = 1;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 54173bfacb..3cbf5816a1 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -105,6 +105,59 @@ mlx5_txq_start(struct rte_eth_dev *dev)
 	return -rte_errno;
 }
 
+/**
+ * Translate the chunk address to an MR key in order to put it into the cache.
+ */
+static void
+mlx5_rxq_mempool_register_cb(struct rte_mempool *mp, void *opaque,
+			     struct rte_mempool_memhdr *memhdr,
+			     unsigned int idx)
+{
+	struct mlx5_rxq_data *rxq = opaque;
+
+	RTE_SET_USED(mp);
+	RTE_SET_USED(idx);
+	mlx5_rx_addr2mr(rxq, (uintptr_t)memhdr->addr);
+}
+
+/**
+ * Register Rx queue mempools and fill the Rx queue cache.
+ * This function tolerates repeated mempool registration.
+ *
+ * @param[in] rxq_ctrl
+ *   Rx queue control data.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+static int
+mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	struct mlx5_priv *priv = rxq_ctrl->priv;
+	struct rte_mempool *mp;
+	uint32_t s;
+	int ret = 0;
+
+	mlx5_mr_flush_local_cache(&rxq_ctrl->rxq.mr_ctrl);
+	/* MPRQ mempool is registered on creation, just fill the cache. */
+	if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
+		rte_mempool_mem_iter(rxq_ctrl->rxq.mprq_mp,
+				     mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+		return 0;
+	}
+	for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++) {
+		mp = rxq_ctrl->rxq.rxseg[s].mp;
+		ret = mlx5_mr_mempool_register(&priv->sh->share_cache,
+					       priv->sh->pd, mp, &priv->mp_id);
+		if (ret < 0 && rte_errno != EEXIST)
+			return ret;
+		rte_mempool_mem_iter(mp, mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+	}
+	return 0;
+}
+
 /**
  * Stop traffic on Rx queues.
  *
@@ -152,18 +205,13 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (!rxq_ctrl)
 			continue;
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-			/* Pre-register Rx mempools. */
-			if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
-				mlx5_mr_update_mp(dev, &rxq_ctrl->rxq.mr_ctrl,
-						  rxq_ctrl->rxq.mprq_mp);
-			} else {
-				uint32_t s;
-
-				for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++)
-					mlx5_mr_update_mp
-						(dev, &rxq_ctrl->rxq.mr_ctrl,
-						rxq_ctrl->rxq.rxseg[s].mp);
-			}
+			/*
+			 * Pre-register the mempools. Regardless of whether
+			 * the implicit registration is enabled or not,
+			 * Rx mempool destruction is tracked to free MRs.
+			 */
+			if (mlx5_rxq_mempool_register(rxq_ctrl) < 0)
+				goto error;
 			ret = rxq_alloc_elts(rxq_ctrl);
 			if (ret)
 				goto error;
@@ -1124,6 +1172,11 @@ mlx5_dev_start(struct rte_eth_dev *dev)
 			dev->data->port_id, strerror(rte_errno));
 		goto error;
 	}
+	if (mlx5_dev_ctx_shared_mempool_subscribe(dev) != 0) {
+		DRV_LOG(ERR, "port %u failed to subscribe for mempool life cycle: %s",
+			dev->data->port_id, rte_strerror(rte_errno));
+		goto error;
+	}
 	rte_wmb();
 	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
 	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
diff --git a/drivers/net/mlx5/windows/mlx5_os.c b/drivers/net/mlx5/windows/mlx5_os.c
index 26fa927039..149253d174 100644
--- a/drivers/net/mlx5/windows/mlx5_os.c
+++ b/drivers/net/mlx5/windows/mlx5_os.c
@@ -1116,6 +1116,7 @@ mlx5_os_net_probe(struct rte_device *dev)
 	dev_config.txqs_inline = MLX5_ARG_UNSET;
 	dev_config.vf_nl_en = 0;
 	dev_config.mr_ext_memseg_en = 1;
+	dev_config.mr_mempool_reg_en = 1;
 	dev_config.mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	dev_config.mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	dev_config.dv_esw_en = 0;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/4] mempool: add event callbacks
  2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 1/4] mempool: add event callbacks dkozlyuk
@ 2021-10-05 16:34     ` Thomas Monjalon
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Monjalon @ 2021-10-05 16:34 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Matan Azrad, Olivier Matz, Andrew Rybchenko, Ray Kinsella,
	Anatoly Burakov

29/09/2021 16:52, dkozlyuk@oss.nvidia.com:
> From: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>
> 
> Performance of MLX5 PMD of different classes can benefit if PMD knows
> which memory it will need to handle in advance, before the first mbuf
> is sent to the PMD. It is impractical, however, to consider
> all allocated memory for this purpose. Most often mbuf memory comes
> from mempools that can come and go. PMD can enumerate existing mempools
> on device start, but it also needs to track creation and destruction
> of mempools after the forwarding starts but before an mbuf from the new
> mempool is sent to the device.

I'm not sure this introduction about mlx5 is appropriate.

> Add an internal API to register callback for mempool lify cycle events,

lify -> life

> currently RTE_MEMPOOL_EVENT_READY (after populating)
> and RTE_MEMPOOL_EVENT_DESTROY (before freeing):
> * rte_mempool_event_callback_register()
> * rte_mempool_event_callback_unregister()
> Provide a unit test for the new API.
[...]
> -rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
> -	unsigned cache_size, unsigned private_data_size,
> -	int socket_id, unsigned flags)
> +rte_mempool_create_empty(const char *name, unsigned int n,
> +	unsigned int elt_size, unsigned int cache_size,
> +	unsigned int private_data_size, int socket_id, unsigned int flags)

This change looks unrelated.

> +enum rte_mempool_event {
> +	/** Occurs after a mempool is successfully populated. */
> +	RTE_MEMPOOL_EVENT_READY = 0,
> +	/** Occurs before destruction of a mempool begins. */
> +	RTE_MEMPOOL_EVENT_DESTROY = 1,
> +};

These events look OK.

> +typedef void (rte_mempool_event_callback)(
> +		enum rte_mempool_event event,
> +		struct rte_mempool *mp,
> +		void *arg);

Instead of "arg", I prefer the name "user_data".




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/4] mempool: add non-IO flag
  2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 2/4] mempool: add non-IO flag dkozlyuk
@ 2021-10-05 16:39     ` Thomas Monjalon
  2021-10-12  6:06       ` Andrew Rybchenko
  0 siblings, 1 reply; 82+ messages in thread
From: Thomas Monjalon @ 2021-10-05 16:39 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Matan Azrad, Olivier Matz, Andrew Rybchenko, mdr

29/09/2021 16:52, dkozlyuk@oss.nvidia.com:
> From: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>
> 
> Mempool is a generic allocator that is not necessarily used for device
> IO operations and its memory for DMA. Add MEMPOOL_F_NON_IO flag to mark
> such mempools.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>
> ---
>  doc/guides/rel_notes/release_21_11.rst | 3 +++
>  lib/mempool/rte_mempool.h              | 4 ++++
>  2 files changed, 7 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> index f85dc99c8b..873beda633 100644
> --- a/doc/guides/rel_notes/release_21_11.rst
> +++ b/doc/guides/rel_notes/release_21_11.rst
> @@ -155,6 +155,9 @@ API Changes
> +* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
> +  that objects from this pool will not be used for device IO (e.g. DMA).

This is not a breaking change, but I am OK to add this note.
Any other opinion?

> +#define MEMPOOL_F_NON_IO         0x0040 /**< Not used for device IO (DMA). */

> + *   - MEMPOOL_F_NO_IO: If set, the mempool is considered to be
> + *     never used for device IO, i.e. DMA operations,
> + *     which may affect some PMD behavior.

Not limited to PMD, it may affect some libs.
I would reword the last line like this:
"No impact on mempool behaviour, but it is a hint for other components."




^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v3 0/4] net/mlx5: implicit mempool registration
  2021-09-29 14:52 ` [dpdk-dev] [PATCH 0/4] net/mlx5: implicit " dkozlyuk
                     ` (3 preceding siblings ...)
  2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 4/4] net/mlx5: support mempool registration dkozlyuk
@ 2021-10-12  0:04   ` Dmitry Kozlyuk
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 1/4] mempool: add event callbacks Dmitry Kozlyuk
                       ` (4 more replies)
  4 siblings, 5 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-12  0:04 UTC (permalink / raw)
  To: dev; +Cc: Thomas Monjalon

MLX5 hardware has its internal IOMMU where PMD registers the memory.
On the data path, PMD translates VA into a key consumed by the device
IOMMU.  It is impractical for the PMD to register all allocated memory
because of increased lookup cost both in HW and SW.  Most often mbuf
memory comes from mempools, so if PMD tracks them, it can almost always
have mbuf memory registered before an mbuf hits the PMD. This patchset
adds such tracking in the PMD and internal API to support it.

Please see [1] for a more thorough explanation of the patch 2/4
and how it can be useful outside of the MLX5 PMD.

[1]: http://inbox.dpdk.org/dev/CH0PR12MB509112FADB778AB28AF3771DB9F99@CH0PR12MB5091.namprd12.prod.outlook.com/

v3: Improve wording and naming; fix typos (Thomas).
v2 (internal review and testing):
    1. Change tracked mempool event from being created (CREATE) to being
       fully populated (READY), which is the state PMD is interested in.
    2. Unit test the new mempool callback API.
    3. Remove bogus "error" messages in normal conditions.
    4. Fixes in PMD.

Dmitry Kozlyuk (4):
  mempool: add event callbacks
  mempool: add non-IO flag
  common/mlx5: add mempool registration facilities
  net/mlx5: support mempool registration

 app/test/test_mempool.c                |  75 ++++
 doc/guides/nics/mlx5.rst               |  11 +
 doc/guides/rel_notes/release_21_11.rst |   9 +
 drivers/common/mlx5/mlx5_common_mp.c   |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h   |  14 +
 drivers/common/mlx5/mlx5_common_mr.c   | 580 +++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h   |  17 +
 drivers/common/mlx5/version.map        |   5 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 ++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++
 drivers/net/mlx5/mlx5.h                |  10 +
 drivers/net/mlx5/mlx5_mr.c             | 120 ++---
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 +-
 drivers/net/mlx5/mlx5_rxq.c            |  13 +
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++-
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 lib/mempool/rte_mempool.c              | 137 ++++++
 lib/mempool/rte_mempool.h              |  60 +++
 lib/mempool/version.map                |   8 +
 21 files changed, 1294 insertions(+), 116 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v3 1/4] mempool: add event callbacks
  2021-10-12  0:04   ` [dpdk-dev] [PATCH v3 0/4] net/mlx5: implicit " Dmitry Kozlyuk
@ 2021-10-12  0:04     ` Dmitry Kozlyuk
  2021-10-12  6:33       ` Andrew Rybchenko
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 2/4] mempool: add non-IO flag Dmitry Kozlyuk
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-12  0:04 UTC (permalink / raw)
  To: dev
  Cc: Thomas Monjalon, Matan Azrad, Olivier Matz, Andrew Rybchenko,
	Ray Kinsella, Anatoly Burakov

Data path performance can benefit if the PMD knows which memory it will
need to handle in advance, before the first mbuf is sent to the PMD.
It is impractical, however, to consider all allocated memory for this
purpose. Most often mbuf memory comes from mempools that can come and
go. PMD can enumerate existing mempools on device start, but it also
needs to track creation and destruction of mempools after the forwarding
starts but before an mbuf from the new mempool is sent to the device.

Add an internal API to register callback for mempool life cycle events:
* rte_mempool_event_callback_register()
* rte_mempool_event_callback_unregister()
Currently tracked events are:
* RTE_MEMPOOL_EVENT_READY (after populating a mempool)
* RTE_MEMPOOL_EVENT_DESTROY (before freeing a mempool)
Provide a unit test for the new API.
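
For illustration, a driver consuming these events could look like the
following sketch; the pmd_* helpers and struct pmd_priv are made-up names
for the example, not part of this patch:

static void
pmd_mempool_event_cb(enum rte_mempool_event event, struct rte_mempool *mp,
		     void *user_data)
{
	struct pmd_priv *priv = user_data; /* illustrative driver context */

	if (event == RTE_MEMPOOL_EVENT_READY)
		pmd_register_mempool_memory(priv, mp); /* illustrative */
	else if (event == RTE_MEMPOOL_EVENT_DESTROY)
		pmd_unregister_mempool_memory(priv, mp); /* illustrative */
}

static int
pmd_subscribe_mempool_events(struct pmd_priv *priv)
{
	/*
	 * Existing mempools can be enumerated with rte_mempool_walk();
	 * future ones are caught by the callback.
	 */
	if (rte_mempool_event_callback_register(pmd_mempool_event_cb,
						priv) < 0 &&
	    rte_errno != EEXIST)
		return -rte_errno;
	return 0;
}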

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 app/test/test_mempool.c   |  75 +++++++++++++++++++++
 lib/mempool/rte_mempool.c | 137 ++++++++++++++++++++++++++++++++++++++
 lib/mempool/rte_mempool.h |  56 ++++++++++++++++
 lib/mempool/version.map   |   8 +++
 4 files changed, 276 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index 7675a3e605..0c4ed7c60b 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -14,6 +14,7 @@
 #include <rte_common.h>
 #include <rte_log.h>
 #include <rte_debug.h>
+#include <rte_errno.h>
 #include <rte_memory.h>
 #include <rte_launch.h>
 #include <rte_cycles.h>
@@ -471,6 +472,74 @@ test_mp_mem_init(struct rte_mempool *mp,
 	data->ret = 0;
 }
 
+struct test_mempool_events_data {
+	struct rte_mempool *mp;
+	enum rte_mempool_event event;
+	bool invoked;
+};
+
+static void
+test_mempool_events_cb(enum rte_mempool_event event,
+		       struct rte_mempool *mp, void *arg)
+{
+	struct test_mempool_events_data *data = arg;
+
+	data->mp = mp;
+	data->event = event;
+	data->invoked = true;
+}
+
+static int
+test_mempool_events(int (*populate)(struct rte_mempool *mp))
+{
+	struct test_mempool_events_data data;
+	struct rte_mempool *mp;
+	int ret;
+
+	ret = rte_mempool_event_callback_register(NULL, &data);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Registered a NULL callback");
+
+	memset(&data, 0, sizeof(data));
+	ret = rte_mempool_event_callback_register(test_mempool_events_cb,
+						  &data);
+	RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to register the callback: %s",
+			      rte_strerror(rte_errno));
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create an empty mempool: %s",
+				 rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data.invoked, false,
+			      "Callback invoked on an empty mempool creation");
+
+	rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate the mempool: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data.invoked, true,
+			      "Callback not invoked on an empty mempool population");
+	RTE_TEST_ASSERT_EQUAL(data.event, RTE_MEMPOOL_EVENT_READY,
+			      "Wrong callback invoked, expected READY");
+	RTE_TEST_ASSERT_EQUAL(data.mp, mp,
+			      "Callback invoked for a wrong mempool");
+
+	memset(&data, 0, sizeof(data));
+	rte_mempool_free(mp);
+	RTE_TEST_ASSERT_EQUAL(data.invoked, true,
+			      "Callback not invoked on mempool destruction");
+	RTE_TEST_ASSERT_EQUAL(data.event, RTE_MEMPOOL_EVENT_DESTROY,
+			      "Wrong callback invoked, expected DESTROY");
+	RTE_TEST_ASSERT_EQUAL(data.mp, mp,
+			      "Callback invoked for a wrong mempool");
+
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    &data);
+	RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister the callback: %s",
+			      rte_strerror(rte_errno));
+	return 0;
+}
+
 static int
 test_mempool(void)
 {
@@ -645,6 +714,12 @@ test_mempool(void)
 	if (test_mempool_basic(default_pool, 1) < 0)
 		GOTO_ERR(ret, err);
 
+	/* test mempool event callbacks */
+	if (test_mempool_events(rte_mempool_populate_default) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events(rte_mempool_populate_anon) < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index c5f859ae71..51c0ba2931 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -42,6 +42,18 @@ static struct rte_tailq_elem rte_mempool_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_mempool_tailq)
 
+TAILQ_HEAD(mempool_callback_list, rte_tailq_entry);
+
+static struct rte_tailq_elem callback_tailq = {
+	.name = "RTE_MEMPOOL_CALLBACK",
+};
+EAL_REGISTER_TAILQ(callback_tailq)
+
+/* Invoke all registered mempool event callbacks. */
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp);
+
 #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
 #define CALC_CACHE_FLUSHTHRESH(c)	\
 	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
@@ -360,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* Report the mempool as ready only when fully populated. */
+	if (mp->populated_size >= mp->size)
+		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
+
 	rte_mempool_trace_populate_iova(mp, vaddr, iova, len, free_cb, opaque);
 	return i;
 
@@ -722,6 +738,7 @@ rte_mempool_free(struct rte_mempool *mp)
 	}
 	rte_mcfg_tailq_write_unlock();
 
+	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
 	rte_mempool_trace_free(mp);
 	rte_mempool_free_memchunks(mp);
 	rte_mempool_ops_free(mp);
@@ -1343,3 +1360,123 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
 
 	rte_mcfg_mempool_read_unlock();
 }
+
+struct mempool_callback {
+	rte_mempool_event_callback *func;
+	void *user_data;
+};
+
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te;
+	void *tmp_te;
+
+	rte_mcfg_tailq_read_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback *cb = te->data;
+		rte_mcfg_tailq_read_unlock();
+		cb->func(event, mp, cb->user_data);
+		rte_mcfg_tailq_read_lock();
+	}
+	rte_mcfg_tailq_read_unlock();
+}
+
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback *cb;
+	void *tmp_te;
+	int ret;
+
+	if (func == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	rte_mcfg_mempool_read_lock();
+	rte_mcfg_tailq_write_lock();
+
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback *cb =
+					(struct mempool_callback *)te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			ret = -EEXIST;
+			goto exit;
+		}
+	}
+
+	te = rte_zmalloc("MEMPOOL_TAILQ_ENTRY", sizeof(*te), 0);
+	if (te == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback tailq entry!\n");
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb = rte_malloc("MEMPOOL_EVENT_CALLBACK", sizeof(*cb), 0);
+	if (cb == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback!\n");
+		rte_free(te);
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb->func = func;
+	cb->user_data = user_data;
+	te->data = cb;
+	TAILQ_INSERT_TAIL(list, te, next);
+	ret = 0;
+
+exit:
+	rte_mcfg_tailq_write_unlock();
+	rte_mcfg_mempool_read_unlock();
+	rte_errno = -ret;
+	return ret;
+}
+
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback *cb;
+	int ret;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		rte_errno = EPERM;
+		return -1;
+	}
+
+	rte_mcfg_mempool_read_lock();
+	rte_mcfg_tailq_write_lock();
+	ret = -ENOENT;
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH(te, list, next) {
+		cb = (struct mempool_callback *)te->data;
+		if (cb->func == func && cb->user_data == user_data)
+			break;
+	}
+	if (te != NULL) {
+		TAILQ_REMOVE(list, te, next);
+		ret = 0;
+	}
+	rte_mcfg_tailq_write_unlock();
+	rte_mcfg_mempool_read_unlock();
+
+	if (ret == 0) {
+		rte_free(te);
+		rte_free(cb);
+	}
+	rte_errno = -ret;
+	return ret;
+}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index f57ecbd6fc..e2bf40aa09 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1774,6 +1774,62 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
 int
 rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
 
+/**
+ * Mempool event type.
+ * @internal
+ */
+enum rte_mempool_event {
+	/** Occurs after a mempool is successfully populated. */
+	RTE_MEMPOOL_EVENT_READY = 0,
+	/** Occurs before destruction of a mempool begins. */
+	RTE_MEMPOOL_EVENT_DESTROY = 1,
+};
+
+/**
+ * @internal
+ * Mempool event callback.
+ */
+typedef void (rte_mempool_event_callback)(
+		enum rte_mempool_event event,
+		struct rte_mempool *mp,
+		void *user_data);
+
+/**
+ * @internal
+ * Register a callback invoked on mempool life cycle event.
+ * Callbacks will be invoked in the process that creates the mempool.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data);
+
+/**
+ * @internal
+ * Unregister a callback added with rte_mempool_event_callback_register().
+ * @p func and @p user_data must exactly match registration parameters.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/mempool/version.map b/lib/mempool/version.map
index 9f77da6fff..1b7d7c5456 100644
--- a/lib/mempool/version.map
+++ b/lib/mempool/version.map
@@ -64,3 +64,11 @@ EXPERIMENTAL {
 	__rte_mempool_trace_ops_free;
 	__rte_mempool_trace_set_ops_byname;
 };
+
+INTERNAL {
+	global:
+
+	# added in 21.11
+	rte_mempool_event_callback_register;
+	rte_mempool_event_callback_unregister;
+};
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v3 2/4] mempool: add non-IO flag
  2021-10-12  0:04   ` [dpdk-dev] [PATCH v3 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-10-12  0:04     ` Dmitry Kozlyuk
  2021-10-12  3:37       ` Jerin Jacob
  2021-10-12  6:42       ` Andrew Rybchenko
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
                       ` (2 subsequent siblings)
  4 siblings, 2 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-12  0:04 UTC (permalink / raw)
  To: dev; +Cc: Thomas Monjalon, Matan Azrad, Olivier Matz, Andrew Rybchenko

Mempool is a generic allocator that is not necessarily used for device
IO operations, and its memory is not necessarily used for DMA.
Add the MEMPOOL_F_NON_IO flag to mark such mempools.
Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html
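
A minimal usage sketch, assuming an application pool of control structures
that never reaches a device (name and sizes are arbitrary):

	/* Objects from this pool are never used for DMA, hint the PMDs. */
	struct rte_mempool *mp = rte_mempool_create("ctrl_msgs", 8192, 128,
						    64, 0, NULL, NULL,
						    NULL, NULL, SOCKET_ID_ANY,
						    MEMPOOL_F_NON_IO);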

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/rel_notes/release_21_11.rst | 3 +++
 lib/mempool/rte_mempool.h              | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 5036641842..dbabdc9759 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -208,6 +208,9 @@ API Changes
   the crypto/security operation. This field will be used to communicate
   events such as soft expiry with IPsec in lookaside mode.
 
+* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
+  that objects from this pool will not be used for device IO (e.g. DMA).
+
 
 ABI Changes
 -----------
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index e2bf40aa09..b48d9f89c2 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -262,6 +262,7 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
+#define MEMPOOL_F_NON_IO         0x0040 /**< Not used for device IO (DMA). */
 
 /**
  * @internal When debug is enabled, store some statistics.
@@ -991,6 +992,9 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, void *);
  *     "single-consumer". Otherwise, it is "multi-consumers".
  *   - MEMPOOL_F_NO_IOVA_CONTIG: If set, allocated objects won't
  *     necessarily be contiguous in IO memory.
+ *   - MEMPOOL_F_NON_IO: If set, the mempool is considered to be
+ *     never used for device IO, i.e. for DMA operations.
+ *     It's a hint to other components and does not affect the mempool behavior.
  * @return
  *   The pointer to the new allocated mempool, on success. NULL on error
  *   with rte_errno set appropriately. Possible rte_errno values include:
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v3 3/4] common/mlx5: add mempool registration facilities
  2021-10-12  0:04   ` [dpdk-dev] [PATCH v3 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 1/4] mempool: add event callbacks Dmitry Kozlyuk
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 2/4] mempool: add non-IO flag Dmitry Kozlyuk
@ 2021-10-12  0:04     ` Dmitry Kozlyuk
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
  2021-10-13 11:01     ` [dpdk-dev] [PATCH v4 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-12  0:04 UTC (permalink / raw)
  To: dev
  Cc: Thomas Monjalon, Matan Azrad, Viacheslav Ovsiienko, Ray Kinsella,
	Anatoly Burakov

Add internal API to register mempools, that is, to create memory
regions (MR) for their memory and store them in a separate database.
Implementation deals with multi-process, so that class drivers don't
need to. Each protection domain has its own database. Memory regions
can be shared within a database if they represent a single hugepage
covering one or more mempools entirely.

Add internal API to lookup an MR key for an address that belongs
to a known mempool. It is the responsibility of a class driver
to extract the mempool from an mbuf.
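
A rough usage sketch for a class driver; sh, priv, mr_ctrl, mp and mb mirror
the net/mlx5 usage in the next patch and are only illustrative here:

	/* Control path: register the mempool once per protection domain. */
	if (mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp,
				     &priv->mp_id) < 0 &&
	    rte_errno != EEXIST)
		return -rte_errno;

	/* Data path bottom half: resolve an lkey for a buffer from mp. */
	lkey = mlx5_mr_mempool2mr_bh(&sh->share_cache, mr_ctrl, mp,
				     (uintptr_t)mb->buf_addr);
	if (lkey == UINT32_MAX) {
		/* Address is not from mp, or mp has MEMPOOL_F_NON_IO set. */
	}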

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 drivers/common/mlx5/mlx5_common_mp.c |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h |  14 +
 drivers/common/mlx5/mlx5_common_mr.c | 580 +++++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h |  17 +
 drivers/common/mlx5/version.map      |   5 +
 5 files changed, 666 insertions(+)

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
index 673a7c31de..6dfc5535e0 100644
--- a/drivers/common/mlx5/mlx5_common_mp.c
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -54,6 +54,56 @@ mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
 	return ret;
 }
 
+/**
+ * @param mp_id
+ *   ID of the MP process.
+ * @param share_cache
+ *   Shared MR cache.
+ * @param pd
+ *   Protection domain.
+ * @param mempool
+ *   Mempool to register or unregister.
+ * @param reg
+ *   True to register the mempool, False to unregister.
+ */
+int
+mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_arg_mempool_reg *arg = &req->args.mempool_reg;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	enum mlx5_mp_req_type type;
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	type = reg ? MLX5_MP_REQ_MEMPOOL_REGISTER :
+		     MLX5_MP_REQ_MEMPOOL_UNREGISTER;
+	mp_init_msg(mp_id, &mp_req, type);
+	arg->share_cache = share_cache;
+	arg->pd = pd;
+	arg->mempool = mempool;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	mlx5_free(mp_rep.msgs);
+	return ret;
+}
+
 /**
  * Request Verbs queue state modification to the primary process.
  *
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
index 6829141fc7..527bf3cad8 100644
--- a/drivers/common/mlx5/mlx5_common_mp.h
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -14,6 +14,8 @@
 enum mlx5_mp_req_type {
 	MLX5_MP_REQ_VERBS_CMD_FD = 1,
 	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_MEMPOOL_REGISTER,
+	MLX5_MP_REQ_MEMPOOL_UNREGISTER,
 	MLX5_MP_REQ_START_RXTX,
 	MLX5_MP_REQ_STOP_RXTX,
 	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
@@ -33,6 +35,12 @@ struct mlx5_mp_arg_queue_id {
 	uint16_t queue_id; /* DPDK queue ID. */
 };
 
+struct mlx5_mp_arg_mempool_reg {
+	struct mlx5_mr_share_cache *share_cache;
+	void *pd; /* NULL for MLX5_MP_REQ_MEMPOOL_UNREGISTER */
+	struct rte_mempool *mempool;
+};
+
 /* Parameters for IPC. */
 struct mlx5_mp_param {
 	enum mlx5_mp_req_type type;
@@ -41,6 +49,8 @@ struct mlx5_mp_param {
 	RTE_STD_C11
 	union {
 		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_mempool_reg mempool_reg;
+		/* MLX5_MP_REQ_MEMPOOL_(UN)REGISTER */
 		struct mlx5_mp_arg_queue_state_modify state_modify;
 		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
 		struct mlx5_mp_arg_queue_id queue_id;
@@ -91,6 +101,10 @@ void mlx5_mp_uninit_secondary(const char *name);
 __rte_internal
 int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
 __rte_internal
+int mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg);
+__rte_internal
 int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
 				   struct mlx5_mp_arg_queue_state_modify *sm);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
index 98fe8698e2..2e039a4e70 100644
--- a/drivers/common/mlx5/mlx5_common_mr.c
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -2,7 +2,10 @@
  * Copyright 2016 6WIND S.A.
  * Copyright 2020 Mellanox Technologies, Ltd
  */
+#include <stddef.h>
+
 #include <rte_eal_memconfig.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_mempool.h>
 #include <rte_malloc.h>
@@ -21,6 +24,29 @@ struct mr_find_contig_memsegs_data {
 	const struct rte_memseg_list *msl;
 };
 
+/* Virtual memory range. */
+struct mlx5_range {
+	uintptr_t start;
+	uintptr_t end;
+};
+
+/** Memory region for a mempool. */
+struct mlx5_mempool_mr {
+	struct mlx5_pmd_mr pmd_mr;
+	uint32_t refcnt; /**< Number of mempools sharing this MR. */
+};
+
+/* Mempool registration. */
+struct mlx5_mempool_reg {
+	LIST_ENTRY(mlx5_mempool_reg) next;
+	/** Registered mempool, used to designate registrations. */
+	struct rte_mempool *mp;
+	/** Memory regions for the address ranges of the mempool. */
+	struct mlx5_mempool_mr *mrs;
+	/** Number of memory regions. */
+	unsigned int mrs_n;
+};
+
 /**
  * Expand B-tree table to a given size. Can't be called with holding
  * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
@@ -1191,3 +1217,557 @@ mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
 	rte_rwlock_read_unlock(&share_cache->rwlock);
 #endif
 }
+
+static int
+mlx5_range_compare_start(const void *lhs, const void *rhs)
+{
+	const struct mlx5_range *r1 = lhs, *r2 = rhs;
+
+	if (r1->start > r2->start)
+		return 1;
+	else if (r1->start < r2->start)
+		return -1;
+	return 0;
+}
+
+static void
+mlx5_range_from_mempool_chunk(struct rte_mempool *mp, void *opaque,
+			      struct rte_mempool_memhdr *memhdr,
+			      unsigned int idx)
+{
+	struct mlx5_range *ranges = opaque, *range = &ranges[idx];
+	uint64_t page_size = rte_mem_page_size();
+
+	RTE_SET_USED(mp);
+	range->start = RTE_ALIGN_FLOOR((uintptr_t)memhdr->addr, page_size);
+	range->end = RTE_ALIGN_CEIL(range->start + memhdr->len, page_size);
+}
+
+/**
+ * Get VA-contiguous ranges of the mempool memory.
+ * Each range start and end is aligned to the system page size.
+ *
+ * @param[in] mp
+ *   Analyzed mempool.
+ * @param[out] out
+ *   Receives the ranges, caller must release it with free().
+ * @param[out] out_n
+ *   Receives the number of @p out elements.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_get_mempool_ranges(struct rte_mempool *mp, struct mlx5_range **out,
+			unsigned int *out_n)
+{
+	struct mlx5_range *chunks;
+	unsigned int chunks_n = mp->nb_mem_chunks, contig_n, i;
+
+	/* Collect page-aligned memory ranges of the mempool. */
+	chunks = calloc(sizeof(chunks[0]), chunks_n);
+	if (chunks == NULL)
+		return -1;
+	rte_mempool_mem_iter(mp, mlx5_range_from_mempool_chunk, chunks);
+	/* Merge adjacent chunks and place them at the beginning. */
+	qsort(chunks, chunks_n, sizeof(chunks[0]), mlx5_range_compare_start);
+	contig_n = 1;
+	for (i = 1; i < chunks_n; i++)
+		if (chunks[i - 1].end != chunks[i].start) {
+			chunks[contig_n - 1].end = chunks[i - 1].end;
+			chunks[contig_n] = chunks[i];
+			contig_n++;
+		}
+	/* Extend the last contiguous chunk to the end of the mempool. */
+	chunks[contig_n - 1].end = chunks[i - 1].end;
+	*out = chunks;
+	*out_n = contig_n;
+	return 0;
+}
+
+/**
+ * Analyze mempool memory to select memory ranges to register.
+ *
+ * @param[in] mp
+ *   Mempool to analyze.
+ * @param[out] out
+ *   Receives memory ranges to register, aligned to the system page size.
+ *   The caller must release them with free().
+ * @param[out] out_n
+ *   Receives the number of @p out items.
+ * @param[out] share_hugepage
+ *   Receives True if the entire pool resides within a single hugepage.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_mempool_reg_analyze(struct rte_mempool *mp, struct mlx5_range **out,
+			 unsigned int *out_n, bool *share_hugepage)
+{
+	struct mlx5_range *ranges = NULL;
+	unsigned int i, ranges_n = 0;
+	struct rte_memseg_list *msl;
+
+	if (mlx5_get_mempool_ranges(mp, &ranges, &ranges_n) < 0) {
+		DRV_LOG(ERR, "Cannot get address ranges for mempool %s",
+			mp->name);
+		return -1;
+	}
+	/* Check if the hugepage of the pool can be shared. */
+	*share_hugepage = false;
+	msl = rte_mem_virt2memseg_list((void *)ranges[0].start);
+	if (msl != NULL) {
+		uint64_t hugepage_sz = 0;
+
+		/* Check that all ranges are on pages of the same size. */
+		for (i = 0; i < ranges_n; i++) {
+			if (hugepage_sz != 0 && hugepage_sz != msl->page_sz)
+				break;
+			hugepage_sz = msl->page_sz;
+		}
+		if (i == ranges_n) {
+			/*
+			 * If the entire pool is within one hugepage,
+			 * combine all ranges into one of the hugepage size.
+			 */
+			uintptr_t reg_start = ranges[0].start;
+			uintptr_t reg_end = ranges[ranges_n - 1].end;
+			uintptr_t hugepage_start =
+				RTE_ALIGN_FLOOR(reg_start, hugepage_sz);
+			uintptr_t hugepage_end = hugepage_start + hugepage_sz;
+			if (reg_end < hugepage_end) {
+				ranges[0].start = hugepage_start;
+				ranges[0].end = hugepage_end;
+				ranges_n = 1;
+				*share_hugepage = true;
+			}
+		}
+	}
+	*out = ranges;
+	*out_n = ranges_n;
+	return 0;
+}
+
+/** Create a registration object for the mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_create(struct rte_mempool *mp, unsigned int mrs_n)
+{
+	struct mlx5_mempool_reg *mpr = NULL;
+
+	mpr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+			  sizeof(*mpr) + mrs_n * sizeof(mpr->mrs[0]),
+			  RTE_CACHE_LINE_SIZE, SOCKET_ID_ANY);
+	if (mpr == NULL) {
+		DRV_LOG(ERR, "Cannot allocate mempool %s registration object",
+			mp->name);
+		return NULL;
+	}
+	mpr->mp = mp;
+	mpr->mrs = (struct mlx5_mempool_mr *)(mpr + 1);
+	mpr->mrs_n = mrs_n;
+	return mpr;
+}
+
+/**
+ * Destroy a mempool registration object.
+ *
+ * @param standalone
+ *   Whether @p mpr owns its MRs exclusively, i.e. they are not shared.
+ */
+static void
+mlx5_mempool_reg_destroy(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mempool_reg *mpr, bool standalone)
+{
+	if (standalone) {
+		unsigned int i;
+
+		for (i = 0; i < mpr->mrs_n; i++)
+			share_cache->dereg_mr_cb(&mpr->mrs[i].pmd_mr);
+	}
+	mlx5_free(mpr);
+}
+
+/** Find registration object of a mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_lookup(struct mlx5_mr_share_cache *share_cache,
+			struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp)
+			break;
+	return mpr;
+}
+
+/** Increment reference counters of MRs used in the registration. */
+static void
+mlx5_mempool_reg_attach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		__atomic_add_fetch(&mpr->mrs[i].refcnt, 1, __ATOMIC_RELAXED);
+}
+
+/**
+ * Decrement reference counters of MRs used in the registration.
+ *
+ * @return True if no more references to @p mpr MRs exist, False otherwise.
+ */
+static bool
+mlx5_mempool_reg_detach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+	bool ret = false;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		ret |= __atomic_sub_fetch(&mpr->mrs[i].refcnt, 1,
+					  __ATOMIC_RELAXED) == 0;
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache,
+				 void *pd, struct rte_mempool *mp)
+{
+	struct mlx5_range *ranges = NULL;
+	struct mlx5_mempool_reg *mpr, *new_mpr;
+	unsigned int i, ranges_n;
+	bool share_hugepage;
+	int ret = -1;
+
+	/* Early check to avoid unnecessary creation of MRs. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+	if (mlx5_mempool_reg_analyze(mp, &ranges, &ranges_n,
+				     &share_hugepage) < 0) {
+		DRV_LOG(ERR, "Cannot get mempool %s memory ranges", mp->name);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	new_mpr = mlx5_mempool_reg_create(mp, ranges_n);
+	if (new_mpr == NULL) {
+		DRV_LOG(ERR,
+			"Cannot create a registration object for mempool %s in PD %p",
+			mp->name, pd);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	/*
+	 * If the entire mempool fits in a single hugepage, the MR for this
+	 * hugepage can be shared across mempools that also fit in it.
+	 */
+	if (share_hugepage) {
+		rte_rwlock_write_lock(&share_cache->rwlock);
+		LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next) {
+			if (mpr->mrs[0].pmd_mr.addr == (void *)ranges[0].start)
+				break;
+		}
+		if (mpr != NULL) {
+			new_mpr->mrs = mpr->mrs;
+			mlx5_mempool_reg_attach(new_mpr);
+			LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+					 new_mpr, next);
+		}
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		if (mpr != NULL) {
+			DRV_LOG(DEBUG, "Shared MR %#x in PD %p for mempool %s with mempool %s",
+				mpr->mrs[0].pmd_mr.lkey, pd, mp->name,
+				mpr->mp->name);
+			ret = 0;
+			goto exit;
+		}
+	}
+	for (i = 0; i < ranges_n; i++) {
+		struct mlx5_mempool_mr *mr = &new_mpr->mrs[i];
+		const struct mlx5_range *range = &ranges[i];
+		size_t len = range->end - range->start;
+
+		if (share_cache->reg_mr_cb(pd, (void *)range->start, len,
+		    &mr->pmd_mr) < 0) {
+			DRV_LOG(ERR,
+				"Failed to create an MR in PD %p for address range "
+				"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+				pd, range->start, range->end, len, mp->name);
+			break;
+		}
+		DRV_LOG(DEBUG,
+			"Created a new MR %#x in PD %p for address range "
+			"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+			mr->pmd_mr.lkey, pd, range->start, range->end, len,
+			mp->name);
+	}
+	if (i != ranges_n) {
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EINVAL;
+		goto exit;
+	}
+	/* Concurrent registration is not supposed to happen. */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr == NULL) {
+		mlx5_mempool_reg_attach(new_mpr);
+		LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+				 new_mpr, next);
+		ret = 0;
+	}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+exit:
+	free(ranges);
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_secondary(struct mlx5_mr_share_cache *share_cache,
+				   void *pd, struct rte_mempool *mp,
+				   struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, pd, mp, true);
+}
+
+/**
+ * Register the memory of a mempool in the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param pd
+ *   Protection domain object.
+ * @param mp
+ *   Mempool to register.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_register_primary(share_cache, pd, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_register_secondary(share_cache, pd, mp,
+							  mp_id);
+	default:
+		return -1;
+	}
+}
+
+static int
+mlx5_mr_mempool_unregister_primary(struct mlx5_mr_share_cache *share_cache,
+				   struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+	bool standalone = false;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp) {
+			LIST_REMOVE(mpr, next);
+			standalone = mlx5_mempool_reg_detach(mpr);
+			if (standalone)
+				/*
+				 * The unlock operation below provides a memory
+				 * barrier due to its store-release semantics.
+				 */
+				++share_cache->dev_gen;
+			break;
+		}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr == NULL) {
+		rte_errno = ENOENT;
+		return -1;
+	}
+	mlx5_mempool_reg_destroy(share_cache, mpr, standalone);
+	return 0;
+}
+
+static int
+mlx5_mr_mempool_unregister_secondary(struct mlx5_mr_share_cache *share_cache,
+				     struct rte_mempool *mp,
+				     struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, NULL, mp, false);
+}
+
+/**
+ * Unregister the memory of a mempool from the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param mp
+ *   Mempool to unregister.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_unregister_primary(share_cache, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_unregister_secondary(share_cache, mp,
+							    mp_id);
+	default:
+		return -1;
+	}
+}
+
+/**
+ * Look up an MR key by an address in a registered mempool.
+ *
+ * @param mpr
+ *   Mempool registration object.
+ * @param addr
+ *   Address within the mempool.
+ * @param entry
+ *   Bottom-half cache entry to fill.
+ *
+ * @return
+ *   MR key or UINT32_MAX on failure, which can only happen
+ *   if the address is not from within the mempool.
+ */
+static uint32_t
+mlx5_mempool_reg_addr2mr(struct mlx5_mempool_reg *mpr, uintptr_t addr,
+			 struct mr_cache_entry *entry)
+{
+	uint32_t lkey = UINT32_MAX;
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++) {
+		const struct mlx5_pmd_mr *mr = &mpr->mrs[i].pmd_mr;
+		uintptr_t mr_addr = (uintptr_t)mr->addr;
+
+		if (mr_addr <= addr) {
+			lkey = rte_cpu_to_be_32(mr->lkey);
+			entry->start = mr_addr;
+			entry->end = mr_addr + mr->len;
+			entry->lkey = lkey;
+			break;
+		}
+	}
+	return lkey;
+}
+
+/**
+ * Update bottom-half cache from the list of mempool registrations.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param entry
+ *   Pointer to an entry in the bottom-half cache to update
+ *   with the MR lkey looked up.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+static uint32_t
+mlx5_lookup_mempool_regs(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mr_ctrl *mr_ctrl,
+			 struct mr_cache_entry *entry,
+			 struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	struct mlx5_mempool_reg *mpr;
+	uint32_t lkey = UINT32_MAX;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in mempool registrations. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr != NULL)
+		lkey = mlx5_mempool_reg_addr2mr(mpr, addr, entry);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/*
+	 * Update local cache. Even if it fails, return the found entry
+	 * to update top-half cache. Next time, this entry will be found
+	 * in the global cache.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half lookup for the address from the mempool.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+uint32_t
+mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+		      struct mlx5_mr_ctrl *mr_ctrl,
+		      struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		lkey = mlx5_lookup_mempool_regs(share_cache, mr_ctrl, repl,
+						mp, addr);
+		/* Can only fail if the address is not from the mempool. */
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
index 6e465a05e9..685ac98e08 100644
--- a/drivers/common/mlx5/mlx5_common_mr.h
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -13,6 +13,7 @@
 
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_mbuf.h>
 #include <rte_memory.h>
 
 #include "mlx5_glue.h"
@@ -75,6 +76,7 @@ struct mlx5_mr_ctrl {
 } __rte_packed;
 
 LIST_HEAD(mlx5_mr_list, mlx5_mr);
+LIST_HEAD(mlx5_mempool_reg_list, mlx5_mempool_reg);
 
 /* Global per-device MR cache. */
 struct mlx5_mr_share_cache {
@@ -83,6 +85,7 @@ struct mlx5_mr_share_cache {
 	struct mlx5_mr_btree cache; /* Global MR cache table. */
 	struct mlx5_mr_list mr_list; /* Registered MR list. */
 	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+	struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */
 	mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
 	mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 } __rte_packed;
@@ -136,6 +139,10 @@ uint32_t mlx5_mr_addr2mr_bh(void *pd, struct mlx5_mp_id *mp_id,
 			    struct mlx5_mr_ctrl *mr_ctrl,
 			    uintptr_t addr, unsigned int mr_ext_memseg_en);
 __rte_internal
+uint32_t mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+			       struct mlx5_mr_ctrl *mr_ctrl,
+			       struct rte_mempool *mp, uintptr_t addr);
+__rte_internal
 void mlx5_mr_release_cache(struct mlx5_mr_share_cache *mr_cache);
 __rte_internal
 void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
@@ -179,4 +186,14 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr);
 __rte_internal
 void
 mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb);
+
+__rte_internal
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+__rte_internal
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+
 #endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index d3c5040aac..85100d5afb 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -152,4 +152,9 @@ INTERNAL {
 	mlx5_realloc;
 
 	mlx5_translate_port_name; # WINDOWS_NO_EXPORT
+
+	mlx5_mr_mempool_register;
+	mlx5_mr_mempool_unregister;
+	mlx5_mp_req_mempool_reg;
+	mlx5_mr_mempool2mr_bh;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v3 4/4] net/mlx5: support mempool registration
  2021-10-12  0:04   ` [dpdk-dev] [PATCH v3 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                       ` (2 preceding siblings ...)
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
@ 2021-10-12  0:04     ` Dmitry Kozlyuk
  2021-10-13 11:01     ` [dpdk-dev] [PATCH v4 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-12  0:04 UTC (permalink / raw)
  To: dev; +Cc: Thomas Monjalon, Matan Azrad, Viacheslav Ovsiienko

When the first port in a given protection domain (PD) starts,
install a mempool event callback for this PD and register all existing
mempools for it. When the last port in a PD closes,
remove the callback and unregister all mempools for this PD.
This behavior can be switched off with a new devarg: mr_mempool_reg_en.

On TX slow path, i.e. when an MR key for the address of the buffer
to send is not in the local cache, first try to retrieve it from
the database of registered mempools. Supported are direct and indirect
mbufs, as well as externally-attached ones from the MLX5 MPRQ feature.
Lookup in the database of non-mempool memory is used as the last resort.

RX mempools are registered regardless of the devarg value.
On the RX data path, only the local cache and the mempool database are used.
If implicit mempool registration is disabled, these mempools
are unregistered at port stop, releasing the MRs.
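
In short, the Tx slow-path lookup order is as follows (a condensed sketch
of the mlx5_tx_mb2mr_bh() change below, not literal code; mbuf_to_mempool()
is a placeholder for the direct/indirect/MPRQ ownership checks):

	if (priv->config.mr_mempool_reg_en) {
		/* Direct/indirect mbuf, or the MPRQ buffer owning the data. */
		struct rte_mempool *mp = mbuf_to_mempool(mb);

		if (mp != NULL) {
			lkey = mlx5_mr_mempool2mr_bh(&priv->sh->share_cache,
						     mr_ctrl, mp, addr);
			if (lkey != UINT32_MAX)
				return lkey;
		}
	}
	/* Last resort: generic lookup that also covers non-mempool memory. */
	return mlx5_tx_addr2mr_bh(txq, addr);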

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  11 ++
 doc/guides/rel_notes/release_21_11.rst |   6 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 +++++++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h                |  10 ++
 drivers/net/mlx5/mlx5_mr.c             | 120 +++++--------------
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 ++--
 drivers/net/mlx5/mlx5_rxq.c            |  13 +++
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++++++++++--
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 12 files changed, 345 insertions(+), 116 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index bae73f42d8..58d1c5b65c 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1001,6 +1001,17 @@ Driver options
 
   Enabled by default.
 
+- ``mr_mempool_reg_en`` parameter [int]
+
+  A nonzero value enables implicit registration of DMA memory of all mempools
+  except those having ``MEMPOOL_F_NON_IO``. The effect is that when a packet
+  from a mempool is transmitted, its memory is already registered for DMA
+  in the PMD and no registration will happen on the data path. The tradeoff is
+  extra work on the creation of each mempool and increased HW resource use
+  if some mempools are not used with MLX5 devices.
+
+  Enabled by default.
+
 - ``representor`` parameter [list]
 
   This parameter can be used to instantiate DPDK Ethernet devices from
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index dbabdc9759..467840bff5 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -141,6 +141,12 @@ New Features
   * Added tests to validate packets hard expiry.
   * Added tests to verify tunnel header verification in IPsec inbound.
 
+* **Updated Mellanox mlx5 driver.**
+
+  Updated the Mellanox mlx5 driver with new features and improvements, including:
+
+  * Added implicit mempool registration to avoid data path hiccups (opt-out).
+
 
 Removed Items
 -------------
diff --git a/drivers/net/mlx5/linux/mlx5_mp_os.c b/drivers/net/mlx5/linux/mlx5_mp_os.c
index 3a4aa766f8..d2ac375a47 100644
--- a/drivers/net/mlx5/linux/mlx5_mp_os.c
+++ b/drivers/net/mlx5/linux/mlx5_mp_os.c
@@ -20,6 +20,45 @@
 #include "mlx5_tx.h"
 #include "mlx5_utils.h"
 
+/**
+ * Handle a port-agnostic message.
+ *
+ * @return
+ *   0 on success, 1 when message is not port-agnostic, (-1) on error.
+ */
+static int
+mlx5_mp_os_handle_port_agnostic(const struct rte_mp_msg *mp_msg,
+				const void *peer)
+{
+	struct rte_mp_msg mp_res;
+	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
+	const struct mlx5_mp_param *param =
+		(const struct mlx5_mp_param *)mp_msg->param;
+	const struct mlx5_mp_arg_mempool_reg *mpr;
+	struct mlx5_mp_id mp_id;
+
+	switch (param->type) {
+	case MLX5_MP_REQ_MEMPOOL_REGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_register(mpr->share_cache,
+						       mpr->pd, mpr->mempool,
+						       NULL);
+		return rte_mp_reply(&mp_res, peer);
+	case MLX5_MP_REQ_MEMPOOL_UNREGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_unregister(mpr->share_cache,
+							 mpr->mempool, NULL);
+		return rte_mp_reply(&mp_res, peer);
+	default:
+		return 1;
+	}
+	return -1;
+}
+
 int
 mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
@@ -34,6 +73,11 @@ mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/* Port-agnostic messages. */
+	ret = mlx5_mp_os_handle_port_agnostic(mp_msg, peer);
+	if (ret <= 0)
+		return ret;
+	/* Port-specific messages. */
 	if (!rte_eth_dev_is_valid_port(param->port_id)) {
 		rte_errno = ENODEV;
 		DRV_LOG(ERR, "port %u invalid port ID", param->port_id);
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 3746057673..e036ed1435 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1034,8 +1034,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
-		mp_id.port_id = eth_dev->data->port_id;
-		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+		mlx5_mp_id_init(&mp_id, eth_dev->data->port_id);
 		/* Receive command fd from primary process */
 		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
@@ -2133,6 +2132,7 @@ mlx5_os_config_default(struct mlx5_dev_config *config)
 	config->txqs_inline = MLX5_ARG_UNSET;
 	config->vf_nl_en = 1;
 	config->mr_ext_memseg_en = 1;
+	config->mr_mempool_reg_en = 1;
 	config->mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	config->mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	config->dv_esw_en = 1;
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 45ccfe2784..1e1b8b736b 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -181,6 +181,9 @@
 /* Device parameter to configure allow or prevent duplicate rules pattern. */
 #define MLX5_ALLOW_DUPLICATE_PATTERN "allow_duplicate_pattern"
 
+/* Device parameter to configure implicit registration of mempool memory. */
+#define MLX5_MR_MEMPOOL_REG_EN "mr_mempool_reg_en"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1088,6 +1091,141 @@ mlx5_alloc_rxtx_uars(struct mlx5_dev_ctx_shared *sh,
 	return err;
 }
 
+/**
+ * Unregister the mempool from the protection domain.
+ *
+ * @param sh
+ *   Pointer to the device shared context.
+ * @param mp
+ *   Mempool being unregistered.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister(struct mlx5_dev_ctx_shared *sh,
+				       struct rte_mempool *mp)
+{
+	struct mlx5_mp_id mp_id;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	if (mlx5_mr_mempool_unregister(&sh->share_cache, mp, &mp_id) < 0)
+		DRV_LOG(WARNING, "Failed to unregister mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to register mempools
+ * for the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_register_cb(struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+	int ret;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	ret = mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp, &mp_id);
+	if (ret < 0 && rte_errno != EEXIST)
+		DRV_LOG(ERR, "Failed to register existing mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to unregister mempools
+ * from the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister_cb(struct rte_mempool *mp, void *arg)
+{
+	mlx5_dev_ctx_shared_mempool_unregister
+				((struct mlx5_dev_ctx_shared *)arg, mp);
+}
+
+/**
+ * Mempool life cycle callback for Ethernet devices.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   Associated mempool.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_event_cb(enum rte_mempool_event event,
+				     struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+
+	switch (event) {
+	case RTE_MEMPOOL_EVENT_READY:
+		mlx5_mp_id_init(&mp_id, 0);
+		if (mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp,
+					     &mp_id) < 0)
+			DRV_LOG(ERR, "Failed to register new mempool %s for PD %p: %s",
+				mp->name, sh->pd, rte_strerror(rte_errno));
+		break;
+	case RTE_MEMPOOL_EVENT_DESTROY:
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+		break;
+	}
+}
+
+/**
+ * Callback used when implicit mempool registration is disabled
+ * in order to track Rx mempool destruction.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   An Rx mempool registered explicitly when the port is started.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_rx_mempool_event_cb(enum rte_mempool_event event,
+					struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+
+	if (event == RTE_MEMPOOL_EVENT_DESTROY)
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+}
+
+int
+mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = priv->sh;
+	int ret;
+
+	/* Check if we only need to track Rx mempool destruction. */
+	if (!priv->config.mr_mempool_reg_en) {
+		ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+		return ret == 0 || rte_errno == EEXIST ? 0 : ret;
+	}
+	/* Callback for this shared context may be already registered. */
+	ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret != 0 && rte_errno != EEXIST)
+		return ret;
+	/* Register mempools only once for this shared context. */
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_register_cb, sh);
+	return 0;
+}
+
 /**
  * Allocate shared device context. If there is multiport device the
  * master and representors will share this context, if there is single
@@ -1287,6 +1425,8 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 void
 mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 {
+	int ret;
+
 	pthread_mutex_lock(&mlx5_dev_ctx_list_mutex);
 #ifdef RTE_LIBRTE_MLX5_DEBUG
 	/* Check the object presence in the list. */
@@ -1307,6 +1447,15 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	if (--sh->refcnt)
 		goto exit;
+	/* Stop watching for mempool events and unregister all mempools. */
+	ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret < 0 && rte_errno == ENOENT)
+		ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_unregister_cb,
+				 sh);
 	/* Remove from memory callback device list. */
 	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
 	LIST_REMOVE(sh, mem_event_cb);
@@ -1997,6 +2146,8 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
 		config->decap_en = !!tmp;
 	} else if (strcmp(MLX5_ALLOW_DUPLICATE_PATTERN, key) == 0) {
 		config->allow_duplicate_pattern = !!tmp;
+	} else if (strcmp(MLX5_MR_MEMPOOL_REG_EN, key) == 0) {
+		config->mr_mempool_reg_en = !!tmp;
 	} else {
 		DRV_LOG(WARNING, "%s: unknown parameter", key);
 		rte_errno = EINVAL;
@@ -2058,6 +2209,7 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
 		MLX5_SYS_MEM_EN,
 		MLX5_DECAP_EN,
 		MLX5_ALLOW_DUPLICATE_PATTERN,
+		MLX5_MR_MEMPOOL_REG_EN,
 		NULL,
 	};
 	struct rte_kvargs *kvlist;
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 3581414b78..fe533fcc81 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -155,6 +155,13 @@ struct mlx5_flow_dump_ack {
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
+/** Initialize a multi-process ID. */
+static inline void
+mlx5_mp_id_init(struct mlx5_mp_id *mp_id, uint16_t port_id)
+{
+	mp_id->port_id = port_id;
+	strlcpy(mp_id->name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+}
 
 LIST_HEAD(mlx5_dev_list, mlx5_dev_ctx_shared);
 
@@ -270,6 +277,8 @@ struct mlx5_dev_config {
 	unsigned int dv_miss_info:1; /* restore packet after partial hw miss */
 	unsigned int allow_duplicate_pattern:1;
 	/* Allow/Prevent the duplicate rules pattern. */
+	unsigned int mr_mempool_reg_en:1;
+	/* Allow/prevent implicit mempool memory registration. */
 	struct {
 		unsigned int enabled:1; /* Whether MPRQ is enabled. */
 		unsigned int stride_num_n; /* Number of strides. */
@@ -1498,6 +1507,7 @@ struct mlx5_dev_ctx_shared *
 mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 			   const struct mlx5_dev_config *config);
 void mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh);
+int mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev);
 void mlx5_free_table_hash_list(struct mlx5_priv *priv);
 int mlx5_alloc_table_hash_list(struct mlx5_priv *priv);
 void mlx5_set_min_inline(struct mlx5_dev_spawn_data *spawn,
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 44afda731f..55d27b50b9 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -65,30 +65,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Bottom-half of LKey search on Rx.
- *
- * @param rxq
- *   Pointer to Rx queue structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-uint32_t
-mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
-{
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
-	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
-	struct mlx5_priv *priv = rxq_ctrl->priv;
-
-	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, mr_ctrl, addr,
-				  priv->config.mr_ext_memseg_en);
-}
-
 /**
  * Bottom-half of LKey search on Tx.
  *
@@ -128,9 +104,36 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 uint32_t
 mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 {
+	struct mlx5_txq_ctrl *txq_ctrl =
+		container_of(txq, struct mlx5_txq_ctrl, txq);
+	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
+	struct mlx5_priv *priv = txq_ctrl->priv;
 	uintptr_t addr = (uintptr_t)mb->buf_addr;
 	uint32_t lkey;
 
+	if (priv->config.mr_mempool_reg_en) {
+		struct rte_mempool *mp = NULL;
+		struct mlx5_mprq_buf *buf;
+
+		if (!RTE_MBUF_HAS_EXTBUF(mb)) {
+			mp = mlx5_mb2mp(mb);
+		} else if (mb->shinfo->free_cb == mlx5_mprq_buf_free_cb) {
+			/* Recover MPRQ mempool. */
+			buf = mb->shinfo->fcb_opaque;
+			mp = buf->mp;
+		}
+		if (mp != NULL) {
+			lkey = mlx5_mr_mempool2mr_bh(&priv->sh->share_cache,
+						     mr_ctrl, mp, addr);
+			/*
+			 * Lookup can only fail on invalid input, e.g. "addr"
+			 * is not from "mp" or "mp" has MEMPOOL_F_NON_IO set.
+			 */
+			if (lkey != UINT32_MAX)
+				return lkey;
+		}
+		/* Fallback for generic mechanism in corner cases. */
+	}
 	lkey = mlx5_tx_addr2mr_bh(txq, addr);
 	if (lkey == UINT32_MAX && rte_errno == ENXIO) {
 		/* Mempool may have externally allocated memory. */
@@ -392,72 +395,3 @@ mlx5_tx_update_ext_mp(struct mlx5_txq_data *txq, uintptr_t addr,
 	mlx5_mr_update_ext_mp(ETH_DEV(priv), mr_ctrl, mp);
 	return mlx5_tx_addr2mr_bh(txq, addr);
 }
-
-/* Called during rte_mempool_mem_iter() by mlx5_mr_update_mp(). */
-static void
-mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
-		     struct rte_mempool_memhdr *memhdr,
-		     unsigned mem_idx __rte_unused)
-{
-	struct mr_update_mp_data *data = opaque;
-	struct rte_eth_dev *dev = data->dev;
-	struct mlx5_priv *priv = dev->data->dev_private;
-
-	uint32_t lkey;
-
-	/* Stop iteration if failed in the previous walk. */
-	if (data->ret < 0)
-		return;
-	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr,
-				  priv->config.mr_ext_memseg_en);
-	if (lkey == UINT32_MAX)
-		data->ret = -1;
-}
-
-/**
- * Register entire memory chunks in a Mempool.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param mp
- *   Pointer to registering Mempool.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-int
-mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		  struct rte_mempool *mp)
-{
-	struct mr_update_mp_data data = {
-		.dev = dev,
-		.mr_ctrl = mr_ctrl,
-		.ret = 0,
-	};
-	uint32_t flags = rte_pktmbuf_priv_flags(mp);
-
-	if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
-		/*
-		 * The pinned external buffer should be registered for DMA
-		 * operations by application. The mem_list of the pool contains
-		 * the list of chunks with mbuf structures w/o built-in data
-		 * buffers and DMA actually does not happen there, no need
-		 * to create MR for these chunks.
-		 */
-		return 0;
-	}
-	DRV_LOG(DEBUG, "Port %u Rx queue registering mp %s "
-		       "having %u chunks.", dev->data->port_id,
-		       mp->name, mp->nb_mem_chunks);
-	rte_mempool_mem_iter(mp, mlx5_mr_update_mp_cb, &data);
-	if (data.ret < 0 && rte_errno == ENXIO) {
-		/* Mempool may have externally allocated memory. */
-		return mlx5_mr_update_ext_mp(dev, mr_ctrl, mp);
-	}
-	return data.ret;
-}
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 4a7fab6df2..c984e777b5 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -22,7 +22,5 @@
 
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
-int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		      struct rte_mempool *mp);
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 2b7ad3e48b..1b00076fe7 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -275,13 +275,11 @@ uint16_t mlx5_rx_burst_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 				uint16_t pkts_n);
 
-/* mlx5_mr.c */
-
-uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
+static int mlx5_rxq_mprq_enabled(struct mlx5_rxq_data *rxq);
 
 /**
- * Query LKey from a packet buffer for Rx. No need to flush local caches for Rx
- * as mempool is pre-configured and static.
+ * Query LKey from a packet buffer for Rx. No need to flush local caches
+ * as the Rx mempool database entries are valid for the lifetime of the queue.
  *
  * @param rxq
  *   Pointer to Rx queue structure.
@@ -290,11 +288,14 @@ uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
  *
  * @return
  *   Searched LKey on success, UINT32_MAX on no match.
+ *   This function always succeeds on valid input.
  */
 static __rte_always_inline uint32_t
 mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 {
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
+	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct rte_mempool *mp;
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
@@ -302,8 +303,14 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
-	/* Take slower bottom-half (Binary Search) on miss. */
-	return mlx5_rx_addr2mr_bh(rxq, addr);
+	/*
+	 * Slower search in the mempool database on miss.
+	 * During queue creation rxq->sh is not yet set, so we use rxq_ctrl.
+	 */
+	rxq_ctrl = container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	mp = mlx5_rxq_mprq_enabled(rxq) ? rxq->mprq_mp : rxq->mp;
+	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->priv->sh->share_cache,
+				     mr_ctrl, mp, addr);
 }
 
 #define mlx5_rx_mb2mr(rxq, mb) mlx5_rx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index b68443bed5..247f36e5d7 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1162,6 +1162,7 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 	unsigned int strd_sz_n = 0;
 	unsigned int i;
 	unsigned int n_ibv = 0;
+	int ret;
 
 	if (!mlx5_mprq_enabled(dev))
 		return 0;
@@ -1241,6 +1242,16 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
+	ret = mlx5_mr_mempool_register(&priv->sh->share_cache, priv->sh->pd,
+				       mp, &priv->mp_id);
+	if (ret < 0 && rte_errno != EEXIST) {
+		ret = rte_errno;
+		DRV_LOG(ERR, "port %u failed to register a mempool for Multi-Packet RQ",
+			dev->data->port_id);
+		rte_mempool_free(mp);
+		rte_errno = ret;
+		return -rte_errno;
+	}
 	priv->mprq_mp = mp;
 exit:
 	/* Set mempool for each Rx queue. */
@@ -1443,6 +1454,8 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		/* rte_errno is already set. */
 		goto error;
 	}
+	/* Rx queues don't use this pointer, but we want a valid structure. */
+	tmpl->rxq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	tmpl->socket = socket;
 	if (dev->data->dev_conf.intr_conf.rxq)
 		tmpl->irq = 1;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 54173bfacb..3cbf5816a1 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -105,6 +105,59 @@ mlx5_txq_start(struct rte_eth_dev *dev)
 	return -rte_errno;
 }
 
+/**
+ * Translate the chunk address to an MR key in order to put it into the cache.
+ */
+static void
+mlx5_rxq_mempool_register_cb(struct rte_mempool *mp, void *opaque,
+			     struct rte_mempool_memhdr *memhdr,
+			     unsigned int idx)
+{
+	struct mlx5_rxq_data *rxq = opaque;
+
+	RTE_SET_USED(mp);
+	RTE_SET_USED(idx);
+	mlx5_rx_addr2mr(rxq, (uintptr_t)memhdr->addr);
+}
+
+/**
+ * Register Rx queue mempools and fill the Rx queue cache.
+ * This function tolerates repeated mempool registration.
+ *
+ * @param[in] rxq_ctrl
+ *   Rx queue control data.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+static int
+mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	struct mlx5_priv *priv = rxq_ctrl->priv;
+	struct rte_mempool *mp;
+	uint32_t s;
+	int ret = 0;
+
+	mlx5_mr_flush_local_cache(&rxq_ctrl->rxq.mr_ctrl);
+	/* MPRQ mempool is registered on creation, just fill the cache. */
+	if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
+		rte_mempool_mem_iter(rxq_ctrl->rxq.mprq_mp,
+				     mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+		return 0;
+	}
+	for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++) {
+		mp = rxq_ctrl->rxq.rxseg[s].mp;
+		ret = mlx5_mr_mempool_register(&priv->sh->share_cache,
+					       priv->sh->pd, mp, &priv->mp_id);
+		if (ret < 0 && rte_errno != EEXIST)
+			return ret;
+		rte_mempool_mem_iter(mp, mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+	}
+	return 0;
+}
+
 /**
  * Stop traffic on Rx queues.
  *
@@ -152,18 +205,13 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (!rxq_ctrl)
 			continue;
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-			/* Pre-register Rx mempools. */
-			if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
-				mlx5_mr_update_mp(dev, &rxq_ctrl->rxq.mr_ctrl,
-						  rxq_ctrl->rxq.mprq_mp);
-			} else {
-				uint32_t s;
-
-				for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++)
-					mlx5_mr_update_mp
-						(dev, &rxq_ctrl->rxq.mr_ctrl,
-						rxq_ctrl->rxq.rxseg[s].mp);
-			}
+			/*
+			 * Pre-register the mempools. Regardless of whether
+			 * the implicit registration is enabled or not,
+			 * Rx mempool destruction is tracked to free MRs.
+			 */
+			if (mlx5_rxq_mempool_register(rxq_ctrl) < 0)
+				goto error;
 			ret = rxq_alloc_elts(rxq_ctrl);
 			if (ret)
 				goto error;
@@ -1124,6 +1172,11 @@ mlx5_dev_start(struct rte_eth_dev *dev)
 			dev->data->port_id, strerror(rte_errno));
 		goto error;
 	}
+	if (mlx5_dev_ctx_shared_mempool_subscribe(dev) != 0) {
+		DRV_LOG(ERR, "port %u failed to subscribe for mempool life cycle: %s",
+			dev->data->port_id, rte_strerror(rte_errno));
+		goto error;
+	}
 	rte_wmb();
 	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
 	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
diff --git a/drivers/net/mlx5/windows/mlx5_os.c b/drivers/net/mlx5/windows/mlx5_os.c
index 26fa927039..149253d174 100644
--- a/drivers/net/mlx5/windows/mlx5_os.c
+++ b/drivers/net/mlx5/windows/mlx5_os.c
@@ -1116,6 +1116,7 @@ mlx5_os_net_probe(struct rte_device *dev)
 	dev_config.txqs_inline = MLX5_ARG_UNSET;
 	dev_config.vf_nl_en = 0;
 	dev_config.mr_ext_memseg_en = 1;
+	dev_config.mr_mempool_reg_en = 1;
 	dev_config.mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	dev_config.mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	dev_config.dv_esw_en = 0;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] mempool: add event callbacks
  2021-08-18  9:07 ` [dpdk-dev] [PATCH 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-10-12  3:12   ` Jerin Jacob
  0 siblings, 0 replies; 82+ messages in thread
From: Jerin Jacob @ 2021-10-12  3:12 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dpdk-dev, Matan Azrad, Olivier Matz, Andrew Rybchenko,
	Ray Kinsella, Anatoly Burakov

On Wed, Aug 18, 2021 at 2:38 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
>
> Performance of MLX5 PMD of different classes can benefit if PMD knows
> which memory it will need to handle in advance, before the first mbuf
> is sent to the PMD. It is impractical, however, to consider
> all allocated memory for this purpose. Most often mbuf memory comes
> from mempools that can come and go. PMD can enumerate existing mempools
> on device start, but it also needs to track mempool creation
> and destruction after the forwarding starts but before an mbuf
> from the new mempool is sent to the device.
>
> Add an internal API to register callback for mempool life cycle events,
> currently RTE_MEMPOOL_EVENT_CREATE and RTE_MEMPOOL_EVENT_DESTROY:
> * rte_mempool_event_callback_register()
> * rte_mempool_event_callback_unregister()
>
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>

Acked-by: Jerin Jacob <jerinj@marvell.com>



> ---
>  lib/mempool/rte_mempool.c | 153 ++++++++++++++++++++++++++++++++++++--
>  lib/mempool/rte_mempool.h |  56 ++++++++++++++
>  lib/mempool/version.map   |   8 ++
>  3 files changed, 212 insertions(+), 5 deletions(-)
>
> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> index 59a588425b..0ec56ad278 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -42,6 +42,18 @@ static struct rte_tailq_elem rte_mempool_tailq = {
>  };
>  EAL_REGISTER_TAILQ(rte_mempool_tailq)
>
> +TAILQ_HEAD(mempool_callback_list, rte_tailq_entry);
> +
> +static struct rte_tailq_elem callback_tailq = {
> +       .name = "RTE_MEMPOOL_CALLBACK",
> +};
> +EAL_REGISTER_TAILQ(callback_tailq)
> +
> +/* Invoke all registered mempool event callbacks. */
> +static void
> +mempool_event_callback_invoke(enum rte_mempool_event event,
> +                             struct rte_mempool *mp);
> +
>  #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
>  #define CALC_CACHE_FLUSHTHRESH(c)      \
>         ((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
> @@ -722,6 +734,7 @@ rte_mempool_free(struct rte_mempool *mp)
>         }
>         rte_mcfg_tailq_write_unlock();
>
> +       mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
>         rte_mempool_trace_free(mp);
>         rte_mempool_free_memchunks(mp);
>         rte_mempool_ops_free(mp);
> @@ -778,10 +791,10 @@ rte_mempool_cache_free(struct rte_mempool_cache *cache)
>  }
>
>  /* create an empty mempool */
> -struct rte_mempool *
> -rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
> -       unsigned cache_size, unsigned private_data_size,
> -       int socket_id, unsigned flags)
> +static struct rte_mempool *
> +mempool_create_empty(const char *name, unsigned int n,
> +       unsigned int elt_size, unsigned int cache_size,
> +       unsigned int private_data_size, int socket_id, unsigned int flags)
>  {
>         char mz_name[RTE_MEMZONE_NAMESIZE];
>         struct rte_mempool_list *mempool_list;
> @@ -915,6 +928,19 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
>         return NULL;
>  }
>
> +struct rte_mempool *
> +rte_mempool_create_empty(const char *name, unsigned int n,
> +       unsigned int elt_size, unsigned int cache_size,
> +       unsigned int private_data_size, int socket_id, unsigned int flags)
> +{
> +       struct rte_mempool *mp;
> +
> +       mp = mempool_create_empty(name, n, elt_size, cache_size,
> +               private_data_size, socket_id, flags);
> +       mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_CREATE, mp);
> +       return mp;
> +}
> +
>  /* create the mempool */
>  struct rte_mempool *
>  rte_mempool_create(const char *name, unsigned n, unsigned elt_size,
> @@ -926,7 +952,7 @@ rte_mempool_create(const char *name, unsigned n, unsigned elt_size,
>         int ret;
>         struct rte_mempool *mp;
>
> -       mp = rte_mempool_create_empty(name, n, elt_size, cache_size,
> +       mp = mempool_create_empty(name, n, elt_size, cache_size,
>                 private_data_size, socket_id, flags);
>         if (mp == NULL)
>                 return NULL;
> @@ -958,6 +984,8 @@ rte_mempool_create(const char *name, unsigned n, unsigned elt_size,
>         if (obj_init)
>                 rte_mempool_obj_iter(mp, obj_init, obj_init_arg);
>
> +       mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_CREATE, mp);
> +
>         rte_mempool_trace_create(name, n, elt_size, cache_size,
>                 private_data_size, mp_init, mp_init_arg, obj_init,
>                 obj_init_arg, flags, mp);
> @@ -1343,3 +1371,118 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
>
>         rte_mcfg_mempool_read_unlock();
>  }
> +
> +struct mempool_callback {
> +       rte_mempool_event_callback *func;
> +       void *arg;
> +};
> +
> +static void
> +mempool_event_callback_invoke(enum rte_mempool_event event,
> +                             struct rte_mempool *mp)
> +{
> +       struct mempool_callback_list *list;
> +       struct rte_tailq_entry *te;
> +       void *tmp_te;
> +
> +       rte_mcfg_tailq_read_lock();
> +       list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
> +       TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
> +               struct mempool_callback *cb = te->data;
> +               rte_mcfg_tailq_read_unlock();
> +               cb->func(event, mp, cb->arg);
> +               rte_mcfg_tailq_read_lock();
> +       }
> +       rte_mcfg_tailq_read_unlock();
> +}
> +
> +int
> +rte_mempool_event_callback_register(rte_mempool_event_callback *func,
> +                                   void *arg)
> +{
> +       struct mempool_callback_list *list;
> +       struct rte_tailq_entry *te = NULL;
> +       struct mempool_callback *cb;
> +       void *tmp_te;
> +       int ret;
> +
> +       rte_mcfg_mempool_read_lock();
> +       rte_mcfg_tailq_write_lock();
> +
> +       list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
> +       TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
> +               struct mempool_callback *cb =
> +                                       (struct mempool_callback *)te->data;
> +               if (cb->func == func && cb->arg == arg) {
> +                       ret = -EEXIST;
> +                       goto exit;
> +               }
> +       }
> +
> +       te = rte_zmalloc("MEMPOOL_TAILQ_ENTRY", sizeof(*te), 0);
> +       if (te == NULL) {
> +               RTE_LOG(ERR, MEMPOOL,
> +                       "Cannot allocate event callback tailq entry!\n");
> +               ret = -ENOMEM;
> +               goto exit;
> +       }
> +
> +       cb = rte_malloc("MEMPOOL_EVENT_CALLBACK", sizeof(*cb), 0);
> +       if (cb == NULL) {
> +               RTE_LOG(ERR, MEMPOOL,
> +                       "Cannot allocate event callback!\n");
> +               rte_free(te);
> +               ret = -ENOMEM;
> +               goto exit;
> +       }
> +
> +       cb->func = func;
> +       cb->arg = arg;
> +       te->data = cb;
> +       TAILQ_INSERT_TAIL(list, te, next);
> +       ret = 0;
> +
> +exit:
> +       rte_mcfg_tailq_write_unlock();
> +       rte_mcfg_mempool_read_unlock();
> +       rte_errno = -ret;
> +       return ret;
> +}
> +
> +int
> +rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
> +                                     void *arg)
> +{
> +       struct mempool_callback_list *list;
> +       struct rte_tailq_entry *te = NULL;
> +       struct mempool_callback *cb;
> +       int ret;
> +
> +       if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
> +               rte_errno = EPERM;
> +               return -1;
> +       }
> +
> +       rte_mcfg_mempool_read_lock();
> +       rte_mcfg_tailq_write_lock();
> +       ret = -ENOENT;
> +       list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
> +       TAILQ_FOREACH(te, list, next) {
> +               cb = (struct mempool_callback *)te->data;
> +               if (cb->func == func && cb->arg == arg)
> +                       break;
> +       }
> +       if (te != NULL) {
> +               TAILQ_REMOVE(list, te, next);
> +               ret = 0;
> +       }
> +       rte_mcfg_tailq_write_unlock();
> +       rte_mcfg_mempool_read_unlock();
> +
> +       if (ret == 0) {
> +               rte_free(te);
> +               rte_free(cb);
> +       }
> +       rte_errno = -ret;
> +       return ret;
> +}
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 4235d6f0bf..1e9b8f0229 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1775,6 +1775,62 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
>  int
>  rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
>
> +/**
> + * Mempool event type.
> + * @internal
> + */
> +enum rte_mempool_event {
> +       /** Occurs after a successful mempool creation. */
> +       RTE_MEMPOOL_EVENT_CREATE = 0,
> +       /** Occurs before destruction of a mempool begins. */
> +       RTE_MEMPOOL_EVENT_DESTROY = 1,
> +};
> +
> +/**
> + * @internal
> + * Mempool event callback.
> + */
> +typedef void (rte_mempool_event_callback)(
> +               enum rte_mempool_event event,
> +               struct rte_mempool *mp,
> +               void *arg);
> +
> +/**
> + * @internal
> + * Register a callback invoked on mempool life cycle event.
> + * Callbacks will be invoked in the process that creates the mempool.
> + *
> + * @param cb
> + *   Callback function.
> + * @param cb_arg
> + *   User data.
> + *
> + * @return
> + *   0 on success, negative on failure and rte_errno is set.
> + */
> +__rte_internal
> +int
> +rte_mempool_event_callback_register(rte_mempool_event_callback *cb,
> +                                   void *cb_arg);
> +
> +/**
> + * @internal
> + * Unregister a callback added with rte_mempool_event_callback_register().
> + * @p cb and @p arg must exactly match registration parameters.
> + *
> + * @param cb
> + *   Callback function.
> + * @param cb_arg
> + *   User data.
> + *
> + * @return
> + *   0 on success, negative on failure and rte_errno is set.
> + */
> +__rte_internal
> +int
> +rte_mempool_event_callback_unregister(rte_mempool_event_callback *cb,
> +                                     void *cb_arg);
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/mempool/version.map b/lib/mempool/version.map
> index 9f77da6fff..1b7d7c5456 100644
> --- a/lib/mempool/version.map
> +++ b/lib/mempool/version.map
> @@ -64,3 +64,11 @@ EXPERIMENTAL {
>         __rte_mempool_trace_ops_free;
>         __rte_mempool_trace_set_ops_byname;
>  };
> +
> +INTERNAL {
> +       global:
> +
> +       # added in 21.11
> +       rte_mempool_event_callback_register;
> +       rte_mempool_event_callback_unregister;
> +};
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/4] mempool: add non-IO flag
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 2/4] mempool: add non-IO flag Dmitry Kozlyuk
@ 2021-10-12  3:37       ` Jerin Jacob
  2021-10-12  6:42       ` Andrew Rybchenko
  1 sibling, 0 replies; 82+ messages in thread
From: Jerin Jacob @ 2021-10-12  3:37 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dpdk-dev, Thomas Monjalon, Matan Azrad, Olivier Matz, Andrew Rybchenko

On Tue, Oct 12, 2021 at 5:34 AM Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com> wrote:
>
> Mempool is a generic allocator that is not necessarily used for device
> IO operations and its memory for DMA. Add MEMPOOL_F_NON_IO flag to mark
> such mempools.
> Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html
>
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>
> ---
>  doc/guides/rel_notes/release_21_11.rst | 3 +++
>  lib/mempool/rte_mempool.h              | 4 ++++
>  2 files changed, 7 insertions(+)
>
> diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> index 5036641842..dbabdc9759 100644
> --- a/doc/guides/rel_notes/release_21_11.rst
> +++ b/doc/guides/rel_notes/release_21_11.rst
> @@ -208,6 +208,9 @@ API Changes
>    the crypto/security operation. This field will be used to communicate
>    events such as soft expiry with IPsec in lookaside mode.
>
> +* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
> +  that objects from this pool will not be used for device IO (e.g. DMA).
> +
>
>  ABI Changes
>  -----------
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index e2bf40aa09..b48d9f89c2 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -262,6 +262,7 @@ struct rte_mempool {
>  #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
>  #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
>  #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
> +#define MEMPOOL_F_NON_IO         0x0040 /**< Not used for device IO (DMA). */

Since it is a hint, how about changing the flag to MEMPOOL_F_HINT_NON_IO?
Otherwise, it looks good to me.
Acked-by: Jerin Jacob <jerinj@marvell.com>


>
>  /**
>   * @internal When debug is enabled, store some statistics.
> @@ -991,6 +992,9 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, void *);
>   *     "single-consumer". Otherwise, it is "multi-consumers".
>   *   - MEMPOOL_F_NO_IOVA_CONTIG: If set, allocated objects won't
>   *     necessarily be contiguous in IO memory.
> + *   - MEMPOOL_F_NON_IO: If set, the mempool is considered to be
> + *     never used for device IO, i.e. for DMA operations.
> + *     It's a hint to other components and does not affect the mempool behavior.
>   * @return
>   *   The pointer to the new allocated mempool, on success. NULL on error
>   *   with rte_errno set appropriately. Possible rte_errno values include:
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/4] mempool: add non-IO flag
  2021-10-05 16:39     ` Thomas Monjalon
@ 2021-10-12  6:06       ` Andrew Rybchenko
  0 siblings, 0 replies; 82+ messages in thread
From: Andrew Rybchenko @ 2021-10-12  6:06 UTC (permalink / raw)
  To: Thomas Monjalon, Dmitry Kozlyuk; +Cc: dev, Matan Azrad, Olivier Matz, mdr

On 10/5/21 7:39 PM, Thomas Monjalon wrote:
> 29/09/2021 16:52, dkozlyuk@oss.nvidia.com:
>> From: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>
>>
>> Mempool is a generic allocator that is not necessarily used for device
>> IO operations and its memory for DMA. Add MEMPOOL_F_NON_IO flag to mark
>> such mempools.
>>
>> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>
>> Acked-by: Matan Azrad <matan@nvidia.com>
>> ---
>>  doc/guides/rel_notes/release_21_11.rst | 3 +++
>>  lib/mempool/rte_mempool.h              | 4 ++++
>>  2 files changed, 7 insertions(+)
>>
>> diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
>> index f85dc99c8b..873beda633 100644
>> --- a/doc/guides/rel_notes/release_21_11.rst
>> +++ b/doc/guides/rel_notes/release_21_11.rst
>> @@ -155,6 +155,9 @@ API Changes
>> +* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
>> +  that objects from this pool will not be used for device IO (e.g. DMA).
> 
> This is not a breaking change, but I am OK to add this note.
> Any other opinion?

Why is it in the API changes section? Shouldn't it be in
"New Features"?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/4] mempool: add event callbacks
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-10-12  6:33       ` Andrew Rybchenko
  2021-10-12  9:37         ` Dmitry Kozlyuk
  0 siblings, 1 reply; 82+ messages in thread
From: Andrew Rybchenko @ 2021-10-12  6:33 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Thomas Monjalon, Matan Azrad, Olivier Matz, Ray Kinsella,
	Anatoly Burakov

On 10/12/21 3:04 AM, Dmitry Kozlyuk wrote:
> Data path performance can benefit if the PMD knows which memory it will
> need to handle in advance, before the first mbuf is sent to the PMD.
> It is impractical, however, to consider all allocated memory for this
> purpose. Most often mbuf memory comes from mempools that can come and
> go. PMD can enumerate existing mempools on device start, but it also
> needs to track creation and destruction of mempools after the forwarding
> starts but before an mbuf from the new mempool is sent to the device.
> 
> Add an internal API to register callback for mempool life cycle events:
> * rte_mempool_event_callback_register()
> * rte_mempool_event_callback_unregister()
> Currently tracked events are:
> * RTE_MEMPOOL_EVENT_READY (after populating a mempool)
> * RTE_MEMPOOL_EVENT_DESTROY (before freeing a mempool)
> Provide a unit test for the new API.

Good idea.

> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>

[snip]

I think it would be very useful to test two callbacks as well,
including a new mempool creation after one of the callbacks is
unregistered, plus registering/unregistering callbacks from within
a callback itself. Feel free to drop it, since increasing test
coverage is almost endless :)
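
For example, the two-callback case could look roughly like this (a sketch
only; it registers one counting function twice with different user data,
since registrations are keyed by the (function, user_data) pair; names and
mempool parameters are invented):

/* Counting callback; user_data points to a counter. */
static void
count_mempool_event(enum rte_mempool_event event,
		    struct rte_mempool *mp, void *user_data)
{
	RTE_SET_USED(mp);
	if (event == RTE_MEMPOOL_EVENT_READY)
		(*(unsigned int *)user_data)++;
}

	/* ...inside the test body... */
	unsigned int hits_a = 0, hits_b = 0;
	struct rte_mempool *mp1, *mp2;

	rte_mempool_event_callback_register(count_mempool_event, &hits_a);
	rte_mempool_event_callback_register(count_mempool_event, &hits_b);
	mp1 = rte_mempool_create("evt_mp1", 64, 128, 0, 0, NULL, NULL,
				 NULL, NULL, SOCKET_ID_ANY, 0);
	/* Expect hits_a == 1 and hits_b == 1. */
	rte_mempool_event_callback_unregister(count_mempool_event, &hits_a);
	mp2 = rte_mempool_create("evt_mp2", 64, 128, 0, 0, NULL, NULL,
				 NULL, NULL, SOCKET_ID_ANY, 0);
	/* Expect hits_a == 1 and hits_b == 2. */
	rte_mempool_free(mp1);
	rte_mempool_free(mp2);
	rte_mempool_event_callback_unregister(count_mempool_event, &hits_b);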

> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index f57ecbd6fc..e2bf40aa09 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1774,6 +1774,62 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
>  int
>  rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
>  
> +/**
> + * Mempool event type.
> + * @internal
> + */
> +enum rte_mempool_event {
> +	/** Occurs after a mempool is successfully populated. */

successfully -> fully ?

> +	RTE_MEMPOOL_EVENT_READY = 0,
> +	/** Occurs before destruction of a mempool begins. */
> +	RTE_MEMPOOL_EVENT_DESTROY = 1,
> +};
> +
> +/**
> + * @internal
> + * Mempool event callback.
> + */
> +typedef void (rte_mempool_event_callback)(
> +		enum rte_mempool_event event,
> +		struct rte_mempool *mp,
> +		void *user_data);
> +
> +/**
> + * @internal

I'd like to understand why the API is internal (not
experimental). I think the reasons should be clear from
the function description.

> + * Register a callback invoked on mempool life cycle event.
> + * Callbacks will be invoked in the process that creates the mempool.
> + *
> + * @param func
> + *   Callback function.
> + * @param user_data
> + *   User data.
> + *
> + * @return
> + *   0 on success, negative on failure and rte_errno is set.
> + */
> +__rte_internal
> +int
> +rte_mempool_event_callback_register(rte_mempool_event_callback *func,
> +				    void *user_data);
> +
> +/**
> + * @internal
> + * Unregister a callback added with rte_mempool_event_callback_register().
> + * @p func and @p user_data must exactly match registration parameters.
> + *
> + * @param func
> + *   Callback function.
> + * @param user_data
> + *   User data.
> + *
> + * @return
> + *   0 on success, negative on failure and rte_errno is set.
> + */
> +__rte_internal
> +int
> +rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
> +				      void *user_data);
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/mempool/version.map b/lib/mempool/version.map
> index 9f77da6fff..1b7d7c5456 100644
> --- a/lib/mempool/version.map
> +++ b/lib/mempool/version.map
> @@ -64,3 +64,11 @@ EXPERIMENTAL {
>  	__rte_mempool_trace_ops_free;
>  	__rte_mempool_trace_set_ops_byname;
>  };
> +
> +INTERNAL {
> +	global:
> +
> +	# added in 21.11
> +	rte_mempool_event_callback_register;
> +	rte_mempool_event_callback_unregister;
> +};
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/4] mempool: add non-IO flag
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 2/4] mempool: add non-IO flag Dmitry Kozlyuk
  2021-10-12  3:37       ` Jerin Jacob
@ 2021-10-12  6:42       ` Andrew Rybchenko
  2021-10-12 12:40         ` Dmitry Kozlyuk
  1 sibling, 1 reply; 82+ messages in thread
From: Andrew Rybchenko @ 2021-10-12  6:42 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev; +Cc: Thomas Monjalon, Matan Azrad, Olivier Matz

On 10/12/21 3:04 AM, Dmitry Kozlyuk wrote:
> Mempool is a generic allocator that is not necessarily used for device
> IO operations and its memory for DMA. Add MEMPOOL_F_NON_IO flag to mark
> such mempools.
> Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>
> ---
>  doc/guides/rel_notes/release_21_11.rst | 3 +++
>  lib/mempool/rte_mempool.h              | 4 ++++
>  2 files changed, 7 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> index 5036641842..dbabdc9759 100644
> --- a/doc/guides/rel_notes/release_21_11.rst
> +++ b/doc/guides/rel_notes/release_21_11.rst
> @@ -208,6 +208,9 @@ API Changes
>    the crypto/security operation. This field will be used to communicate
>    events such as soft expiry with IPsec in lookaside mode.
>  
> +* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
> +  that objects from this pool will not be used for device IO (e.g. DMA).
> +
>  
>  ABI Changes
>  -----------
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index e2bf40aa09..b48d9f89c2 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -262,6 +262,7 @@ struct rte_mempool {
>  #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
>  #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
>  #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
> +#define MEMPOOL_F_NON_IO         0x0040 /**< Not used for device IO (DMA). */

Doesn't it imply MEMPOOL_F_NO_IOVA_CONTIG?
Shouldn't it reject mempool population with not RTE_BAD_IOVA
iova parameter?

I see that it is just a hint, but just trying to make
full picture consistent.

As the second thought: isn't iova==RTE_BAD_IOVA
sufficient as a hint?

>  
>  /**
>   * @internal When debug is enabled, store some statistics.
> @@ -991,6 +992,9 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, void *);
>   *     "single-consumer". Otherwise, it is "multi-consumers".
>   *   - MEMPOOL_F_NO_IOVA_CONTIG: If set, allocated objects won't
>   *     necessarily be contiguous in IO memory.
> + *   - MEMPOOL_F_NON_IO: If set, the mempool is considered to be
> + *     never used for device IO, i.e. for DMA operations.
> + *     It's a hint to other components and does not affect the mempool behavior.
>   * @return
>   *   The pointer to the new allocated mempool, on success. NULL on error
>   *   with rte_errno set appropriately. Possible rte_errno values include:
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/4] mempool: add event callbacks
  2021-10-12  6:33       ` Andrew Rybchenko
@ 2021-10-12  9:37         ` Dmitry Kozlyuk
  2021-10-12  9:46           ` Andrew Rybchenko
  0 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-12  9:37 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: NBU-Contact-Thomas Monjalon, Matan Azrad, Olivier Matz,
	Ray Kinsella, Anatoly Burakov

> On 10/12/21 3:04 AM, Dmitry Kozlyuk wrote:
> > Data path performance can benefit if the PMD knows which memory it
> > will need to handle in advance, before the first mbuf is sent to the PMD.
> [...]
> 
> I'd like to understand why the API is internal (not experimental). I think the reasons
> should be clear from the function description.

My reasoning was that PMDs need this API while applications don't. PMDs may need to deal with any mempools and don't control their creation, while the application knows which mempools it creates and doesn't care about internal mempools PMDs might create. But maybe I was wrong and there are applications that want to use mbufs from those internal mempools for non-DPDK IO, something SPDK-like. I'll add a note about that in the description and make them experimental.
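
For illustration, a PMD-side consumer of this API might look roughly like
this (a sketch only; the pmd_* names are hypothetical, the real mlx5 version
is in patch 4/4):

static void
pmd_mempool_event_cb(enum rte_mempool_event event,
		     struct rte_mempool *mp, void *arg)
{
	struct pmd_shared_ctx *ctx = arg;	/* hypothetical PMD context */

	switch (event) {
	case RTE_MEMPOOL_EVENT_READY:
		pmd_dma_map_mempool(ctx, mp);	/* hypothetical helper */
		break;
	case RTE_MEMPOOL_EVENT_DESTROY:
		pmd_dma_unmap_mempool(ctx, mp);	/* hypothetical helper */
		break;
	}
}

	/* On device start: catch future mempools, then pick up existing ones. */
	rte_mempool_event_callback_register(pmd_mempool_event_cb, ctx);
	/* pmd_walk_cb is a hypothetical (struct rte_mempool *, void *) callback. */
	rte_mempool_walk(pmd_walk_cb, ctx);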

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/4] mempool: add event callbacks
  2021-10-12  9:37         ` Dmitry Kozlyuk
@ 2021-10-12  9:46           ` Andrew Rybchenko
  0 siblings, 0 replies; 82+ messages in thread
From: Andrew Rybchenko @ 2021-10-12  9:46 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: NBU-Contact-Thomas Monjalon, Matan Azrad, Olivier Matz,
	Ray Kinsella, Anatoly Burakov

On 10/12/21 12:37 PM, Dmitry Kozlyuk wrote:
>> On 10/12/21 3:04 AM, Dmitry Kozlyuk wrote:
>>> Data path performance can benefit if the PMD knows which memory it
>>> will need to handle in advance, before the first mbuf is sent to the PMD.
>> [...]
>>
>> I'd like to understand why the API is internal (not experimental). I think reasons
>> should be clear from function description.
> 
> My reasoning was that PMDs need this API while applications don't. PMDs may need to deal with any mempools and don't control their creation, while the application knows which mempools it creates and doesn't care about internal mempools PMDs might create. But maybe I was wrong and there are applications that want to use mbufs from those internal mempools for non-DPDK IO, something SPDK-like. I'll add a note about that in the description and make them experimental.
> 

It is a good explanation. Thanks.
Maybe it is safer to keep it internal until
we find the first external user.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/4] mempool: add non-IO flag
  2021-10-12  6:42       ` Andrew Rybchenko
@ 2021-10-12 12:40         ` Dmitry Kozlyuk
  2021-10-12 12:53           ` Andrew Rybchenko
  0 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-12 12:40 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: NBU-Contact-Thomas Monjalon, Matan Azrad, Olivier Matz

> [...]
> > +#define MEMPOOL_F_NON_IO         0x0040 /**< Not used for device IO
> (DMA). */
> 
> Doesn't it imply MEMPOOL_F_NO_IOVA_CONTIG?

Let's leave this explicit. NO_IOVA_CONTIG could result in MEMZONE_IOVA_CONTIG (although it doesn't now), which can affect how many pages are used, which may affect performance due to TLB caches.

> Shouldn't it reject mempool population with not RTE_BAD_IOVA iova
> parameter?
>
> I see that it is just a hint, but just trying to make full picture consistent.
> 
> As the second thought: isn't iova==RTE_BAD_IOVA sufficient as a hint?

1. It looks true that if RTE_BAD_IOVA is used, we can infer it's a non-IO mempool.
2. The new flag is needed, or at least handy, because otherwise how would one check this property of a mempool? Allocating a test mbuf is doable but looks like a hack. Or we can pass this information to the callback, complicating its signature. Do you think it's better?
3. Theoretically, user may want to use mempools for objects that are used for IO, but not with DPDK. In this case IOVA will be valid, but the flag can also be set.

So I'd keep the flag and also infer it for RTE_BAD_IOVA, but allow normal IOVA.
What do you think?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/4] mempool: add non-IO flag
  2021-10-12 12:40         ` Dmitry Kozlyuk
@ 2021-10-12 12:53           ` Andrew Rybchenko
  2021-10-12 13:11             ` Dmitry Kozlyuk
  0 siblings, 1 reply; 82+ messages in thread
From: Andrew Rybchenko @ 2021-10-12 12:53 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: NBU-Contact-Thomas Monjalon, Matan Azrad, Olivier Matz

On 10/12/21 3:40 PM, Dmitry Kozlyuk wrote:
>> [...]
>>> +#define MEMPOOL_F_NON_IO         0x0040 /**< Not used for device IO
>> (DMA). */
>>
>> Doesn't it imply MEMPOOL_F_NO_IOVA_CONTIG?
> 
> Let's leave this explicit. NO_IOVA_CONTIG could result in MEMZONE_IOVA_CONTIG (although it doesn't now), which can affect how many pages are used, which may affect performance due to TLB caches.

It sounds like a usage of a side effect of
MEMPOOL_F_NO_IOVA_CONTIG absence. It does not
sound good.

> 
>> Shouldn't it reject mempool population with not RTE_BAD_IOVA iova
>> parameter?
>>
>> I see that it is just a hint, but just trying to make full picture consistent.
>>
>> As the second thought: isn't iova==RTE_BAD_IOVA sufficient as a hint?
> 
> 1. It looks true that if RTE_BAD_IOVA is used, we can infer it's a non-IO mempool.
> 2. The new flag is needed, or at least handy, because otherwise how would one check this property of a mempool? Allocating a test mbuf is doable but looks like a hack. Or we can pass this information to the callback, complicating its signature. Do you think it's better?

mempool knows it when the mempool is populated.
So, it can just set the flag itself.

> 3. Theoretically, user may want to use mempools for objects that are used for IO, but not with DPDK. In this case IOVA will be valid, but the flag can also be set.

It sounds very artificial. Also in this case I guess
MEMPOOL_F_NON_IO should be clear anyway.

> 
> So I'd keep the flag and also infer it for RTE_BAD_IOVA, but allow normal IOVA.
> What do you think?
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/4] mempool: add non-IO flag
  2021-10-12 12:53           ` Andrew Rybchenko
@ 2021-10-12 13:11             ` Dmitry Kozlyuk
  0 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-12 13:11 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: NBU-Contact-Thomas Monjalon, Matan Azrad, Olivier Matz

> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: 12 October 2021 15:53
> To: Dmitry Kozlyuk <dkozlyuk@nvidia.com>; dev@dpdk.org
> Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Matan Azrad
> <matan@nvidia.com>; Olivier Matz <olivier.matz@6wind.com>
> Subject: Re: [PATCH v3 2/4] mempool: add non-IO flag
> 
> 
> On 10/12/21 3:40 PM, Dmitry Kozlyuk wrote:
> >> [...]
> >>> +#define MEMPOOL_F_NON_IO         0x0040 /**< Not used for device IO
> >> (DMA). */
> >>
> >> Doesn't it imply MEMPOOL_F_NO_IOVA_CONTIG?
> >
> > Let's leave this explicit. NO_IOVA_CONTIG could result in
> MEMZONE_IOVA_CONTIG (although it doesn't now), which can affect how
> many pages are used, which may affect performance due to TLB caches.
> 
> It sounds like a usage of a side effect of MEMPOOL_F_NO_IOVA_CONTIG
> absence. It does not sound good.

I agree, but my point is that behavior should not change when specifying a hint flag.
NO_IOVA_CONTIG => NON_IO is feasible; NON_IO => NO_IOVA_CONTIG is against
the declared NON_IO properties.

> 
> >
> >> Shouldn't it reject mempool population with not RTE_BAD_IOVA iova
> >> parameter?
> >>
> >> I see that it is just a hint, but just trying to make full picture consistent.
> >>
> >> As the second thought: isn't iova==RTE_BAD_IOVA sufficient as a hint?
> >
> > 1. It looks true that if RTE_BAD_IOVA is used, we can infer it's a non-IO
> > mempool.
> > 2. The new flag is needed, or at least handy, because otherwise how would one
> > check this property of a mempool? Allocating a test mbuf is doable but looks
> > like a hack. Or we could pass this information to the callback, complicating
> > its signature. Do you think that would be better?
> 
> The mempool library knows this when the mempool is populated,
> so it can just set the flag itself.

Of course. I'm only arguing that this flag is needed to analyze mempool
properties, even if the user isn't supposed to set it themselves.
Looks like we agree on this one.

> > 3. Theoretically, a user may want to use mempools for objects that are used
> > for IO, but not with DPDK. In this case the IOVA will be valid, but the flag
> > can also be set.
> 
> It sounds very artificial.
> Also in this case I guess MEMPOOL_F_NON_IO should be clear anyway.

I see NON_IO as a hint to DPDK components; I'm not sure about non-DPDK ones.
But since there is indeed no such use case, the flag doesn't need to be exposed now.

To summarize:
1. MEMPOOL_F_NON_IO is considered internal, like MEMPOOL_F_POOL_CREATED.
2. It is set automatically (see the sketch below):
a) for MEMPOOL_F_NO_IOVA_CONTIG mempools;
b) if RTE_BAD_IOVA is used to populate the mempool.
I doubt that HINT should be added to the name, because it's not a hint, it's a conclusion.
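
A minimal sketch of the inference in point 2, assuming it lives in the
populate path (simplified, not the exact hunks):

	/* Inside rte_mempool_populate_iova(), once the chunk is accepted: */
	if (iova == RTE_BAD_IOVA)
		mp->flags |= MEMPOOL_F_NON_IO;

	/*
	 * Case (a) reduces to case (b) in the standard populate paths:
	 * rte_mempool_populate_default() and rte_mempool_populate_virt()
	 * pass RTE_BAD_IOVA when MEMPOOL_F_NO_IOVA_CONTIG is set.
	 */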

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v4 0/4] net/mlx5: implicit mempool registration
  2021-10-12  0:04   ` [dpdk-dev] [PATCH v3 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                       ` (3 preceding siblings ...)
  2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
@ 2021-10-13 11:01     ` Dmitry Kozlyuk
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks Dmitry Kozlyuk
                         ` (4 more replies)
  4 siblings, 5 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-13 11:01 UTC (permalink / raw)
  To: dev

MLX5 hardware has its internal IOMMU where PMD registers the memory.
On the data path, PMD translates VA into a key consumed by the device
IOMMU.  It is impractical for the PMD to register all allocated memory
because of increased lookup cost both in HW and SW.  Most often mbuf
memory comes from mempools, so if PMD tracks them, it can almost always
have mbuf memory registered before an mbuf hits the PMD. This patchset
adds such tracking in the PMD and internal API to support it.

Please see [1] for the discussion of patch 2/4
and how it can be useful outside of the MLX5 PMD.

[1]: http://inbox.dpdk.org/dev/CH0PR12MB509112FADB778AB28AF3771DB9F99@CH0PR12MB5091.namprd12.prod.outlook.com/

v4: (Andrew)
    1. Improve mempool event callbacks unit tests and documentation.
    2. Make MEMPOOL_F_NON_IO internal and automatically inferred.
       Add unit tests for the inference logic.
v3: Improve wording and naming; fix typos (Thomas).
v2 (internal review and testing):
    1. Change tracked mempool event from being created (CREATE) to being
       fully populated (READY), which is the state PMD is interested in.
    2. Unit test the new mempool callback API.
    3. Remove bogus "error" messages in normal conditions.
    4. Fixes in PMD.

Dmitry Kozlyuk (4):
  mempool: add event callbacks
  mempool: add non-IO flag
  common/mlx5: add mempool registration facilities
  net/mlx5: support mempool registration

 app/test/test_mempool.c                | 285 ++++++++++++
 doc/guides/nics/mlx5.rst               |  13 +
 doc/guides/rel_notes/release_21_11.rst |   9 +
 drivers/common/mlx5/mlx5_common_mp.c   |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h   |  14 +
 drivers/common/mlx5/mlx5_common_mr.c   | 580 +++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h   |  17 +
 drivers/common/mlx5/version.map        |   5 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 ++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++
 drivers/net/mlx5/mlx5.h                |  10 +
 drivers/net/mlx5/mlx5_mr.c             | 120 ++---
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 +-
 drivers/net/mlx5/mlx5_rxq.c            |  13 +
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++-
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 lib/mempool/rte_mempool.c              | 139 ++++++
 lib/mempool/rte_mempool.h              |  66 +++
 lib/mempool/version.map                |   8 +
 21 files changed, 1514 insertions(+), 116 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks
  2021-10-13 11:01     ` [dpdk-dev] [PATCH v4 0/4] net/mlx5: implicit " Dmitry Kozlyuk
@ 2021-10-13 11:01       ` Dmitry Kozlyuk
  2021-10-15  8:52         ` Andrew Rybchenko
  2021-10-15 12:12         ` Olivier Matz
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag Dmitry Kozlyuk
                         ` (3 subsequent siblings)
  4 siblings, 2 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-13 11:01 UTC (permalink / raw)
  To: dev
  Cc: Andrew Rybchenko, Matan Azrad, Olivier Matz, Ray Kinsella,
	Anatoly Burakov

Data path performance can benefit if the PMD knows which memory it will
need to handle in advance, before the first mbuf is sent to the PMD.
It is impractical, however, to consider all allocated memory for this
purpose. Most often mbuf memory comes from mempools that can come and
go. PMD can enumerate existing mempools on device start, but it also
needs to track creation and destruction of mempools after the forwarding
starts but before an mbuf from the new mempool is sent to the device.

Add an API to register callback for mempool life cycle events:
* rte_mempool_event_callback_register()
* rte_mempool_event_callback_unregister()
Currently tracked events are:
* RTE_MEMPOOL_EVENT_READY (after populating a mempool)
* RTE_MEMPOOL_EVENT_DESTROY (before freeing a mempool)
Provide a unit test for the new API.
The new API is internal, because it is primarily demanded by PMDs that
may need to deal with any mempool and do not control mempool creation,
while an application, on the other hand, knows which mempools it creates
and doesn't care about the internal mempools that PMDs might create.
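
For illustration only, a rough sketch of how a PMD could consume this API;
the driver-side names below are assumed and are not part of this patch:

#include <rte_errno.h>
#include <rte_mempool.h>

struct pmd_priv;	/* hypothetical driver context */
void pmd_dma_map_mempool(struct pmd_priv *priv, struct rte_mempool *mp);
void pmd_dma_unmap_mempool(struct pmd_priv *priv, struct rte_mempool *mp);

static void
pmd_mempool_event_cb(enum rte_mempool_event event,
		     struct rte_mempool *mp, void *user_data)
{
	struct pmd_priv *priv = user_data;

	if (event == RTE_MEMPOOL_EVENT_READY)
		pmd_dma_map_mempool(priv, mp);		/* hypothetical helper */
	else if (event == RTE_MEMPOOL_EVENT_DESTROY)
		pmd_dma_unmap_mempool(priv, mp);	/* hypothetical helper */
}

static int
pmd_dev_start(struct pmd_priv *priv)
{
	/* -EEXIST only means this (callback, priv) pair is already registered. */
	if (rte_mempool_event_callback_register(pmd_mempool_event_cb, priv) < 0 &&
	    rte_errno != EEXIST)
		return -rte_errno;
	return 0;
}

static void
pmd_dev_close(struct pmd_priv *priv)
{
	rte_mempool_event_callback_unregister(pmd_mempool_event_cb, priv);
}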

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 app/test/test_mempool.c   | 209 ++++++++++++++++++++++++++++++++++++++
 lib/mempool/rte_mempool.c | 137 +++++++++++++++++++++++++
 lib/mempool/rte_mempool.h |  61 +++++++++++
 lib/mempool/version.map   |   8 ++
 4 files changed, 415 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index 7675a3e605..bc0cc9ed48 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -14,6 +14,7 @@
 #include <rte_common.h>
 #include <rte_log.h>
 #include <rte_debug.h>
+#include <rte_errno.h>
 #include <rte_memory.h>
 #include <rte_launch.h>
 #include <rte_cycles.h>
@@ -471,6 +472,206 @@ test_mp_mem_init(struct rte_mempool *mp,
 	data->ret = 0;
 }
 
+struct test_mempool_events_data {
+	struct rte_mempool *mp;
+	enum rte_mempool_event event;
+	bool invoked;
+};
+
+static void
+test_mempool_events_cb(enum rte_mempool_event event,
+		       struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_data *data = user_data;
+
+	data->mp = mp;
+	data->event = event;
+	data->invoked = true;
+}
+
+static int
+test_mempool_events(int (*populate)(struct rte_mempool *mp))
+{
+	static const size_t CB_NUM = 3;
+	static const size_t MP_NUM = 2;
+
+	struct test_mempool_events_data data[CB_NUM];
+	struct rte_mempool *mp[MP_NUM];
+	char name[RTE_MEMPOOL_NAMESIZE];
+	size_t i, j;
+	int ret;
+
+	for (i = 0; i < CB_NUM; i++) {
+		ret = rte_mempool_event_callback_register
+				(test_mempool_events_cb, &data[i]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to register the callback %zu: %s",
+				      i, rte_strerror(rte_errno));
+	}
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb, mp);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback");
+	/* NULL argument has no special meaning in this API. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    NULL);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback with NULL argument");
+
+	/* Create mempool 0 that will be observed by all callbacks. */
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty0");
+	mp[0] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[0], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	for (j = 0; j < CB_NUM; j++)
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, false,
+				      "Callback %zu invoked on %s mempool creation",
+				      j, name);
+
+	rte_mempool_set_ops_byname(mp[0], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[0]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(rte_errno));
+	for (j = 0; j < CB_NUM; j++) {
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					"Callback %zu not invoked on mempool %s population",
+					j, name);
+		RTE_TEST_ASSERT_EQUAL(data[j].event,
+					RTE_MEMPOOL_EVENT_READY,
+					"Wrong callback invoked, expected READY");
+		RTE_TEST_ASSERT_EQUAL(data[j].mp, mp[0],
+					"Callback %zu invoked for a wrong mempool instead of %s",
+					j, name);
+	}
+
+	/* Check that unregistered callback 0 observes no events. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    &data[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister callback 0: %s",
+			      rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty1");
+	mp[1] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[1], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp[1], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[1]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[1]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+			      "Unregistered callback 0 invoked on %s mempool population",
+			      name);
+
+	for (i = 0; i < MP_NUM; i++) {
+		memset(&data, 0, sizeof(data));
+		sprintf(name, "empty%zu", i);
+		rte_mempool_free(mp[i]);
+		for (j = 1; j < CB_NUM; j++) {
+			RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					      "Callback %zu not invoked on mempool %s destruction",
+					      j, name);
+			RTE_TEST_ASSERT_EQUAL(data[j].event,
+					      RTE_MEMPOOL_EVENT_DESTROY,
+					      "Wrong callback invoked, expected DESTROY");
+			RTE_TEST_ASSERT_EQUAL(data[j].mp, mp[i],
+					      "Callback %zu invoked for a wrong mempool instead of %s",
+					      j, name);
+		}
+		RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+				      "Unregistered callback 0 invoked on %s mempool destruction",
+				      name);
+	}
+
+	for (j = 1; j < CB_NUM; j++) {
+		ret = rte_mempool_event_callback_unregister
+					(test_mempool_events_cb, &data[j]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister the callback %zu: %s",
+				      j, rte_strerror(rte_errno));
+	}
+	return 0;
+}
+
+struct test_mempool_events_safety_data {
+	bool invoked;
+	int (*api_func)(rte_mempool_event_callback *func, void *user_data);
+	rte_mempool_event_callback *cb_func;
+	void *cb_user_data;
+	int ret;
+};
+
+static void
+test_mempool_events_safety_cb(enum rte_mempool_event event,
+			      struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_safety_data *data = user_data;
+
+	RTE_SET_USED(event);
+	RTE_SET_USED(mp);
+	data->invoked = true;
+	data->ret = data->api_func(data->cb_func, data->cb_user_data);
+}
+
+static int
+test_mempool_events_safety(void)
+{
+	struct test_mempool_events_data data;
+	struct test_mempool_events_safety_data sdata[2];
+	struct rte_mempool *mp;
+	size_t i;
+	int ret;
+
+	/* removes itself */
+	sdata[0].api_func = rte_mempool_event_callback_unregister;
+	sdata[0].cb_func = test_mempool_events_safety_cb;
+	sdata[0].cb_user_data = &sdata[0];
+	sdata[0].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[0]);
+	/* inserts a callback after itself */
+	sdata[1].api_func = rte_mempool_event_callback_register;
+	sdata[1].cb_func = test_mempool_events_cb;
+	sdata[1].cb_user_data = &data;
+	sdata[1].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[1]);
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(rte_errno));
+
+	RTE_TEST_ASSERT_EQUAL(sdata[0].ret, 0, "Callback failed to unregister itself: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, 0, "Failed to insert a new callback: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data.invoked, false,
+			      "Inserted callback is invoked on mempool population");
+
+	memset(&data, 0, sizeof(data));
+	sdata[0].invoked = false;
+	rte_mempool_free(mp);
+	RTE_TEST_ASSERT_EQUAL(sdata[0].invoked, false,
+			      "Callback that unregistered itself was called");
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, -EEXIST,
+			      "New callback inserted twice");
+	RTE_TEST_ASSERT_EQUAL(data.invoked, true,
+			      "Inserted callback is not invoked on mempool destruction");
+
+	/* cleanup, don't care which callbacks are already removed */
+	rte_mempool_event_callback_unregister(test_mempool_events_cb, &data);
+	for (i = 0; i < RTE_DIM(sdata); i++)
+		rte_mempool_event_callback_unregister
+						(test_mempool_events_safety_cb,
+						 &sdata[i]);
+	return 0;
+}
+
 static int
 test_mempool(void)
 {
@@ -645,6 +846,14 @@ test_mempool(void)
 	if (test_mempool_basic(default_pool, 1) < 0)
 		GOTO_ERR(ret, err);
 
+	/* test mempool event callbacks */
+	if (test_mempool_events(rte_mempool_populate_default) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events(rte_mempool_populate_anon) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events_safety() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index c5f859ae71..51c0ba2931 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -42,6 +42,18 @@ static struct rte_tailq_elem rte_mempool_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_mempool_tailq)
 
+TAILQ_HEAD(mempool_callback_list, rte_tailq_entry);
+
+static struct rte_tailq_elem callback_tailq = {
+	.name = "RTE_MEMPOOL_CALLBACK",
+};
+EAL_REGISTER_TAILQ(callback_tailq)
+
+/* Invoke all registered mempool event callbacks. */
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp);
+
 #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
 #define CALC_CACHE_FLUSHTHRESH(c)	\
 	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
@@ -360,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* Report the mempool as ready only when fully populated. */
+	if (mp->populated_size >= mp->size)
+		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
+
 	rte_mempool_trace_populate_iova(mp, vaddr, iova, len, free_cb, opaque);
 	return i;
 
@@ -722,6 +738,7 @@ rte_mempool_free(struct rte_mempool *mp)
 	}
 	rte_mcfg_tailq_write_unlock();
 
+	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
 	rte_mempool_trace_free(mp);
 	rte_mempool_free_memchunks(mp);
 	rte_mempool_ops_free(mp);
@@ -1343,3 +1360,123 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
 
 	rte_mcfg_mempool_read_unlock();
 }
+
+struct mempool_callback {
+	rte_mempool_event_callback *func;
+	void *user_data;
+};
+
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te;
+	void *tmp_te;
+
+	rte_mcfg_tailq_read_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback *cb = te->data;
+		rte_mcfg_tailq_read_unlock();
+		cb->func(event, mp, cb->user_data);
+		rte_mcfg_tailq_read_lock();
+	}
+	rte_mcfg_tailq_read_unlock();
+}
+
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback *cb;
+	void *tmp_te;
+	int ret;
+
+	if (func == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	rte_mcfg_mempool_read_lock();
+	rte_mcfg_tailq_write_lock();
+
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback *cb =
+					(struct mempool_callback *)te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			ret = -EEXIST;
+			goto exit;
+		}
+	}
+
+	te = rte_zmalloc("MEMPOOL_TAILQ_ENTRY", sizeof(*te), 0);
+	if (te == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback tailq entry!\n");
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb = rte_malloc("MEMPOOL_EVENT_CALLBACK", sizeof(*cb), 0);
+	if (cb == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback!\n");
+		rte_free(te);
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb->func = func;
+	cb->user_data = user_data;
+	te->data = cb;
+	TAILQ_INSERT_TAIL(list, te, next);
+	ret = 0;
+
+exit:
+	rte_mcfg_tailq_write_unlock();
+	rte_mcfg_mempool_read_unlock();
+	rte_errno = -ret;
+	return ret;
+}
+
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback *cb;
+	int ret;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		rte_errno = EPERM;
+		return -1;
+	}
+
+	rte_mcfg_mempool_read_lock();
+	rte_mcfg_tailq_write_lock();
+	ret = -ENOENT;
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH(te, list, next) {
+		cb = (struct mempool_callback *)te->data;
+		if (cb->func == func && cb->user_data == user_data)
+			break;
+	}
+	if (te != NULL) {
+		TAILQ_REMOVE(list, te, next);
+		ret = 0;
+	}
+	rte_mcfg_tailq_write_unlock();
+	rte_mcfg_mempool_read_unlock();
+
+	if (ret == 0) {
+		rte_free(te);
+		rte_free(cb);
+	}
+	rte_errno = -ret;
+	return ret;
+}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index f57ecbd6fc..663123042f 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1774,6 +1774,67 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
 int
 rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
 
+/**
+ * Mempool event type.
+ * @internal
+ */
+enum rte_mempool_event {
+	/** Occurs after a mempool is successfully populated. */
+	RTE_MEMPOOL_EVENT_READY = 0,
+	/** Occurs before destruction of a mempool begins. */
+	RTE_MEMPOOL_EVENT_DESTROY = 1,
+};
+
+/**
+ * @internal
+ * Mempool event callback.
+ *
+ * rte_mempool_event_callback_register() may be called from within the callback,
+ * but the callbacks registered this way will not be invoked for the same event.
+ * rte_mempool_event_callback_unregister() may only be safely called
+ * to remove the running callback.
+ */
+typedef void (rte_mempool_event_callback)(
+		enum rte_mempool_event event,
+		struct rte_mempool *mp,
+		void *user_data);
+
+/**
+ * @internal
+ * Register a callback invoked on mempool life cycle event.
+ * Callbacks will be invoked in the process that creates the mempool.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data);
+
+/**
+ * @internal
+ * Unregister a callback added with rte_mempool_event_callback_register().
+ * @p func and @p user_data must exactly match registration parameters.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/mempool/version.map b/lib/mempool/version.map
index 9f77da6fff..1b7d7c5456 100644
--- a/lib/mempool/version.map
+++ b/lib/mempool/version.map
@@ -64,3 +64,11 @@ EXPERIMENTAL {
 	__rte_mempool_trace_ops_free;
 	__rte_mempool_trace_set_ops_byname;
 };
+
+INTERNAL {
+	global:
+
+	# added in 21.11
+	rte_mempool_event_callback_register;
+	rte_mempool_event_callback_unregister;
+};
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-13 11:01     ` [dpdk-dev] [PATCH v4 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-10-13 11:01       ` Dmitry Kozlyuk
  2021-10-15  9:01         ` Andrew Rybchenko
                           ` (2 more replies)
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
                         ` (2 subsequent siblings)
  4 siblings, 3 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-13 11:01 UTC (permalink / raw)
  To: dev; +Cc: Andrew Rybchenko, Matan Azrad, Olivier Matz

Mempool is a generic allocator that is not necessarily used for device
IO operations, and its memory is not necessarily used for DMA.
Add the MEMPOOL_F_NON_IO flag to mark such mempools automatically
if their objects are not contiguous or IOVA is not available.
Components can inspect this flag in order to optimize their memory
management.
Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html
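
For illustration only, a rough sketch of how a component could consume the
hint (the helper name below is assumed):

#include <rte_mempool.h>

int drv_dma_map_mempool(void *pd, struct rte_mempool *mp);	/* hypothetical */

static int
drv_maybe_map_mempool(void *pd, struct rte_mempool *mp)
{
	/*
	 * The hint promises that no object of this mempool will be used
	 * for device IO, so DMA mapping (MR creation) can be skipped.
	 */
	if (mp->flags & MEMPOOL_F_NON_IO)
		return 0;
	return drv_dma_map_mempool(pd, mp);
}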

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 app/test/test_mempool.c                | 76 ++++++++++++++++++++++++++
 doc/guides/rel_notes/release_21_11.rst |  3 +
 lib/mempool/rte_mempool.c              |  2 +
 lib/mempool/rte_mempool.h              |  5 ++
 4 files changed, 86 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index bc0cc9ed48..15146dd737 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -672,6 +672,74 @@ test_mempool_events_safety(void)
 	return 0;
 }
 
+static int
+test_mempool_flag_non_io_set_when_no_iova_contig_set(void)
+{
+	struct rte_mempool *mp;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, MEMPOOL_F_NO_IOVA_CONTIG);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(), NULL);
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when NO_IOVA_CONTIG is set");
+	rte_mempool_free(mp);
+	return 0;
+}
+
+static int
+test_mempool_flag_non_io_set_when_populated_with_bad_iova(void)
+{
+	void *addr;
+	size_t size = 1 << 16;
+	struct rte_mempool *mp;
+	int ret;
+
+	addr = rte_malloc("test_mempool", size, 0);
+	RTE_TEST_ASSERT_NOT_NULL(addr, "Cannot allocate memory");
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	ret = rte_mempool_populate_iova(mp, addr, RTE_BAD_IOVA, size,
+					NULL, NULL);
+	/* The flag must be inferred even if population isn't full. */
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when mempool is populated with RTE_BAD_IOVA");
+	rte_mempool_free(mp);
+	rte_free(addr);
+	return 0;
+}
+
+static int
+test_mempool_flag_non_io_unset_by_default(void)
+{
+	struct rte_mempool *mp;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is set by default");
+	rte_mempool_free(mp);
+	return 0;
+}
+
 static int
 test_mempool(void)
 {
@@ -854,6 +922,14 @@ test_mempool(void)
 	if (test_mempool_events_safety() < 0)
 		GOTO_ERR(ret, err);
 
+	/* test NON_IO flag inference */
+	if (test_mempool_flag_non_io_unset_by_default() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_set_when_no_iova_contig_set() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_set_when_populated_with_bad_iova() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index f643a61f44..74e0e6f495 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -226,6 +226,9 @@ API Changes
   the crypto/security operation. This field will be used to communicate
   events such as soft expiry with IPsec in lookaside mode.
 
+* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
+  that objects from this pool will not be used for device IO (e.g. DMA).
+
 
 ABI Changes
 -----------
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 51c0ba2931..2204f140b3 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -371,6 +371,8 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
+	if (iova == RTE_BAD_IOVA)
+		mp->flags |= MEMPOOL_F_NON_IO;
 
 	/* Report the mempool as ready only when fully populated. */
 	if (mp->populated_size >= mp->size)
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 663123042f..029b62a650 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -262,6 +262,8 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
+#define MEMPOOL_F_NON_IO         0x0040
+		/**< Internal: pool is not usable for device IO (DMA). */
 
 /**
  * @internal When debug is enabled, store some statistics.
@@ -991,6 +993,9 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, void *);
  *     "single-consumer". Otherwise, it is "multi-consumers".
  *   - MEMPOOL_F_NO_IOVA_CONTIG: If set, allocated objects won't
  *     necessarily be contiguous in IO memory.
+ *   - MEMPOOL_F_NON_IO: If set, the mempool is considered to be
+ *     never used for device IO, i.e. for DMA operations.
+ *     It's a hint to other components and does not affect the mempool behavior.
  * @return
  *   The pointer to the new allocated mempool, on success. NULL on error
  *   with rte_errno set appropriately. Possible rte_errno values include:
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v4 3/4] common/mlx5: add mempool registration facilities
  2021-10-13 11:01     ` [dpdk-dev] [PATCH v4 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks Dmitry Kozlyuk
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag Dmitry Kozlyuk
@ 2021-10-13 11:01       ` Dmitry Kozlyuk
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
  2021-10-15 16:02       ` [dpdk-dev] [PATCH v5 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-13 11:01 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko, Ray Kinsella, Anatoly Burakov

Add an internal API to register mempools, that is, to create memory
regions (MR) for their memory and store them in a separate database.
The implementation deals with multi-process, so that class drivers
don't need to. Each protection domain has its own database. Memory
regions can be shared within a database if they represent a single
hugepage covering one or more mempools entirely.

Add an internal API to look up an MR key for an address that belongs
to a known mempool. It is the responsibility of a class driver
to extract the mempool from an mbuf.
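
For illustration only, a rough sketch of how a class driver could use the
lookup half on its data path (the wrapper name is assumed; the real Tx
integration follows in patch 4/4):

#include <rte_mbuf.h>

#include "mlx5_common_mr.h"

static __rte_always_inline uint32_t
drv_mbuf2lkey(struct mlx5_mr_share_cache *share_cache,
	      struct mlx5_mr_ctrl *mr_ctrl, struct rte_mbuf *mb)
{
	/*
	 * The class driver resolves the owning mempool itself;
	 * for a direct mbuf it is simply mb->pool.
	 */
	struct rte_mempool *mp = mb->pool;
	uintptr_t addr = (uintptr_t)mb->buf_addr;

	/*
	 * Bottom-half lookup: consults the per-PD mempool registration
	 * database and refills the per-queue cache on success.
	 */
	return mlx5_mr_mempool2mr_bh(share_cache, mr_ctrl, mp, addr);
}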

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 drivers/common/mlx5/mlx5_common_mp.c |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h |  14 +
 drivers/common/mlx5/mlx5_common_mr.c | 580 +++++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h |  17 +
 drivers/common/mlx5/version.map      |   5 +
 5 files changed, 666 insertions(+)

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
index 673a7c31de..6dfc5535e0 100644
--- a/drivers/common/mlx5/mlx5_common_mp.c
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -54,6 +54,56 @@ mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
 	return ret;
 }
 
+/**
+ * @param mp_id
+ *   ID of the MP process.
+ * @param share_cache
+ *   Shared MR cache.
+ * @param pd
+ *   Protection domain.
+ * @param mempool
+ *   Mempool to register or unregister.
+ * @param reg
+ *   True to register the mempool, False to unregister.
+ */
+int
+mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_arg_mempool_reg *arg = &req->args.mempool_reg;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	enum mlx5_mp_req_type type;
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	type = reg ? MLX5_MP_REQ_MEMPOOL_REGISTER :
+		     MLX5_MP_REQ_MEMPOOL_UNREGISTER;
+	mp_init_msg(mp_id, &mp_req, type);
+	arg->share_cache = share_cache;
+	arg->pd = pd;
+	arg->mempool = mempool;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	mlx5_free(mp_rep.msgs);
+	return ret;
+}
+
 /**
  * Request Verbs queue state modification to the primary process.
  *
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
index 6829141fc7..527bf3cad8 100644
--- a/drivers/common/mlx5/mlx5_common_mp.h
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -14,6 +14,8 @@
 enum mlx5_mp_req_type {
 	MLX5_MP_REQ_VERBS_CMD_FD = 1,
 	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_MEMPOOL_REGISTER,
+	MLX5_MP_REQ_MEMPOOL_UNREGISTER,
 	MLX5_MP_REQ_START_RXTX,
 	MLX5_MP_REQ_STOP_RXTX,
 	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
@@ -33,6 +35,12 @@ struct mlx5_mp_arg_queue_id {
 	uint16_t queue_id; /* DPDK queue ID. */
 };
 
+struct mlx5_mp_arg_mempool_reg {
+	struct mlx5_mr_share_cache *share_cache;
+	void *pd; /* NULL for MLX5_MP_REQ_MEMPOOL_UNREGISTER */
+	struct rte_mempool *mempool;
+};
+
 /* Pameters for IPC. */
 struct mlx5_mp_param {
 	enum mlx5_mp_req_type type;
@@ -41,6 +49,8 @@ struct mlx5_mp_param {
 	RTE_STD_C11
 	union {
 		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_mempool_reg mempool_reg;
+		/* MLX5_MP_REQ_MEMPOOL_(UN)REGISTER */
 		struct mlx5_mp_arg_queue_state_modify state_modify;
 		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
 		struct mlx5_mp_arg_queue_id queue_id;
@@ -91,6 +101,10 @@ void mlx5_mp_uninit_secondary(const char *name);
 __rte_internal
 int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
 __rte_internal
+int mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg);
+__rte_internal
 int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
 				   struct mlx5_mp_arg_queue_state_modify *sm);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
index 98fe8698e2..2e039a4e70 100644
--- a/drivers/common/mlx5/mlx5_common_mr.c
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -2,7 +2,10 @@
  * Copyright 2016 6WIND S.A.
  * Copyright 2020 Mellanox Technologies, Ltd
  */
+#include <stddef.h>
+
 #include <rte_eal_memconfig.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_mempool.h>
 #include <rte_malloc.h>
@@ -21,6 +24,29 @@ struct mr_find_contig_memsegs_data {
 	const struct rte_memseg_list *msl;
 };
 
+/* Virtual memory range. */
+struct mlx5_range {
+	uintptr_t start;
+	uintptr_t end;
+};
+
+/** Memory region for a mempool. */
+struct mlx5_mempool_mr {
+	struct mlx5_pmd_mr pmd_mr;
+	uint32_t refcnt; /**< Number of mempools sharing this MR. */
+};
+
+/* Mempool registration. */
+struct mlx5_mempool_reg {
+	LIST_ENTRY(mlx5_mempool_reg) next;
+	/** Registered mempool, used to designate registrations. */
+	struct rte_mempool *mp;
+	/** Memory regions for the address ranges of the mempool. */
+	struct mlx5_mempool_mr *mrs;
+	/** Number of memory regions. */
+	unsigned int mrs_n;
+};
+
 /**
  * Expand B-tree table to a given size. Can't be called with holding
  * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
@@ -1191,3 +1217,557 @@ mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
 	rte_rwlock_read_unlock(&share_cache->rwlock);
 #endif
 }
+
+static int
+mlx5_range_compare_start(const void *lhs, const void *rhs)
+{
+	const struct mlx5_range *r1 = lhs, *r2 = rhs;
+
+	if (r1->start > r2->start)
+		return 1;
+	else if (r1->start < r2->start)
+		return -1;
+	return 0;
+}
+
+static void
+mlx5_range_from_mempool_chunk(struct rte_mempool *mp, void *opaque,
+			      struct rte_mempool_memhdr *memhdr,
+			      unsigned int idx)
+{
+	struct mlx5_range *ranges = opaque, *range = &ranges[idx];
+	uint64_t page_size = rte_mem_page_size();
+
+	RTE_SET_USED(mp);
+	range->start = RTE_ALIGN_FLOOR((uintptr_t)memhdr->addr, page_size);
+	range->end = RTE_ALIGN_CEIL(range->start + memhdr->len, page_size);
+}
+
+/**
+ * Get VA-contiguous ranges of the mempool memory.
+ * Each range start and end is aligned to the system page size.
+ *
+ * @param[in] mp
+ *   Analyzed mempool.
+ * @param[out] out
+ *   Receives the ranges, caller must release it with free().
+ * @param[out] out_n
+ *   Receives the number of @p out elements.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_get_mempool_ranges(struct rte_mempool *mp, struct mlx5_range **out,
+			unsigned int *out_n)
+{
+	struct mlx5_range *chunks;
+	unsigned int chunks_n = mp->nb_mem_chunks, contig_n, i;
+
+	/* Collect page-aligned memory ranges of the mempool. */
+	chunks = calloc(sizeof(chunks[0]), chunks_n);
+	if (chunks == NULL)
+		return -1;
+	rte_mempool_mem_iter(mp, mlx5_range_from_mempool_chunk, chunks);
+	/* Merge adjacent chunks and place them at the beginning. */
+	qsort(chunks, chunks_n, sizeof(chunks[0]), mlx5_range_compare_start);
+	contig_n = 1;
+	for (i = 1; i < chunks_n; i++)
+		if (chunks[i - 1].end != chunks[i].start) {
+			chunks[contig_n - 1].end = chunks[i - 1].end;
+			chunks[contig_n] = chunks[i];
+			contig_n++;
+		}
+	/* Extend the last contiguous chunk to the end of the mempool. */
+	chunks[contig_n - 1].end = chunks[i - 1].end;
+	*out = chunks;
+	*out_n = contig_n;
+	return 0;
+}
+
+/**
+ * Analyze mempool memory to select memory ranges to register.
+ *
+ * @param[in] mp
+ *   Mempool to analyze.
+ * @param[out] out
+ *   Receives memory ranges to register, aligned to the system page size.
+ *   The caller must release them with free().
+ * @param[out] out_n
+ *   Receives the number of @p out items.
+ * @param[out] share_hugepage
+ *   Receives True if the entire pool resides within a single hugepage.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_mempool_reg_analyze(struct rte_mempool *mp, struct mlx5_range **out,
+			 unsigned int *out_n, bool *share_hugepage)
+{
+	struct mlx5_range *ranges = NULL;
+	unsigned int i, ranges_n = 0;
+	struct rte_memseg_list *msl;
+
+	if (mlx5_get_mempool_ranges(mp, &ranges, &ranges_n) < 0) {
+		DRV_LOG(ERR, "Cannot get address ranges for mempool %s",
+			mp->name);
+		return -1;
+	}
+	/* Check if the hugepage of the pool can be shared. */
+	*share_hugepage = false;
+	msl = rte_mem_virt2memseg_list((void *)ranges[0].start);
+	if (msl != NULL) {
+		uint64_t hugepage_sz = 0;
+
+		/* Check that all ranges are on pages of the same size. */
+		for (i = 0; i < ranges_n; i++) {
+			if (hugepage_sz != 0 && hugepage_sz != msl->page_sz)
+				break;
+			hugepage_sz = msl->page_sz;
+		}
+		if (i == ranges_n) {
+			/*
+			 * If the entire pool is within one hugepage,
+			 * combine all ranges into one of the hugepage size.
+			 */
+			uintptr_t reg_start = ranges[0].start;
+			uintptr_t reg_end = ranges[ranges_n - 1].end;
+			uintptr_t hugepage_start =
+				RTE_ALIGN_FLOOR(reg_start, hugepage_sz);
+			uintptr_t hugepage_end = hugepage_start + hugepage_sz;
+			if (reg_end < hugepage_end) {
+				ranges[0].start = hugepage_start;
+				ranges[0].end = hugepage_end;
+				ranges_n = 1;
+				*share_hugepage = true;
+			}
+		}
+	}
+	*out = ranges;
+	*out_n = ranges_n;
+	return 0;
+}
+
+/** Create a registration object for the mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_create(struct rte_mempool *mp, unsigned int mrs_n)
+{
+	struct mlx5_mempool_reg *mpr = NULL;
+
+	mpr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+			  sizeof(*mpr) + mrs_n * sizeof(mpr->mrs[0]),
+			  RTE_CACHE_LINE_SIZE, SOCKET_ID_ANY);
+	if (mpr == NULL) {
+		DRV_LOG(ERR, "Cannot allocate mempool %s registration object",
+			mp->name);
+		return NULL;
+	}
+	mpr->mp = mp;
+	mpr->mrs = (struct mlx5_mempool_mr *)(mpr + 1);
+	mpr->mrs_n = mrs_n;
+	return mpr;
+}
+
+/**
+ * Destroy a mempool registration object.
+ *
+ * @param standalone
+ *   Whether @p mpr owns its MRs exclusively, i.e. they are not shared.
+ */
+static void
+mlx5_mempool_reg_destroy(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mempool_reg *mpr, bool standalone)
+{
+	if (standalone) {
+		unsigned int i;
+
+		for (i = 0; i < mpr->mrs_n; i++)
+			share_cache->dereg_mr_cb(&mpr->mrs[i].pmd_mr);
+	}
+	mlx5_free(mpr);
+}
+
+/** Find registration object of a mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_lookup(struct mlx5_mr_share_cache *share_cache,
+			struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp)
+			break;
+	return mpr;
+}
+
+/** Increment reference counters of MRs used in the registration. */
+static void
+mlx5_mempool_reg_attach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		__atomic_add_fetch(&mpr->mrs[i].refcnt, 1, __ATOMIC_RELAXED);
+}
+
+/**
+ * Decrement reference counters of MRs used in the registration.
+ *
+ * @return True if no more references to @p mpr MRs exist, False otherwise.
+ */
+static bool
+mlx5_mempool_reg_detach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+	bool ret = false;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		ret |= __atomic_sub_fetch(&mpr->mrs[i].refcnt, 1,
+					  __ATOMIC_RELAXED) == 0;
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache,
+				 void *pd, struct rte_mempool *mp)
+{
+	struct mlx5_range *ranges = NULL;
+	struct mlx5_mempool_reg *mpr, *new_mpr;
+	unsigned int i, ranges_n;
+	bool share_hugepage;
+	int ret = -1;
+
+	/* Early check to avoid unnecessary creation of MRs. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+	if (mlx5_mempool_reg_analyze(mp, &ranges, &ranges_n,
+				     &share_hugepage) < 0) {
+		DRV_LOG(ERR, "Cannot get mempool %s memory ranges", mp->name);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	new_mpr = mlx5_mempool_reg_create(mp, ranges_n);
+	if (new_mpr == NULL) {
+		DRV_LOG(ERR,
+			"Cannot create a registration object for mempool %s in PD %p",
+			mp->name, pd);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	/*
+	 * If the entire mempool fits in a single hugepage, the MR for this
+	 * hugepage can be shared across mempools that also fit in it.
+	 */
+	if (share_hugepage) {
+		rte_rwlock_write_lock(&share_cache->rwlock);
+		LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next) {
+			if (mpr->mrs[0].pmd_mr.addr == (void *)ranges[0].start)
+				break;
+		}
+		if (mpr != NULL) {
+			new_mpr->mrs = mpr->mrs;
+			mlx5_mempool_reg_attach(new_mpr);
+			LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+					 new_mpr, next);
+		}
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		if (mpr != NULL) {
+			DRV_LOG(DEBUG, "Shared MR %#x in PD %p for mempool %s with mempool %s",
+				mpr->mrs[0].pmd_mr.lkey, pd, mp->name,
+				mpr->mp->name);
+			ret = 0;
+			goto exit;
+		}
+	}
+	for (i = 0; i < ranges_n; i++) {
+		struct mlx5_mempool_mr *mr = &new_mpr->mrs[i];
+		const struct mlx5_range *range = &ranges[i];
+		size_t len = range->end - range->start;
+
+		if (share_cache->reg_mr_cb(pd, (void *)range->start, len,
+		    &mr->pmd_mr) < 0) {
+			DRV_LOG(ERR,
+				"Failed to create an MR in PD %p for address range "
+				"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+				pd, range->start, range->end, len, mp->name);
+			break;
+		}
+		DRV_LOG(DEBUG,
+			"Created a new MR %#x in PD %p for address range "
+			"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+			mr->pmd_mr.lkey, pd, range->start, range->end, len,
+			mp->name);
+	}
+	if (i != ranges_n) {
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EINVAL;
+		goto exit;
+	}
+	/* Concurrent registration is not supposed to happen. */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr == NULL) {
+		mlx5_mempool_reg_attach(new_mpr);
+		LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+				 new_mpr, next);
+		ret = 0;
+	}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+exit:
+	free(ranges);
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_secondary(struct mlx5_mr_share_cache *share_cache,
+				   void *pd, struct rte_mempool *mp,
+				   struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, pd, mp, true);
+}
+
+/**
+ * Register the memory of a mempool in the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param pd
+ *   Protection domain object.
+ * @param mp
+ *   Mempool to register.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_register_primary(share_cache, pd, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_register_secondary(share_cache, pd, mp,
+							  mp_id);
+	default:
+		return -1;
+	}
+}
+
+static int
+mlx5_mr_mempool_unregister_primary(struct mlx5_mr_share_cache *share_cache,
+				   struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+	bool standalone = false;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp) {
+			LIST_REMOVE(mpr, next);
+			standalone = mlx5_mempool_reg_detach(mpr);
+			if (standalone)
+				/*
+				 * The unlock operation below provides a memory
+				 * barrier due to its store-release semantics.
+				 */
+				++share_cache->dev_gen;
+			break;
+		}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr == NULL) {
+		rte_errno = ENOENT;
+		return -1;
+	}
+	mlx5_mempool_reg_destroy(share_cache, mpr, standalone);
+	return 0;
+}
+
+static int
+mlx5_mr_mempool_unregister_secondary(struct mlx5_mr_share_cache *share_cache,
+				     struct rte_mempool *mp,
+				     struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, NULL, mp, false);
+}
+
+/**
+ * Unregister the memory of a mempool from the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param mp
+ *   Mempool to unregister.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_unregister_primary(share_cache, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_unregister_secondary(share_cache, mp,
+							    mp_id);
+	default:
+		return -1;
+	}
+}
+
+/**
+ * Look up an MR key by an address within a registered mempool.
+ *
+ * @param mpr
+ *   Mempool registration object.
+ * @param addr
+ *   Address within the mempool.
+ * @param entry
+ *   Bottom-half cache entry to fill.
+ *
+ * @return
+ *   MR key or UINT32_MAX on failure, which can only happen
+ *   if the address is not from within the mempool.
+ */
+static uint32_t
+mlx5_mempool_reg_addr2mr(struct mlx5_mempool_reg *mpr, uintptr_t addr,
+			 struct mr_cache_entry *entry)
+{
+	uint32_t lkey = UINT32_MAX;
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++) {
+		const struct mlx5_pmd_mr *mr = &mpr->mrs[i].pmd_mr;
+		uintptr_t mr_addr = (uintptr_t)mr->addr;
+
+		if (mr_addr <= addr) {
+			lkey = rte_cpu_to_be_32(mr->lkey);
+			entry->start = mr_addr;
+			entry->end = mr_addr + mr->len;
+			entry->lkey = lkey;
+			break;
+		}
+	}
+	return lkey;
+}
+
+/**
+ * Update bottom-half cache from the list of mempool registrations.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param entry
+ *   Pointer to an entry in the bottom-half cache to update
+ *   with the MR lkey looked up.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+static uint32_t
+mlx5_lookup_mempool_regs(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mr_ctrl *mr_ctrl,
+			 struct mr_cache_entry *entry,
+			 struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	struct mlx5_mempool_reg *mpr;
+	uint32_t lkey = UINT32_MAX;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in mempool registrations. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr != NULL)
+		lkey = mlx5_mempool_reg_addr2mr(mpr, addr, entry);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/*
+	 * Update local cache. Even if it fails, return the found entry
+	 * to update top-half cache. Next time, this entry will be found
+	 * in the global cache.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half lookup for the address from the mempool.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+uint32_t
+mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+		      struct mlx5_mr_ctrl *mr_ctrl,
+		      struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		lkey = mlx5_lookup_mempool_regs(share_cache, mr_ctrl, repl,
+						mp, addr);
+		/* Can only fail if the address is not from the mempool. */
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
index 6e465a05e9..685ac98e08 100644
--- a/drivers/common/mlx5/mlx5_common_mr.h
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -13,6 +13,7 @@
 
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_mbuf.h>
 #include <rte_memory.h>
 
 #include "mlx5_glue.h"
@@ -75,6 +76,7 @@ struct mlx5_mr_ctrl {
 } __rte_packed;
 
 LIST_HEAD(mlx5_mr_list, mlx5_mr);
+LIST_HEAD(mlx5_mempool_reg_list, mlx5_mempool_reg);
 
 /* Global per-device MR cache. */
 struct mlx5_mr_share_cache {
@@ -83,6 +85,7 @@ struct mlx5_mr_share_cache {
 	struct mlx5_mr_btree cache; /* Global MR cache table. */
 	struct mlx5_mr_list mr_list; /* Registered MR list. */
 	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+	struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */
 	mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
 	mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 } __rte_packed;
@@ -136,6 +139,10 @@ uint32_t mlx5_mr_addr2mr_bh(void *pd, struct mlx5_mp_id *mp_id,
 			    struct mlx5_mr_ctrl *mr_ctrl,
 			    uintptr_t addr, unsigned int mr_ext_memseg_en);
 __rte_internal
+uint32_t mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+			       struct mlx5_mr_ctrl *mr_ctrl,
+			       struct rte_mempool *mp, uintptr_t addr);
+__rte_internal
 void mlx5_mr_release_cache(struct mlx5_mr_share_cache *mr_cache);
 __rte_internal
 void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
@@ -179,4 +186,14 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr);
 __rte_internal
 void
 mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb);
+
+__rte_internal
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+__rte_internal
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+
 #endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index d3c5040aac..85100d5afb 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -152,4 +152,9 @@ INTERNAL {
 	mlx5_realloc;
 
 	mlx5_translate_port_name; # WINDOWS_NO_EXPORT
+
+	mlx5_mr_mempool_register;
+	mlx5_mr_mempool_unregister;
+	mlx5_mp_req_mempool_reg;
+	mlx5_mr_mempool2mr_bh;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v4 4/4] net/mlx5: support mempool registration
  2021-10-13 11:01     ` [dpdk-dev] [PATCH v4 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                         ` (2 preceding siblings ...)
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
@ 2021-10-13 11:01       ` Dmitry Kozlyuk
  2021-10-15 16:02       ` [dpdk-dev] [PATCH v5 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-13 11:01 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko

When the first port in a given protection domain (PD) starts,
install a mempool event callback for this PD and register all existing
memory regions (MR) for it. When the last port in a PD closes,
remove the callback and unregister all mempools for this PD.
This behavior can be switched off with a new devarg: mr_mempool_reg_en.

On TX slow path, i.e. when an MR key for the address of the buffer
to send is not in the local cache, first try to retrieve it from
the database of registered mempools. Direct and indirect mbufs are
supported, as well as externally attached ones from the MLX5 MPRQ feature.
Lookup in the database of non-mempool memory is used as the last resort.

RX mempools are registered regardless of the devarg value.
On the RX data path, only the local cache and the mempool database are used.
If implicit mempool registration is disabled, these mempools
are unregistered at port stop, releasing the MRs.
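
For illustration only, a rough sketch of the per-PD callback described
above (the function name and field access are assumed; the actual hooks
are in mlx5.c and mlx5_trigger.c below):

/* Invoked for every mempool that becomes ready or is being destroyed. */
static void
mlx5_pd_mempool_event_cb(enum rte_mempool_event event,
			 struct rte_mempool *mp, void *arg)
{
	struct mlx5_dev_ctx_shared *sh = arg;	/* shared per-PD context */

	switch (event) {
	case RTE_MEMPOOL_EVENT_READY:
		/* Skipped internally for MEMPOOL_F_NON_IO mempools. */
		mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp, NULL);
		break;
	case RTE_MEMPOOL_EVENT_DESTROY:
		mlx5_mr_mempool_unregister(&sh->share_cache, mp, NULL);
		break;
	}
}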

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  13 +++
 doc/guides/rel_notes/release_21_11.rst |   6 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 +++++++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h                |  10 ++
 drivers/net/mlx5/mlx5_mr.c             | 120 +++++--------------
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 ++--
 drivers/net/mlx5/mlx5_rxq.c            |  13 +++
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++++++++++--
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 12 files changed, 347 insertions(+), 116 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index bae73f42d8..106e32e1c4 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1001,6 +1001,19 @@ Driver options
 
   Enabled by default.
 
+- ``mr_mempool_reg_en`` parameter [int]
+
+  A nonzero value enables implicit registration of DMA memory of all mempools
+  except those having ``MEMPOOL_F_NON_IO``. This flag is set automatically
+  for mempools populated with non-contiguous objects or those without IOVA.
+  The effect is that when a packet from a mempool is transmitted,
+  its memory is already registered for DMA in the PMD and no registration
+  will happen on the data path. The tradeoff is extra work on the creation
+  of each mempool and increased HW resource use if some mempools
+  are not used with MLX5 devices.
+
+  Enabled by default.
+
 - ``representor`` parameter [list]
 
   This parameter can be used to instantiate DPDK Ethernet devices from
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 74e0e6f495..b4126c450f 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -159,6 +159,12 @@ New Features
   * Added tests to verify tunnel header verification in IPsec inbound.
   * Added tests to verify inner checksum.
 
+* **Updated Mellanox mlx5 driver.**
+
+  Updated the Mellanox mlx5 driver with new features and improvements, including:
+
+  * Added implicit mempool registration to avoid data path hiccups (opt-out).
+
 
 Removed Items
 -------------
diff --git a/drivers/net/mlx5/linux/mlx5_mp_os.c b/drivers/net/mlx5/linux/mlx5_mp_os.c
index 3a4aa766f8..d2ac375a47 100644
--- a/drivers/net/mlx5/linux/mlx5_mp_os.c
+++ b/drivers/net/mlx5/linux/mlx5_mp_os.c
@@ -20,6 +20,45 @@
 #include "mlx5_tx.h"
 #include "mlx5_utils.h"
 
+/**
+ * Handle a port-agnostic message.
+ *
+ * @return
+ *   0 on success, 1 when message is not port-agnostic, (-1) on error.
+ */
+static int
+mlx5_mp_os_handle_port_agnostic(const struct rte_mp_msg *mp_msg,
+				const void *peer)
+{
+	struct rte_mp_msg mp_res;
+	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
+	const struct mlx5_mp_param *param =
+		(const struct mlx5_mp_param *)mp_msg->param;
+	const struct mlx5_mp_arg_mempool_reg *mpr;
+	struct mlx5_mp_id mp_id;
+
+	switch (param->type) {
+	case MLX5_MP_REQ_MEMPOOL_REGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_register(mpr->share_cache,
+						       mpr->pd, mpr->mempool,
+						       NULL);
+		return rte_mp_reply(&mp_res, peer);
+	case MLX5_MP_REQ_MEMPOOL_UNREGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_unregister(mpr->share_cache,
+							 mpr->mempool, NULL);
+		return rte_mp_reply(&mp_res, peer);
+	default:
+		return 1;
+	}
+	return -1;
+}
+
 int
 mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
@@ -34,6 +73,11 @@ mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/* Port-agnostic messages. */
+	ret = mlx5_mp_os_handle_port_agnostic(mp_msg, peer);
+	if (ret <= 0)
+		return ret;
+	/* Port-specific messages. */
 	if (!rte_eth_dev_is_valid_port(param->port_id)) {
 		rte_errno = ENODEV;
 		DRV_LOG(ERR, "port %u invalid port ID", param->port_id);
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 3746057673..e036ed1435 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1034,8 +1034,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
-		mp_id.port_id = eth_dev->data->port_id;
-		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+		mlx5_mp_id_init(&mp_id, eth_dev->data->port_id);
 		/* Receive command fd from primary process */
 		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
@@ -2133,6 +2132,7 @@ mlx5_os_config_default(struct mlx5_dev_config *config)
 	config->txqs_inline = MLX5_ARG_UNSET;
 	config->vf_nl_en = 1;
 	config->mr_ext_memseg_en = 1;
+	config->mr_mempool_reg_en = 1;
 	config->mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	config->mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	config->dv_esw_en = 1;
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 45ccfe2784..1e1b8b736b 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -181,6 +181,9 @@
 /* Device parameter to configure allow or prevent duplicate rules pattern. */
 #define MLX5_ALLOW_DUPLICATE_PATTERN "allow_duplicate_pattern"
 
+/* Device parameter to configure implicit registration of mempool memory. */
+#define MLX5_MR_MEMPOOL_REG_EN "mr_mempool_reg_en"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1088,6 +1091,141 @@ mlx5_alloc_rxtx_uars(struct mlx5_dev_ctx_shared *sh,
 	return err;
 }
 
+/**
+ * Unregister the mempool from the protection domain.
+ *
+ * @param sh
+ *   Pointer to the device shared context.
+ * @param mp
+ *   Mempool being unregistered.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister(struct mlx5_dev_ctx_shared *sh,
+				       struct rte_mempool *mp)
+{
+	struct mlx5_mp_id mp_id;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	if (mlx5_mr_mempool_unregister(&sh->share_cache, mp, &mp_id) < 0)
+		DRV_LOG(WARNING, "Failed to unregister mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to register mempools
+ * for the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_register_cb(struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+	int ret;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	ret = mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp, &mp_id);
+	if (ret < 0 && rte_errno != EEXIST)
+		DRV_LOG(ERR, "Failed to register existing mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to unregister mempools
+ * from the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister_cb(struct rte_mempool *mp, void *arg)
+{
+	mlx5_dev_ctx_shared_mempool_unregister
+				((struct mlx5_dev_ctx_shared *)arg, mp);
+}
+
+/**
+ * Mempool life cycle callback for Ethernet devices.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   Associated mempool.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_event_cb(enum rte_mempool_event event,
+				     struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+
+	switch (event) {
+	case RTE_MEMPOOL_EVENT_READY:
+		mlx5_mp_id_init(&mp_id, 0);
+		if (mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp,
+					     &mp_id) < 0)
+			DRV_LOG(ERR, "Failed to register new mempool %s for PD %p: %s",
+				mp->name, sh->pd, rte_strerror(rte_errno));
+		break;
+	case RTE_MEMPOOL_EVENT_DESTROY:
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+		break;
+	}
+}
+
+/**
+ * Callback used when implicit mempool registration is disabled
+ * in order to track Rx mempool destruction.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   An Rx mempool registered explicitly when the port is started.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_rx_mempool_event_cb(enum rte_mempool_event event,
+					struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+
+	if (event == RTE_MEMPOOL_EVENT_DESTROY)
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+}
+
+int
+mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = priv->sh;
+	int ret;
+
+	/* Check if we only need to track Rx mempool destruction. */
+	if (!priv->config.mr_mempool_reg_en) {
+		ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+		return ret == 0 || rte_errno == EEXIST ? 0 : ret;
+	}
+	/* Callback for this shared context may be already registered. */
+	ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret != 0 && rte_errno != EEXIST)
+		return ret;
+	/* Register mempools only once for this shared context. */
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_register_cb, sh);
+	return 0;
+}
+
 /**
  * Allocate shared device context. If there is multiport device the
  * master and representors will share this context, if there is single
@@ -1287,6 +1425,8 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 void
 mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 {
+	int ret;
+
 	pthread_mutex_lock(&mlx5_dev_ctx_list_mutex);
 #ifdef RTE_LIBRTE_MLX5_DEBUG
 	/* Check the object presence in the list. */
@@ -1307,6 +1447,15 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	if (--sh->refcnt)
 		goto exit;
+	/* Stop watching for mempool events and unregister all mempools. */
+	ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret < 0 && rte_errno == ENOENT)
+		ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_unregister_cb,
+				 sh);
 	/* Remove from memory callback device list. */
 	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
 	LIST_REMOVE(sh, mem_event_cb);
@@ -1997,6 +2146,8 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
 		config->decap_en = !!tmp;
 	} else if (strcmp(MLX5_ALLOW_DUPLICATE_PATTERN, key) == 0) {
 		config->allow_duplicate_pattern = !!tmp;
+	} else if (strcmp(MLX5_MR_MEMPOOL_REG_EN, key) == 0) {
+		config->mr_mempool_reg_en = !!tmp;
 	} else {
 		DRV_LOG(WARNING, "%s: unknown parameter", key);
 		rte_errno = EINVAL;
@@ -2058,6 +2209,7 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
 		MLX5_SYS_MEM_EN,
 		MLX5_DECAP_EN,
 		MLX5_ALLOW_DUPLICATE_PATTERN,
+		MLX5_MR_MEMPOOL_REG_EN,
 		NULL,
 	};
 	struct rte_kvargs *kvlist;
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 3581414b78..fe533fcc81 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -155,6 +155,13 @@ struct mlx5_flow_dump_ack {
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
+/** Initialize a multi-process ID. */
+static inline void
+mlx5_mp_id_init(struct mlx5_mp_id *mp_id, uint16_t port_id)
+{
+	mp_id->port_id = port_id;
+	strlcpy(mp_id->name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+}
 
 LIST_HEAD(mlx5_dev_list, mlx5_dev_ctx_shared);
 
@@ -270,6 +277,8 @@ struct mlx5_dev_config {
 	unsigned int dv_miss_info:1; /* restore packet after partial hw miss */
 	unsigned int allow_duplicate_pattern:1;
 	/* Allow/Prevent the duplicate rules pattern. */
+	unsigned int mr_mempool_reg_en:1;
+	/* Allow/prevent implicit mempool memory registration. */
 	struct {
 		unsigned int enabled:1; /* Whether MPRQ is enabled. */
 		unsigned int stride_num_n; /* Number of strides. */
@@ -1498,6 +1507,7 @@ struct mlx5_dev_ctx_shared *
 mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 			   const struct mlx5_dev_config *config);
 void mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh);
+int mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev);
 void mlx5_free_table_hash_list(struct mlx5_priv *priv);
 int mlx5_alloc_table_hash_list(struct mlx5_priv *priv);
 void mlx5_set_min_inline(struct mlx5_dev_spawn_data *spawn,
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 44afda731f..55d27b50b9 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -65,30 +65,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Bottom-half of LKey search on Rx.
- *
- * @param rxq
- *   Pointer to Rx queue structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-uint32_t
-mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
-{
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
-	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
-	struct mlx5_priv *priv = rxq_ctrl->priv;
-
-	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, mr_ctrl, addr,
-				  priv->config.mr_ext_memseg_en);
-}
-
 /**
  * Bottom-half of LKey search on Tx.
  *
@@ -128,9 +104,36 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 uint32_t
 mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 {
+	struct mlx5_txq_ctrl *txq_ctrl =
+		container_of(txq, struct mlx5_txq_ctrl, txq);
+	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
+	struct mlx5_priv *priv = txq_ctrl->priv;
 	uintptr_t addr = (uintptr_t)mb->buf_addr;
 	uint32_t lkey;
 
+	if (priv->config.mr_mempool_reg_en) {
+		struct rte_mempool *mp = NULL;
+		struct mlx5_mprq_buf *buf;
+
+		if (!RTE_MBUF_HAS_EXTBUF(mb)) {
+			mp = mlx5_mb2mp(mb);
+		} else if (mb->shinfo->free_cb == mlx5_mprq_buf_free_cb) {
+			/* Recover MPRQ mempool. */
+			buf = mb->shinfo->fcb_opaque;
+			mp = buf->mp;
+		}
+		if (mp != NULL) {
+			lkey = mlx5_mr_mempool2mr_bh(&priv->sh->share_cache,
+						     mr_ctrl, mp, addr);
+			/*
+			 * Lookup can only fail on invalid input, e.g. "addr"
+			 * is not from "mp" or "mp" has MEMPOOL_F_NON_IO set.
+			 */
+			if (lkey != UINT32_MAX)
+				return lkey;
+		}
+		/* Fallback for generic mechanism in corner cases. */
+	}
 	lkey = mlx5_tx_addr2mr_bh(txq, addr);
 	if (lkey == UINT32_MAX && rte_errno == ENXIO) {
 		/* Mempool may have externally allocated memory. */
@@ -392,72 +395,3 @@ mlx5_tx_update_ext_mp(struct mlx5_txq_data *txq, uintptr_t addr,
 	mlx5_mr_update_ext_mp(ETH_DEV(priv), mr_ctrl, mp);
 	return mlx5_tx_addr2mr_bh(txq, addr);
 }
-
-/* Called during rte_mempool_mem_iter() by mlx5_mr_update_mp(). */
-static void
-mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
-		     struct rte_mempool_memhdr *memhdr,
-		     unsigned mem_idx __rte_unused)
-{
-	struct mr_update_mp_data *data = opaque;
-	struct rte_eth_dev *dev = data->dev;
-	struct mlx5_priv *priv = dev->data->dev_private;
-
-	uint32_t lkey;
-
-	/* Stop iteration if failed in the previous walk. */
-	if (data->ret < 0)
-		return;
-	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr,
-				  priv->config.mr_ext_memseg_en);
-	if (lkey == UINT32_MAX)
-		data->ret = -1;
-}
-
-/**
- * Register entire memory chunks in a Mempool.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param mp
- *   Pointer to registering Mempool.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-int
-mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		  struct rte_mempool *mp)
-{
-	struct mr_update_mp_data data = {
-		.dev = dev,
-		.mr_ctrl = mr_ctrl,
-		.ret = 0,
-	};
-	uint32_t flags = rte_pktmbuf_priv_flags(mp);
-
-	if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
-		/*
-		 * The pinned external buffer should be registered for DMA
-		 * operations by application. The mem_list of the pool contains
-		 * the list of chunks with mbuf structures w/o built-in data
-		 * buffers and DMA actually does not happen there, no need
-		 * to create MR for these chunks.
-		 */
-		return 0;
-	}
-	DRV_LOG(DEBUG, "Port %u Rx queue registering mp %s "
-		       "having %u chunks.", dev->data->port_id,
-		       mp->name, mp->nb_mem_chunks);
-	rte_mempool_mem_iter(mp, mlx5_mr_update_mp_cb, &data);
-	if (data.ret < 0 && rte_errno == ENXIO) {
-		/* Mempool may have externally allocated memory. */
-		return mlx5_mr_update_ext_mp(dev, mr_ctrl, mp);
-	}
-	return data.ret;
-}
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 4a7fab6df2..c984e777b5 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -22,7 +22,5 @@
 
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
-int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		      struct rte_mempool *mp);
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 2b7ad3e48b..1b00076fe7 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -275,13 +275,11 @@ uint16_t mlx5_rx_burst_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 				uint16_t pkts_n);
 
-/* mlx5_mr.c */
-
-uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
+static int mlx5_rxq_mprq_enabled(struct mlx5_rxq_data *rxq);
 
 /**
- * Query LKey from a packet buffer for Rx. No need to flush local caches for Rx
- * as mempool is pre-configured and static.
+ * Query LKey from a packet buffer for Rx. No need to flush local caches
+ * as the Rx mempool database entries are valid for the lifetime of the queue.
  *
  * @param rxq
  *   Pointer to Rx queue structure.
@@ -290,11 +288,14 @@ uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
  *
  * @return
  *   Searched LKey on success, UINT32_MAX on no match.
+ *   This function always succeeds on valid input.
  */
 static __rte_always_inline uint32_t
 mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 {
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
+	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct rte_mempool *mp;
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
@@ -302,8 +303,14 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
-	/* Take slower bottom-half (Binary Search) on miss. */
-	return mlx5_rx_addr2mr_bh(rxq, addr);
+	/*
+	 * Slower search in the mempool database on miss.
+	 * During queue creation rxq->sh is not yet set, so we use rxq_ctrl.
+	 */
+	rxq_ctrl = container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	mp = mlx5_rxq_mprq_enabled(rxq) ? rxq->mprq_mp : rxq->mp;
+	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->priv->sh->share_cache,
+				     mr_ctrl, mp, addr);
 }
 
 #define mlx5_rx_mb2mr(rxq, mb) mlx5_rx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index b68443bed5..247f36e5d7 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1162,6 +1162,7 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 	unsigned int strd_sz_n = 0;
 	unsigned int i;
 	unsigned int n_ibv = 0;
+	int ret;
 
 	if (!mlx5_mprq_enabled(dev))
 		return 0;
@@ -1241,6 +1242,16 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
+	ret = mlx5_mr_mempool_register(&priv->sh->share_cache, priv->sh->pd,
+				       mp, &priv->mp_id);
+	if (ret < 0 && rte_errno != EEXIST) {
+		ret = rte_errno;
+		DRV_LOG(ERR, "port %u failed to register a mempool for Multi-Packet RQ",
+			dev->data->port_id);
+		rte_mempool_free(mp);
+		rte_errno = ret;
+		return -rte_errno;
+	}
 	priv->mprq_mp = mp;
 exit:
 	/* Set mempool for each Rx queue. */
@@ -1443,6 +1454,8 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		/* rte_errno is already set. */
 		goto error;
 	}
+	/* Rx queues don't use this pointer, but we want a valid structure. */
+	tmpl->rxq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	tmpl->socket = socket;
 	if (dev->data->dev_conf.intr_conf.rxq)
 		tmpl->irq = 1;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 54173bfacb..3cbf5816a1 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -105,6 +105,59 @@ mlx5_txq_start(struct rte_eth_dev *dev)
 	return -rte_errno;
 }
 
+/**
+ * Translate the chunk address to MR key in order to put in into the cache.
+ */
+static void
+mlx5_rxq_mempool_register_cb(struct rte_mempool *mp, void *opaque,
+			     struct rte_mempool_memhdr *memhdr,
+			     unsigned int idx)
+{
+	struct mlx5_rxq_data *rxq = opaque;
+
+	RTE_SET_USED(mp);
+	RTE_SET_USED(idx);
+	mlx5_rx_addr2mr(rxq, (uintptr_t)memhdr->addr);
+}
+
+/**
+ * Register Rx queue mempools and fill the Rx queue cache.
+ * This function tolerates repeated mempool registration.
+ *
+ * @param[in] rxq_ctrl
+ *   Rx queue control data.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+static int
+mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	struct mlx5_priv *priv = rxq_ctrl->priv;
+	struct rte_mempool *mp;
+	uint32_t s;
+	int ret = 0;
+
+	mlx5_mr_flush_local_cache(&rxq_ctrl->rxq.mr_ctrl);
+	/* MPRQ mempool is registered on creation, just fill the cache. */
+	if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
+		rte_mempool_mem_iter(rxq_ctrl->rxq.mprq_mp,
+				     mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+		return 0;
+	}
+	for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++) {
+		mp = rxq_ctrl->rxq.rxseg[s].mp;
+		ret = mlx5_mr_mempool_register(&priv->sh->share_cache,
+					       priv->sh->pd, mp, &priv->mp_id);
+		if (ret < 0 && rte_errno != EEXIST)
+			return ret;
+		rte_mempool_mem_iter(mp, mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+	}
+	return 0;
+}
+
 /**
  * Stop traffic on Rx queues.
  *
@@ -152,18 +205,13 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (!rxq_ctrl)
 			continue;
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-			/* Pre-register Rx mempools. */
-			if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
-				mlx5_mr_update_mp(dev, &rxq_ctrl->rxq.mr_ctrl,
-						  rxq_ctrl->rxq.mprq_mp);
-			} else {
-				uint32_t s;
-
-				for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++)
-					mlx5_mr_update_mp
-						(dev, &rxq_ctrl->rxq.mr_ctrl,
-						rxq_ctrl->rxq.rxseg[s].mp);
-			}
+			/*
+			 * Pre-register the mempools. Regardless of whether
+			 * the implicit registration is enabled or not,
+			 * Rx mempool destruction is tracked to free MRs.
+			 */
+			if (mlx5_rxq_mempool_register(rxq_ctrl) < 0)
+				goto error;
 			ret = rxq_alloc_elts(rxq_ctrl);
 			if (ret)
 				goto error;
@@ -1124,6 +1172,11 @@ mlx5_dev_start(struct rte_eth_dev *dev)
 			dev->data->port_id, strerror(rte_errno));
 		goto error;
 	}
+	if (mlx5_dev_ctx_shared_mempool_subscribe(dev) != 0) {
+		DRV_LOG(ERR, "port %u failed to subscribe for mempool life cycle: %s",
+			dev->data->port_id, rte_strerror(rte_errno));
+		goto error;
+	}
 	rte_wmb();
 	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
 	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
diff --git a/drivers/net/mlx5/windows/mlx5_os.c b/drivers/net/mlx5/windows/mlx5_os.c
index 26fa927039..149253d174 100644
--- a/drivers/net/mlx5/windows/mlx5_os.c
+++ b/drivers/net/mlx5/windows/mlx5_os.c
@@ -1116,6 +1116,7 @@ mlx5_os_net_probe(struct rte_device *dev)
 	dev_config.txqs_inline = MLX5_ARG_UNSET;
 	dev_config.vf_nl_en = 0;
 	dev_config.mr_ext_memseg_en = 1;
+	dev_config.mr_mempool_reg_en = 1;
 	dev_config.mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	dev_config.mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	dev_config.dv_esw_en = 0;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-10-15  8:52         ` Andrew Rybchenko
  2021-10-15  9:13           ` Dmitry Kozlyuk
  2021-10-19 13:08           ` Dmitry Kozlyuk
  2021-10-15 12:12         ` Olivier Matz
  1 sibling, 2 replies; 82+ messages in thread
From: Andrew Rybchenko @ 2021-10-15  8:52 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Matan Azrad, Olivier Matz, Ray Kinsella, Anatoly Burakov

On 10/13/21 2:01 PM, Dmitry Kozlyuk wrote:
> Data path performance can benefit if the PMD knows which memory it will
> need to handle in advance, before the first mbuf is sent to the PMD.
> It is impractical, however, to consider all allocated memory for this
> purpose. Most often mbuf memory comes from mempools that can come and
> go. PMD can enumerate existing mempools on device start, but it also
> needs to track creation and destruction of mempools after the forwarding
> starts but before an mbuf from the new mempool is sent to the device.
> 
> Add an API to register callback for mempool life cycle events:
> * rte_mempool_event_callback_register()
> * rte_mempool_event_callback_unregister()
> Currently tracked events are:
> * RTE_MEMPOOL_EVENT_READY (after populating a mempool)
> * RTE_MEMPOOL_EVENT_DESTROY (before freeing a mempool)
> Provide a unit test for the new API.
> The new API is internal, because it is primarily demanded by PMDs that
> may need to deal with any mempools and do not control their creation,
> while an application, on the other hand, knows which mempools it creates
> and doesn't care about internal mempools PMDs might create.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>

With below review notes processed

Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>

[snip]

> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> index c5f859ae71..51c0ba2931 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c

[snip]

> @@ -360,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
>  	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
>  	mp->nb_mem_chunks++;
>  
> +	/* Report the mempool as ready only when fully populated. */
> +	if (mp->populated_size >= mp->size)
> +		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
> +
>  	rte_mempool_trace_populate_iova(mp, vaddr, iova, len, free_cb, opaque);
>  	return i;
>  
> @@ -722,6 +738,7 @@ rte_mempool_free(struct rte_mempool *mp)
>  	}
>  	rte_mcfg_tailq_write_unlock();
>  
> +	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
>  	rte_mempool_trace_free(mp);
>  	rte_mempool_free_memchunks(mp);
>  	rte_mempool_ops_free(mp);
> @@ -1343,3 +1360,123 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
>  
>  	rte_mcfg_mempool_read_unlock();
>  }
> +
> +struct mempool_callback {

It sounds like it is a mempool callback itself.
Consider: mempool_event_callback_data.
I think this way it will be consistent.

> +	rte_mempool_event_callback *func;
> +	void *user_data;
> +};
> +
> +static void
> +mempool_event_callback_invoke(enum rte_mempool_event event,
> +			      struct rte_mempool *mp)
> +{
> +	struct mempool_callback_list *list;
> +	struct rte_tailq_entry *te;
> +	void *tmp_te;
> +
> +	rte_mcfg_tailq_read_lock();
> +	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
> +	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
> +		struct mempool_callback *cb = te->data;
> +		rte_mcfg_tailq_read_unlock();
> +		cb->func(event, mp, cb->user_data);
> +		rte_mcfg_tailq_read_lock();
> +	}
> +	rte_mcfg_tailq_read_unlock();
> +}
> +
> +int
> +rte_mempool_event_callback_register(rte_mempool_event_callback *func,
> +				    void *user_data)
> +{
> +	struct mempool_callback_list *list;
> +	struct rte_tailq_entry *te = NULL;
> +	struct mempool_callback *cb;
> +	void *tmp_te;
> +	int ret;
> +
> +	if (func == NULL) {
> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +
> +	rte_mcfg_mempool_read_lock();
> +	rte_mcfg_tailq_write_lock();
> +
> +	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
> +	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
> +		struct mempool_callback *cb =
> +					(struct mempool_callback *)te->data;
> +		if (cb->func == func && cb->user_data == user_data) {
> +			ret = -EEXIST;
> +			goto exit;
> +		}
> +	}
> +
> +	te = rte_zmalloc("MEMPOOL_TAILQ_ENTRY", sizeof(*te), 0);
> +	if (te == NULL) {
> +		RTE_LOG(ERR, MEMPOOL,
> +			"Cannot allocate event callback tailq entry!\n");
> +		ret = -ENOMEM;
> +		goto exit;
> +	}
> +
> +	cb = rte_malloc("MEMPOOL_EVENT_CALLBACK", sizeof(*cb), 0);
> +	if (cb == NULL) {
> +		RTE_LOG(ERR, MEMPOOL,
> +			"Cannot allocate event callback!\n");
> +		rte_free(te);
> +		ret = -ENOMEM;
> +		goto exit;
> +	}
> +
> +	cb->func = func;
> +	cb->user_data = user_data;
> +	te->data = cb;
> +	TAILQ_INSERT_TAIL(list, te, next);
> +	ret = 0;
> +
> +exit:
> +	rte_mcfg_tailq_write_unlock();
> +	rte_mcfg_mempool_read_unlock();
> +	rte_errno = -ret;
> +	return ret;
> +}
> +
> +int
> +rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
> +				      void *user_data)
> +{
> +	struct mempool_callback_list *list;
> +	struct rte_tailq_entry *te = NULL;
> +	struct mempool_callback *cb;
> +	int ret;
> +
> +	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
> +		rte_errno = EPERM;
> +		return -1;

The function should behave consistently. Below you
return a negative error, here just -1. I think it
would be more consistent to return -rte_errno here.
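I.e. something like (untested):

	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
		rte_errno = EPERM;
		return -rte_errno;
	}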

> +	}
> +
> +	rte_mcfg_mempool_read_lock();
> +	rte_mcfg_tailq_write_lock();
> +	ret = -ENOENT;
> +	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
> +	TAILQ_FOREACH(te, list, next) {
> +		cb = (struct mempool_callback *)te->data;
> +		if (cb->func == func && cb->user_data == user_data)
> +			break;
> +	}
> +	if (te != NULL) {

Here we rely on the fact that TAILQ_FOREACH()
exits with te==NULL in the case of no such
entry. I'd suggest avoiding that assumption,
i.e. move the two lines below up before the break
and drop the if condition here altogether.
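Something like this (untested):

	TAILQ_FOREACH(te, list, next) {
		cb = (struct mempool_callback *)te->data;
		if (cb->func == func && cb->user_data == user_data) {
			TAILQ_REMOVE(list, te, next);
			ret = 0;
			break;
		}
	}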

> +		TAILQ_REMOVE(list, te, next);
> +		ret = 0;
> +	}
> +	rte_mcfg_tailq_write_unlock();
> +	rte_mcfg_mempool_read_unlock();
> +
> +	if (ret == 0) {
> +		rte_free(te);
> +		rte_free(cb);
> +	}
> +	rte_errno = -ret;
> +	return ret;
> +}
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index f57ecbd6fc..663123042f 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1774,6 +1774,67 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
>  int
>  rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
>  
> +/**
> + * Mempool event type.
> + * @internal

Shouldn't internal go first?

> + */
> +enum rte_mempool_event {
> +	/** Occurs after a mempool is successfully populated. */

successfully -> fully ?

> +	RTE_MEMPOOL_EVENT_READY = 0,
> +	/** Occurs before destruction of a mempool begins. */
> +	RTE_MEMPOOL_EVENT_DESTROY = 1,
> +};

[snip]


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag Dmitry Kozlyuk
@ 2021-10-15  9:01         ` Andrew Rybchenko
  2021-10-15  9:18           ` Dmitry Kozlyuk
  2021-10-15  9:25         ` David Marchand
  2021-10-15 13:19         ` Olivier Matz
  2 siblings, 1 reply; 82+ messages in thread
From: Andrew Rybchenko @ 2021-10-15  9:01 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev; +Cc: Matan Azrad, Olivier Matz

On 10/13/21 2:01 PM, Dmitry Kozlyuk wrote:
> Mempool is a generic allocator that is not necessarily used for device
> IO operations and its memory for DMA. Add MEMPOOL_F_NON_IO flag to mark
> such mempools automatically if their objects are not contiguous
> or IOVA are not available. Components can inspect this flag
> in order to optimize their memory management.
> Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>

See review notes below. With review notes processed:

Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>

[snip]

> diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> index f643a61f44..74e0e6f495 100644
> --- a/doc/guides/rel_notes/release_21_11.rst
> +++ b/doc/guides/rel_notes/release_21_11.rst
> @@ -226,6 +226,9 @@ API Changes
>    the crypto/security operation. This field will be used to communicate
>    events such as soft expiry with IPsec in lookaside mode.
>  
> +* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
> +  that objects from this pool will not be used for device IO (e.g. DMA).
> +
>  
>  ABI Changes
>  -----------
> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> index 51c0ba2931..2204f140b3 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -371,6 +371,8 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
>  
>  	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
>  	mp->nb_mem_chunks++;
> +	if (iova == RTE_BAD_IOVA)
> +		mp->flags |= MEMPOOL_F_NON_IO;

As I understand, rte_mempool_populate_iova() may be called
a few times for one mempool. The flag must be set if all
invocations are done with RTE_BAD_IOVA. So, it should be
set by default and just removed when iova != RTE_BAD_IOVA
happens.

Yes, it is a corner case. Maybe it makes sense to
cover it by a unit test as well.
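Roughly (untested, the exact place where the default is set is up to
you, e.g. on mempool creation):

	/* on creation: assume non-IO until a chunk with valid IOVA appears */
	mp->flags |= MEMPOOL_F_NON_IO;

	/* in rte_mempool_populate_iova(): */
	if (iova != RTE_BAD_IOVA)
		mp->flags &= ~MEMPOOL_F_NON_IO;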

>  
>  	/* Report the mempool as ready only when fully populated. */
>  	if (mp->populated_size >= mp->size)
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 663123042f..029b62a650 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -262,6 +262,8 @@ struct rte_mempool {
>  #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
>  #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
>  #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
> +#define MEMPOOL_F_NON_IO         0x0040
> +		/**< Internal: pool is not usable for device IO (DMA). */

Please, put the documentation before the define.
/** Internal: pool is not usable for device IO (DMA). */
#define MEMPOOL_F_NON_IO         0x0040

>  
>  /**
>   * @internal When debug is enabled, store some statistics.
> @@ -991,6 +993,9 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, void *);
>   *     "single-consumer". Otherwise, it is "multi-consumers".
>   *   - MEMPOOL_F_NO_IOVA_CONTIG: If set, allocated objects won't
>   *     necessarily be contiguous in IO memory.
> + *   - MEMPOOL_F_NON_IO: If set, the mempool is considered to be
> + *     never used for device IO, i.e. for DMA operations.
> + *     It's a hint to other components and does not affect the mempool behavior.

I tend to say that it should not be here if the flag is
internal.

>   * @return
>   *   The pointer to the new allocated mempool, on success. NULL on error
>   *   with rte_errno set appropriately. Possible rte_errno values include:
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks
  2021-10-15  8:52         ` Andrew Rybchenko
@ 2021-10-15  9:13           ` Dmitry Kozlyuk
  2021-10-19 13:08           ` Dmitry Kozlyuk
  1 sibling, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15  9:13 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: Matan Azrad, Olivier Matz, Ray Kinsella, Anatoly Burakov

> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> [...]
> With below review notes processed
> 
> Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> 

Thanks for the comments, I'll fix them all, just a small note below FYI.

> > +     rte_mcfg_mempool_read_lock();
> > +     rte_mcfg_tailq_write_lock();
> > +     ret = -ENOENT;
> > +     list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
> > +     TAILQ_FOREACH(te, list, next) {
> > +             cb = (struct mempool_callback *)te->data;
> > +             if (cb->func == func && cb->user_data == user_data)
> > +                     break;
> > +     }
> > +     if (te != NULL) {
> 
> Here we rely on the fact that TAILQ_FOREACH() exists with te==NULL in the
> case of no such entry. I'd suggest to avoid the assumption.
> I.e. do below two lines above before break and have not the if condition
> her at all.

Since you asked the question, the code is non-obvious, so I'll change it.
FWIW, man 3 tailq:

    TAILQ_FOREACH() traverses the queue referenced by head in the forward
    direction, assigning each element in turn to var.  var is set to NULL
    if the loop completes normally, or if there were no elements.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15  9:01         ` Andrew Rybchenko
@ 2021-10-15  9:18           ` Dmitry Kozlyuk
  2021-10-15  9:33             ` Andrew Rybchenko
  0 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15  9:18 UTC (permalink / raw)
  To: Andrew Rybchenko, dev; +Cc: Matan Azrad, Olivier Matz

> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> [...]
> > diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> > index 51c0ba2931..2204f140b3 100644
> > --- a/lib/mempool/rte_mempool.c
> > +++ b/lib/mempool/rte_mempool.c
> > @@ -371,6 +371,8 @@ rte_mempool_populate_iova(struct rte_mempool *mp,
> > char *vaddr,
> >
> >       STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
> >       mp->nb_mem_chunks++;
> > +     if (iova == RTE_BAD_IOVA)
> > +             mp->flags |= MEMPOOL_F_NON_IO;
> 
> As I understand rte_mempool_populate_iova() may be called few times for
> one mempool. The flag must be set if all invocations are done with
> RTE_BAD_IOVA. So, it should be set by default and just removed when iova
> != RTE_BAD_IOVA happens.

I don't agree at all. If any object of the pool is unsuitable for IO,
the pool cannot be considered suitable for IO. So if there's a single
invocation with RTE_BAD_IOVA, the flag must be set forever.

> Yes, it is a corner case. May be it makes sense to cover it by unit test
> as well.

True for either your logic or mine, I'll add it.

Ack on the rest of the comments, thanks.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag Dmitry Kozlyuk
  2021-10-15  9:01         ` Andrew Rybchenko
@ 2021-10-15  9:25         ` David Marchand
  2021-10-15 10:42           ` Dmitry Kozlyuk
  2021-10-15 13:19         ` Olivier Matz
  2 siblings, 1 reply; 82+ messages in thread
From: David Marchand @ 2021-10-15  9:25 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Andrew Rybchenko, Matan Azrad, Olivier Matz

Hello Dmitry,

On Wed, Oct 13, 2021 at 1:02 PM Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com> wrote:
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 663123042f..029b62a650 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -262,6 +262,8 @@ struct rte_mempool {
>  #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
>  #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
>  #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
> +#define MEMPOOL_F_NON_IO         0x0040
> +               /**< Internal: pool is not usable for device IO (DMA). */
>
>  /**
>   * @internal When debug is enabled, store some statistics.
> @@ -991,6 +993,9 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, void *);
>   *     "single-consumer". Otherwise, it is "multi-consumers".
>   *   - MEMPOOL_F_NO_IOVA_CONTIG: If set, allocated objects won't
>   *     necessarily be contiguous in IO memory.
> + *   - MEMPOOL_F_NON_IO: If set, the mempool is considered to be
> + *     never used for device IO, i.e. for DMA operations.
> + *     It's a hint to other components and does not affect the mempool behavior.
>   * @return
>   *   The pointer to the new allocated mempool, on success. NULL on error
>   *   with rte_errno set appropriately. Possible rte_errno values include:

- When rebasing on main, you probably won't be able to call this new flag.
The diff should be something like:

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index d886f4800c..35c80291fa 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -214,7 +214,7 @@ static int test_mempool_creation_with_unknown_flag(void)
                MEMPOOL_ELT_SIZE, 0, 0,
                NULL, NULL,
                NULL, NULL,
-               SOCKET_ID_ANY, MEMPOOL_F_NO_IOVA_CONTIG << 1);
+               SOCKET_ID_ANY, MEMPOOL_F_NON_IO << 1);

        if (mp_cov != NULL) {
                rte_mempool_free(mp_cov);
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 8d5f99f7e7..27d197fe86 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -802,6 +802,7 @@ rte_mempool_cache_free(struct rte_mempool_cache *cache)
        | MEMPOOL_F_SC_GET \
        | MEMPOOL_F_POOL_CREATED \
        | MEMPOOL_F_NO_IOVA_CONTIG \
+       | MEMPOOL_F_NON_IO \
        )
 /* create an empty mempool */
 struct rte_mempool *


- While grepping, I noticed that proc-info also dumps mempool flags.
This could be something to enhance, maybe amending current
rte_mempool_dump() and having this tool use it.
But for now, can you update this tool too?


-- 
David Marchand


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15  9:18           ` Dmitry Kozlyuk
@ 2021-10-15  9:33             ` Andrew Rybchenko
  2021-10-15  9:38               ` Dmitry Kozlyuk
  2021-10-15  9:43               ` Olivier Matz
  0 siblings, 2 replies; 82+ messages in thread
From: Andrew Rybchenko @ 2021-10-15  9:33 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev; +Cc: Matan Azrad, Olivier Matz

On 10/15/21 12:18 PM, Dmitry Kozlyuk wrote:
>> -----Original Message-----
>> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>> [...]
>>> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
>>> index 51c0ba2931..2204f140b3 100644
>>> --- a/lib/mempool/rte_mempool.c
>>> +++ b/lib/mempool/rte_mempool.c
>>> @@ -371,6 +371,8 @@ rte_mempool_populate_iova(struct rte_mempool *mp,
>>> char *vaddr,
>>>
>>>       STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
>>>       mp->nb_mem_chunks++;
>>> +     if (iova == RTE_BAD_IOVA)
>>> +             mp->flags |= MEMPOOL_F_NON_IO;
>>
>> As I understand rte_mempool_populate_iova() may be called few times for
>> one mempool. The flag must be set if all invocations are done with
>> RTE_BAD_IOVA. So, it should be set by default and just removed when iova
>> != RTE_BAD_IOVA happens.
> 
> I don't agree at all. If any object of the pool is unsuitable for IO,
> the pool cannot be considered suitable for IO. So if there's a single
> invocation with RTE_BAD_IOVA, the flag must be set forever.

If so, some objects may be used for IO, while some cannot.
What should happen if an application allocates an object
which is suitable for IO and tries to use it this way?

> 
>> Yes, it is a corner case. May be it makes sense to cover it by unit test
>> as well.
> 
> True for either your logic or mine, I'll add it.
> 
> Ack on the rest of the comments, thanks.
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15  9:33             ` Andrew Rybchenko
@ 2021-10-15  9:38               ` Dmitry Kozlyuk
  2021-10-15  9:43               ` Olivier Matz
  1 sibling, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15  9:38 UTC (permalink / raw)
  To: Andrew Rybchenko, dev; +Cc: Matan Azrad, Olivier Matz



> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: 15 October 2021 12:34
> To: Dmitry Kozlyuk <dkozlyuk@nvidia.com>; dev@dpdk.org
> Cc: Matan Azrad <matan@nvidia.com>; Olivier Matz <olivier.matz@6wind.com>
> Subject: Re: [PATCH v4 2/4] mempool: add non-IO flag
> 
> On 10/15/21 12:18 PM, Dmitry Kozlyuk wrote:
> >> -----Original Message-----
> >> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru> [...]
> >>> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> >>> index 51c0ba2931..2204f140b3 100644
> >>> --- a/lib/mempool/rte_mempool.c
> >>> +++ b/lib/mempool/rte_mempool.c
> >>> @@ -371,6 +371,8 @@ rte_mempool_populate_iova(struct rte_mempool
> >>> *mp, char *vaddr,
> >>>
> >>>       STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
> >>>       mp->nb_mem_chunks++;
> >>> +     if (iova == RTE_BAD_IOVA)
> >>> +             mp->flags |= MEMPOOL_F_NON_IO;
> >>
> >> As I understand rte_mempool_populate_iova() may be called few times
> >> for one mempool. The flag must be set if all invocations are done
> >> with RTE_BAD_IOVA. So, it should be set by default and just removed
> >> when iova != RTE_BAD_IOVA happens.
> >
> > I don't agree at all. If any object of the pool is unsuitable for IO,
> > the pool cannot be considered suitable for IO. So if there's a single
> > invocation with RTE_BAD_IOVA, the flag must be set forever.
> 
> If so, some objects may be used for IO, some cannot be used.
> What should happen if an application allocates an object which is suitable
> for IO and try to use it this way?

Never mind, I was thinking in terms of v1, where the application marked mempools
as not suitable for IO. Since now they are marked automatically, you're correct:
the flag must be set if and only if there is no chance
that objects from this pool will be used for IO.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15  9:33             ` Andrew Rybchenko
  2021-10-15  9:38               ` Dmitry Kozlyuk
@ 2021-10-15  9:43               ` Olivier Matz
  2021-10-15  9:58                 ` Dmitry Kozlyuk
  1 sibling, 1 reply; 82+ messages in thread
From: Olivier Matz @ 2021-10-15  9:43 UTC (permalink / raw)
  To: Andrew Rybchenko; +Cc: Dmitry Kozlyuk, dev, Matan Azrad

On Fri, Oct 15, 2021 at 12:33:31PM +0300, Andrew Rybchenko wrote:
> On 10/15/21 12:18 PM, Dmitry Kozlyuk wrote:
> >> -----Original Message-----
> >> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> >> [...]
> >>> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> >>> index 51c0ba2931..2204f140b3 100644
> >>> --- a/lib/mempool/rte_mempool.c
> >>> +++ b/lib/mempool/rte_mempool.c
> >>> @@ -371,6 +371,8 @@ rte_mempool_populate_iova(struct rte_mempool *mp,
> >>> char *vaddr,
> >>>
> >>>       STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
> >>>       mp->nb_mem_chunks++;
> >>> +     if (iova == RTE_BAD_IOVA)
> >>> +             mp->flags |= MEMPOOL_F_NON_IO;
> >>
> >> As I understand rte_mempool_populate_iova() may be called few times for
> >> one mempool. The flag must be set if all invocations are done with
> >> RTE_BAD_IOVA. So, it should be set by default and just removed when iova
> >> != RTE_BAD_IOVA happens.
> > 
> > I don't agree at all. If any object of the pool is unsuitable for IO,
> > the pool cannot be considered suitable for IO. So if there's a single
> > invocation with RTE_BAD_IOVA, the flag must be set forever.
> 
> If so, some objects may be used for IO, some cannot be used.
> What should happen if an application allocates an object
> which is suitable for IO and try to use it this way?

If the application can predict if the allocated object is usable for IO
before allocating it, I would be surprised to have it used for IO. I agree
with Dmitry here.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15  9:43               ` Olivier Matz
@ 2021-10-15  9:58                 ` Dmitry Kozlyuk
  2021-10-15 12:11                   ` Olivier Matz
  0 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15  9:58 UTC (permalink / raw)
  To: Olivier Matz, Andrew Rybchenko; +Cc: dev, Matan Azrad



> -----Original Message-----
> From: Olivier Matz <olivier.matz@6wind.com>
> Sent: 15 October 2021 12:43
> To: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Cc: Dmitry Kozlyuk <dkozlyuk@nvidia.com>; dev@dpdk.org; Matan Azrad
> <matan@nvidia.com>
> Subject: Re: [PATCH v4 2/4] mempool: add non-IO flag
> 
> On Fri, Oct 15, 2021 at 12:33:31PM +0300, Andrew Rybchenko wrote:
> > On 10/15/21 12:18 PM, Dmitry Kozlyuk wrote:
> > >> -----Original Message-----
> > >> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru> [...]
> > >>> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> > >>> index 51c0ba2931..2204f140b3 100644
> > >>> --- a/lib/mempool/rte_mempool.c
> > >>> +++ b/lib/mempool/rte_mempool.c
> > >>> @@ -371,6 +371,8 @@ rte_mempool_populate_iova(struct rte_mempool
> > >>> *mp, char *vaddr,
> > >>>
> > >>>       STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
> > >>>       mp->nb_mem_chunks++;
> > >>> +     if (iova == RTE_BAD_IOVA)
> > >>> +             mp->flags |= MEMPOOL_F_NON_IO;
> > >>
> > >> As I understand rte_mempool_populate_iova() may be called few times
> > >> for one mempool. The flag must be set if all invocations are done
> > >> with RTE_BAD_IOVA. So, it should be set by default and just removed
> > >> when iova != RTE_BAD_IOVA happens.
> > >
> > > I don't agree at all. If any object of the pool is unsuitable for
> > > IO, the pool cannot be considered suitable for IO. So if there's a
> > > single invocation with RTE_BAD_IOVA, the flag must be set forever.
> >
> > If so, some objects may be used for IO, some cannot be used.
> > What should happen if an application allocates an object which is
> > suitable for IO and try to use it this way?
> 
> If the application can predict if the allocated object is usable for IO
> before allocating it, I would be surprised to have it used for IO. I agree
> with Dmitry here.

The flag hints to components, PMDs above all,
that objects from this mempool will never be used for IO,
so that the component can save some memory mapping or DMA configuration.
If the flag is set while even a single object may be used for IO,
the consumer of the flag will not be ready for that.
However much of a corner case it is, Andrew is correct.
There is a subtle difference between "pool is not usable"
(as described now) and "objects from this mempool will never be used"
(as stated above); I'll highlight it in the flag description.
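
For example, a PMD-side consumer of the hint would look roughly like
this (illustration only, not a concrete driver change):

	/* Skip DMA mapping work for pools that will never do IO. */
	if (mp->flags & MEMPOOL_F_NON_IO)
		return 0;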


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15  9:25         ` David Marchand
@ 2021-10-15 10:42           ` Dmitry Kozlyuk
  2021-10-15 11:41             ` David Marchand
  0 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15 10:42 UTC (permalink / raw)
  To: David Marchand; +Cc: dev, Andrew Rybchenko, Matan Azrad, Olivier Matz

Hello David,

> [...]
> - When rebasing on main, you probably won't be able to call this new flag.
> The diff should be something like:
> 
> diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c index
> d886f4800c..35c80291fa 100644
> --- a/app/test/test_mempool.c
> +++ b/app/test/test_mempool.c
> @@ -214,7 +214,7 @@ static int
> test_mempool_creation_with_unknown_flag(void)
>                 MEMPOOL_ELT_SIZE, 0, 0,
>                 NULL, NULL,
>                 NULL, NULL,
> -               SOCKET_ID_ANY, MEMPOOL_F_NO_IOVA_CONTIG << 1);
> +               SOCKET_ID_ANY, MEMPOOL_F_NON_IO << 1);
> 
>         if (mp_cov != NULL) {
>                 rte_mempool_free(mp_cov); diff --git
> a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c index
> 8d5f99f7e7..27d197fe86 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -802,6 +802,7 @@ rte_mempool_cache_free(struct rte_mempool_cache
> *cache)
>         | MEMPOOL_F_SC_GET \
>         | MEMPOOL_F_POOL_CREATED \
>         | MEMPOOL_F_NO_IOVA_CONTIG \
> +       | MEMPOOL_F_NON_IO \

I wonder why CREATED and NON_IO should be listed here:
they are not supposed to be passed by the user,
which is what MEMPOOL_KNOWN_FLAGS is used for.
The same question stands for the test code.
Could you confirm your suggestion?

>         )
>  /* create an empty mempool */
>  struct rte_mempool *
> 
> 
> - While grepping, I noticed that proc-info also dumps mempool flags.
> This could be something to enhance, maybe amending current
> rte_mempool_dump() and having this tool use it.
> But for now, can you update this tool too?

I will, thanks for the hints.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15 10:42           ` Dmitry Kozlyuk
@ 2021-10-15 11:41             ` David Marchand
  2021-10-15 12:13               ` Olivier Matz
  0 siblings, 1 reply; 82+ messages in thread
From: David Marchand @ 2021-10-15 11:41 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Andrew Rybchenko, Matan Azrad, Olivier Matz

On Fri, Oct 15, 2021 at 12:42 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
> > a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c index
> > 8d5f99f7e7..27d197fe86 100644
> > --- a/lib/mempool/rte_mempool.c
> > +++ b/lib/mempool/rte_mempool.c
> > @@ -802,6 +802,7 @@ rte_mempool_cache_free(struct rte_mempool_cache
> > *cache)
> >         | MEMPOOL_F_SC_GET \
> >         | MEMPOOL_F_POOL_CREATED \
> >         | MEMPOOL_F_NO_IOVA_CONTIG \
> > +       | MEMPOOL_F_NON_IO \
>
> I wonder why CREATED and NON_IO should be listed here:
> they are not supposed to be passed by the user,
> which is what MEMPOOL_KNOWN_FLAGS is used for.
> The same question stands for the test code.
> Could you confirm your suggestion?

There was no distinction in the API for valid flags so far, and indeed
I did not pay attention to MEMPOOL_F_POOL_CREATED and its internal
aspect.
(That's the problem when mixing stuff together)

We could separate internal and exposed flags in different fields, but
it seems overkill.
It would be seen as an API change too, if applications were checking
for this flag.
So let's keep this as is.

As you suggest, we should exclude those internal flags from
KNOWN_FLAGS (probably rename it too), and we will have to export this
define for the unit test since the check had been written with
contiguous valid flags in mind.
If your new flag is internal only, I agree we must skip it.

I'll prepare a patch for mempool.

-- 
David Marchand


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15  9:58                 ` Dmitry Kozlyuk
@ 2021-10-15 12:11                   ` Olivier Matz
  0 siblings, 0 replies; 82+ messages in thread
From: Olivier Matz @ 2021-10-15 12:11 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: Andrew Rybchenko, dev, Matan Azrad

On Fri, Oct 15, 2021 at 09:58:49AM +0000, Dmitry Kozlyuk wrote:
> 
> 
> > -----Original Message-----
> > From: Olivier Matz <olivier.matz@6wind.com>
> > Sent: 15 October 2021 12:43
> > To: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Cc: Dmitry Kozlyuk <dkozlyuk@nvidia.com>; dev@dpdk.org; Matan Azrad
> > <matan@nvidia.com>
> > Subject: Re: [PATCH v4 2/4] mempool: add non-IO flag
> > 
> > On Fri, Oct 15, 2021 at 12:33:31PM +0300, Andrew Rybchenko wrote:
> > > On 10/15/21 12:18 PM, Dmitry Kozlyuk wrote:
> > > >> -----Original Message-----
> > > >> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru> [...]
> > > >>> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> > > >>> index 51c0ba2931..2204f140b3 100644
> > > >>> --- a/lib/mempool/rte_mempool.c
> > > >>> +++ b/lib/mempool/rte_mempool.c
> > > >>> @@ -371,6 +371,8 @@ rte_mempool_populate_iova(struct rte_mempool
> > > >>> *mp, char *vaddr,
> > > >>>
> > > >>>       STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
> > > >>>       mp->nb_mem_chunks++;
> > > >>> +     if (iova == RTE_BAD_IOVA)
> > > >>> +             mp->flags |= MEMPOOL_F_NON_IO;
> > > >>
> > > >> As I understand rte_mempool_populate_iova() may be called few times
> > > >> for one mempool. The flag must be set if all invocations are done
> > > >> with RTE_BAD_IOVA. So, it should be set by default and just removed
> > > >> when iova != RTE_BAD_IOVA happens.
> > > >
> > > > I don't agree at all. If any object of the pool is unsuitable for
> > > > IO, the pool cannot be considered suitable for IO. So if there's a
> > > > single invocation with RTE_BAD_IOVA, the flag must be set forever.
> > >
> > > If so, some objects may be used for IO, some cannot be used.
> > > What should happen if an application allocates an object which is
> > > suitable for IO and try to use it this way?
> > 
> > If the application can predict if the allocated object is usable for IO
> > before allocating it, I would be surprised to have it used for IO. I agree
> > with Dmitry here.
> 
> The flag hints to components, PMDs before all,
> that objects from this mempool will never be used for IO,
> so that the component can save some memory mapping or DMA configuration.
> If the flag is set when even a single object may be used for IO,
> the consumer of the flag will not be ready for that.
> Whatever a corner case it is, Andrew is correct.
> There is a subtle difference between "pool is not usable"
> (as described now) and "objects from this mempool will never be used"
> (as stated above), I'll highlight it in the flag description.

OK, agreed, thanks.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks Dmitry Kozlyuk
  2021-10-15  8:52         ` Andrew Rybchenko
@ 2021-10-15 12:12         ` Olivier Matz
  2021-10-15 13:07           ` Dmitry Kozlyuk
  1 sibling, 1 reply; 82+ messages in thread
From: Olivier Matz @ 2021-10-15 12:12 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Andrew Rybchenko, Matan Azrad, Ray Kinsella, Anatoly Burakov

Hi Dmitry,

On Wed, Oct 13, 2021 at 02:01:28PM +0300, Dmitry Kozlyuk wrote:
> Data path performance can benefit if the PMD knows which memory it will
> need to handle in advance, before the first mbuf is sent to the PMD.
> It is impractical, however, to consider all allocated memory for this
> purpose. Most often mbuf memory comes from mempools that can come and
> go. PMD can enumerate existing mempools on device start, but it also
> needs to track creation and destruction of mempools after the forwarding
> starts but before an mbuf from the new mempool is sent to the device.
> 
> Add an API to register callback for mempool life cycle events:
> * rte_mempool_event_callback_register()
> * rte_mempool_event_callback_unregister()
> Currently tracked events are:
> * RTE_MEMPOOL_EVENT_READY (after populating a mempool)
> * RTE_MEMPOOL_EVENT_DESTROY (before freeing a mempool)
> Provide a unit test for the new API.
> The new API is internal, because it is primarily demanded by PMDs that
> may need to deal with any mempools and do not control their creation,
> while an application, on the other hand, knows which mempools it creates
> and doesn't care about internal mempools PMDs might create.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>
> ---
>  app/test/test_mempool.c   | 209 ++++++++++++++++++++++++++++++++++++++
>  lib/mempool/rte_mempool.c | 137 +++++++++++++++++++++++++
>  lib/mempool/rte_mempool.h |  61 +++++++++++
>  lib/mempool/version.map   |   8 ++
>  4 files changed, 415 insertions(+)

(...)

> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -42,6 +42,18 @@ static struct rte_tailq_elem rte_mempool_tailq = {
>  };
>  EAL_REGISTER_TAILQ(rte_mempool_tailq)
>  
> +TAILQ_HEAD(mempool_callback_list, rte_tailq_entry);
> +
> +static struct rte_tailq_elem callback_tailq = {
> +	.name = "RTE_MEMPOOL_CALLBACK",
> +};
> +EAL_REGISTER_TAILQ(callback_tailq)
> +
> +/* Invoke all registered mempool event callbacks. */
> +static void
> +mempool_event_callback_invoke(enum rte_mempool_event event,
> +			      struct rte_mempool *mp);
> +
>  #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
>  #define CALC_CACHE_FLUSHTHRESH(c)	\
>  	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
> @@ -360,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
>  	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
>  	mp->nb_mem_chunks++;
>  
> +	/* Report the mempool as ready only when fully populated. */
> +	if (mp->populated_size >= mp->size)
> +		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
> +

One small comment here. I think it does not happen today, but in the
future, something that could happen is:
  - create empty mempool
  - populate mempool
  - use mempool
  - populate mempool with more objects
  - use mempool

I've seen one usage there: https://www.youtube.com/watch?v=SzQFn9tm4Sw

In that case, it would require a POPULATE event instead of a
MEMPOOL_CREATE event.

Enhancing the documentation to better explain when the callback is
invoked looks enough to me for the moment.
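
To spell out when the callback fires with the patch as written (a sketch;
addresses, lengths and the pool geometry are placeholders):

	/* chunked population of an empty mempool */
	mp = rte_mempool_create_empty("p", n, elt_size, 0, 0, socket_id, 0);
	rte_mempool_populate_iova(mp, va0, iova0, len0, NULL, NULL);
	/* partial: populated_size < size, no event yet */
	rte_mempool_populate_iova(mp, va1, iova1, len1, NULL, NULL);
	/* pool reaches mp->size here: RTE_MEMPOOL_EVENT_READY is reported */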

>  	rte_mempool_trace_populate_iova(mp, vaddr, iova, len, free_cb, opaque);
>  	return i;
>  
> @@ -722,6 +738,7 @@ rte_mempool_free(struct rte_mempool *mp)
>  	}
>  	rte_mcfg_tailq_write_unlock();
>  
> +	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
>  	rte_mempool_trace_free(mp);
>  	rte_mempool_free_memchunks(mp);
>  	rte_mempool_ops_free(mp);
> @@ -1343,3 +1360,123 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
>  
>  	rte_mcfg_mempool_read_unlock();
>  }
> +
> +struct mempool_callback {
> +	rte_mempool_event_callback *func;
> +	void *user_data;
> +};
> +
> +static void
> +mempool_event_callback_invoke(enum rte_mempool_event event,
> +			      struct rte_mempool *mp)
> +{
> +	struct mempool_callback_list *list;
> +	struct rte_tailq_entry *te;
> +	void *tmp_te;
> +
> +	rte_mcfg_tailq_read_lock();
> +	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
> +	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
> +		struct mempool_callback *cb = te->data;
> +		rte_mcfg_tailq_read_unlock();
> +		cb->func(event, mp, cb->user_data);
> +		rte_mcfg_tailq_read_lock();

I think it is dangerous to unlock the list before invoking the callback.
During that time, another thread can remove the next mempool callback, and
the next iteration will access a freed element, causing undefined
behavior.
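
With the loop above, the problematic interleaving would be roughly (sketch):

	rte_mcfg_tailq_read_unlock();       /* thread A drops the lock */
	                                    /* thread B unregisters and frees
	                                     * the entry saved in tmp_te */
	cb->func(event, mp, cb->user_data);
	rte_mcfg_tailq_read_lock();
	te = tmp_te;                        /* thread A: tmp_te is dangling */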

Is it a problem to keep the lock held during the callback invocation?

I see that you have a test for this, and that you wrote a comment in the
documentation:

 * rte_mempool_event_callback_register() may be called from within the callback,
 * but the callbacks registered this way will not be invoked for the same event.
 * rte_mempool_event_callback_unregister() may only be safely called
 * to remove the running callback.

But is there a use-case for this?
If not, I'd tend to say that we can document that it is not allowed to
create, free or list mempools, or to register a callback, from the callback.


> +	}
> +	rte_mcfg_tailq_read_unlock();
> +}
> +
> +int
> +rte_mempool_event_callback_register(rte_mempool_event_callback *func,
> +				    void *user_data)
> +{
> +	struct mempool_callback_list *list;
> +	struct rte_tailq_entry *te = NULL;
> +	struct mempool_callback *cb;
> +	void *tmp_te;
> +	int ret;
> +
> +	if (func == NULL) {
> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +
> +	rte_mcfg_mempool_read_lock();
> +	rte_mcfg_tailq_write_lock();
> +
> +	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
> +	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
> +		struct mempool_callback *cb =
> +					(struct mempool_callback *)te->data;
> +		if (cb->func == func && cb->user_data == user_data) {
> +			ret = -EEXIST;
> +			goto exit;
> +		}
> +	}
> +
> +	te = rte_zmalloc("MEMPOOL_TAILQ_ENTRY", sizeof(*te), 0);
> +	if (te == NULL) {
> +		RTE_LOG(ERR, MEMPOOL,
> +			"Cannot allocate event callback tailq entry!\n");
> +		ret = -ENOMEM;
> +		goto exit;
> +	}
> +
> +	cb = rte_malloc("MEMPOOL_EVENT_CALLBACK", sizeof(*cb), 0);
> +	if (cb == NULL) {
> +		RTE_LOG(ERR, MEMPOOL,
> +			"Cannot allocate event callback!\n");
> +		rte_free(te);
> +		ret = -ENOMEM;
> +		goto exit;
> +	}
> +
> +	cb->func = func;
> +	cb->user_data = user_data;
> +	te->data = cb;
> +	TAILQ_INSERT_TAIL(list, te, next);
> +	ret = 0;
> +
> +exit:
> +	rte_mcfg_tailq_write_unlock();
> +	rte_mcfg_mempool_read_unlock();
> +	rte_errno = -ret;
> +	return ret;
> +}
> +
> +int
> +rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
> +				      void *user_data)
> +{
> +	struct mempool_callback_list *list;
> +	struct rte_tailq_entry *te = NULL;
> +	struct mempool_callback *cb;
> +	int ret;
> +
> +	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
> +		rte_errno = EPERM;
> +		return -1;
> +	}

The help of the register function says
 * Callbacks will be invoked in the process that creates the mempool.

So registration is allowed from a primary or secondary process. Can't a
secondary process destroy the callback it has loaded?

> +
> +	rte_mcfg_mempool_read_lock();
> +	rte_mcfg_tailq_write_lock();

I don't understand why there are 2 locks here.

After looking at the code, I think the locking model is already
incorrect in current mempool code:

   rte_mcfg_tailq_write_lock() is used in create and free to protect the
     access to the mempool tailq

   rte_mcfg_mempool_write_lock() is used in create(), to protect from
     concurrent creation (with same name for instance), but I doubt it
     is absolutely needed, because memzone_reserve is already protected.

   rte_mcfg_mempool_read_lock() is used in dump functions, but to me
     it should use rte_mcfg_tailq_read_lock() instead.
     Currently, doing a dump and a free concurrently can cause a crash
     because they are not using the same lock.

In your case, I suggest to use only one lock to protect the callback
list. I think it can be rte_mcfg_tailq_*_lock().
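
For instance, unregistration would then be shaped roughly like this (sketch):

	rte_mcfg_tailq_write_lock();
	TAILQ_FOREACH(te, list, next) {
		cb = te->data;
		if (cb->func == func && cb->user_data == user_data) {
			TAILQ_REMOVE(list, te, next);
			break;
		}
	}
	rte_mcfg_tailq_write_unlock();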

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15 11:41             ` David Marchand
@ 2021-10-15 12:13               ` Olivier Matz
  0 siblings, 0 replies; 82+ messages in thread
From: Olivier Matz @ 2021-10-15 12:13 UTC (permalink / raw)
  To: David Marchand; +Cc: Dmitry Kozlyuk, dev, Andrew Rybchenko, Matan Azrad

On Fri, Oct 15, 2021 at 01:41:40PM +0200, David Marchand wrote:
> On Fri, Oct 15, 2021 at 12:42 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
> > > a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c index
> > > 8d5f99f7e7..27d197fe86 100644
> > > --- a/lib/mempool/rte_mempool.c
> > > +++ b/lib/mempool/rte_mempool.c
> > > @@ -802,6 +802,7 @@ rte_mempool_cache_free(struct rte_mempool_cache
> > > *cache)
> > >         | MEMPOOL_F_SC_GET \
> > >         | MEMPOOL_F_POOL_CREATED \
> > >         | MEMPOOL_F_NO_IOVA_CONTIG \
> > > +       | MEMPOOL_F_NON_IO \
> >
> > I wonder why CREATED and NON_IO should be listed here:
> > they are not supposed to be passed by the user,
> > which is what MEMPOOL_KNOWN_FLAGS is used for.
> > The same question stands for the test code.
> > Could you confirm your suggestion?
> 
> There was no distinction in the API for valid flags so far, and indeed
> I did not pay attention to MEMPOOL_F_POOL_CREATED and its internal
> aspect.
> (That's the problem when mixing stuff together)
> 
> We could separate internal and exposed flags in different fields, but
> it seems overkill.
> It would be seen as an API change too, if an application were checking
> for this flag.
> So let's keep this as is.
> 
> As you suggest, we should exclude those internal flags from
> KNOWN_FLAGS (probably rename it too), and we will have to export this

I suggest RTE_MEMPOOL_VALID_USER_FLAGS for the name
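
E.g. something along these lines (a sketch; the exact flag list is to be
settled in that patch):

	#define RTE_MEMPOOL_VALID_USER_FLAGS (MEMPOOL_F_NO_SPREAD \
		| MEMPOOL_F_NO_CACHE_ALIGN \
		| MEMPOOL_F_SP_PUT \
		| MEMPOOL_F_SC_GET \
		| MEMPOOL_F_NO_IOVA_CONTIG)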

> define for the unit test since the check had been written with
> contiguous valid flags in mind.
> If your new flag is internal only, I agree we must skip it.
> 
> I'll prepare a patch for mempool.
> 
> -- 
> David Marchand
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks
  2021-10-15 12:12         ` Olivier Matz
@ 2021-10-15 13:07           ` Dmitry Kozlyuk
  2021-10-15 13:40             ` Olivier Matz
  0 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15 13:07 UTC (permalink / raw)
  To: Olivier Matz
  Cc: dev, Andrew Rybchenko, Matan Azrad, Ray Kinsella, Anatoly Burakov

[...]
> > @@ -360,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp,
> char *vaddr,
> >       STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
> >       mp->nb_mem_chunks++;
> >
> > +     /* Report the mempool as ready only when fully populated. */
> > +     if (mp->populated_size >= mp->size)
> > +             mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY,
> mp);
> > +
> 
> One small comment here. I think it does not happen today, but in the
> future, something that could happen is:
>   - create empty mempool
>   - populate mempool
>   - use mempool
>   - populate mempool with more objects
>   - use mempool
> 
> I've seen one usage there: https://www.youtube.com/watch?v=SzQFn9tm4Sw
> 
> In that case, it would require a POPULATE event instead of a
> MEMPOOL_CREATE event.

That's a troublesome case.
Technically there could be a single POPULATE event invoked on each population,
even an incomplete one, and the callback can check populated_size vs size.
However, I'd keep the READY event indeed, because it's the common case,
and the first consumer of the feature, mlx5 PMD, can handle corner cases.
It is also a little more efficient. POPULATE can be added later.
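
For example, a consumer's handler could then do something like this (a sketch;
RTE_MEMPOOL_EVENT_POPULATE and the helper below are hypothetical):

	if (event == RTE_MEMPOOL_EVENT_POPULATE &&
	    mp->populated_size >= mp->size)
		handle_fully_populated_pool(mp);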

> Enhancing the documentation to better explain when the callback is
> invoked looks enough to me for the moment.

Per Andrew's suggestion, it will say:
"Occurs after a mempool is fully populated".
I hope this is clear enough.

> [...]
> > +static void
> > +mempool_event_callback_invoke(enum rte_mempool_event event,
> > +                           struct rte_mempool *mp)
> > +{
> > +     struct mempool_callback_list *list;
> > +     struct rte_tailq_entry *te;
> > +     void *tmp_te;
> > +
> > +     rte_mcfg_tailq_read_lock();
> > +     list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
> > +     RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
> > +             struct mempool_callback *cb = te->data;
> > +             rte_mcfg_tailq_read_unlock();
> > +             cb->func(event, mp, cb->user_data);
> > +             rte_mcfg_tailq_read_lock();
> 
> I think it is dangerous to unlock the list before invoking the callback.
> During that time, another thread can remove the next mempool callback, and
> the next iteration will access a freed element, causing undefined
> behavior.
> 
> Is it a problem to keep the lock held during the callback invocation?
> 
> I see that you have a test for this, and that you wrote a comment in the
> documentation:
> 
>  * rte_mempool_event_callback_register() may be called from within the
> callback,
>  * but the callbacks registered this way will not be invoked for the same
> event.
>  * rte_mempool_event_callback_unregister() may only be safely called
>  * to remove the running callback.
> 
> But is there a use-case for this?
> If not, I'd tend to say that we can document that it is not allowed to
> create, free or list mempools, or to register a callback, from the callback.

There is no use-case, but I'd argue for releasing the lock.
This lock is taken by rte_xxx_create() functions in many libraries,
so the restriction is wider and, worse, it is not strictly limited.

> [...]
> > +int
> > +rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
> > +                                   void *user_data)
> > +{
> > +     struct mempool_callback_list *list;
> > +     struct rte_tailq_entry *te = NULL;
> > +     struct mempool_callback *cb;
> > +     int ret;
> > +
> > +     if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
> > +             rte_errno = EPERM;
> > +             return -1;
> > +     }
> 
> The help of the register function says
>  * Callbacks will be invoked in the process that creates the mempool.

BTW, this is another bug, it should be "populates", not "creates".

> So registration is allowed from a primary or secondary process. Can't a
> secondary process destroy the callback it has loaded?
> 
> > +
> > +     rte_mcfg_mempool_read_lock();
> > +     rte_mcfg_tailq_write_lock();
> 
> I don't understand why there are 2 locks here.
> 
> After looking at the code, I think the locking model is already
> incorrect in current mempool code:
> 
>    rte_mcfg_tailq_write_lock() is used in create and free to protect the
>      access to the mempool tailq
> 
>    rte_mcfg_mempool_write_lock() is used in create(), to protect from
>      concurrent creation (with same name for instance), but I doubt it
>      is absolutely needed, because memzone_reserve is already protected.
> 
>    rte_mcfg_mempool_read_lock() is used in dump functions, but to me
>      it should use rte_mcfg_tailq_read_lock() instead.
>      Currently, doing a dump and a free concurrently can cause a crash
>      because they are not using the same lock.
> 
> In your case, I suggest to use only one lock to protect the callback
> list. I think it can be rte_mcfg_tailq_*_lock().

Thanks, I will double-check the locking.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag Dmitry Kozlyuk
  2021-10-15  9:01         ` Andrew Rybchenko
  2021-10-15  9:25         ` David Marchand
@ 2021-10-15 13:19         ` Olivier Matz
  2021-10-15 13:27           ` Dmitry Kozlyuk
  2 siblings, 1 reply; 82+ messages in thread
From: Olivier Matz @ 2021-10-15 13:19 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Andrew Rybchenko, Matan Azrad

Hi Dmitry,

On Wed, Oct 13, 2021 at 02:01:29PM +0300, Dmitry Kozlyuk wrote:
> Mempool is a generic allocator that is not necessarily used for device
> IO operations and its memory for DMA. Add MEMPOOL_F_NON_IO flag to mark
> such mempools automatically if their objects are not contiguous
> or IOVA are not available. Components can inspect this flag
> in order to optimize their memory management.
> Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>
> ---
>  app/test/test_mempool.c                | 76 ++++++++++++++++++++++++++
>  doc/guides/rel_notes/release_21_11.rst |  3 +
>  lib/mempool/rte_mempool.c              |  2 +
>  lib/mempool/rte_mempool.h              |  5 ++
>  4 files changed, 86 insertions(+)
> 
> diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
> index bc0cc9ed48..15146dd737 100644
> --- a/app/test/test_mempool.c
> +++ b/app/test/test_mempool.c
> @@ -672,6 +672,74 @@ test_mempool_events_safety(void)
>  	return 0;
>  }
>  
> +static int
> +test_mempool_flag_non_io_set_when_no_iova_contig_set(void)
> +{
> +	struct rte_mempool *mp;
> +	int ret;
> +
> +	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
> +				      MEMPOOL_ELT_SIZE, 0, 0,
> +				      SOCKET_ID_ANY, MEMPOOL_F_NO_IOVA_CONTIG);
> +	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
> +				 rte_strerror(rte_errno));
> +	rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(), NULL);
> +	ret = rte_mempool_populate_default(mp);
> +	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
> +			rte_strerror(rte_errno));
> +	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
> +			"NON_IO flag is not set when NO_IOVA_CONTIG is set");
> +	rte_mempool_free(mp);
> +	return 0;
> +}

One comment that also applies to the previous patch. Using
RTE_TEST_ASSERT_*() is convenient to test a condition, display an error
message and return on error in one operation. But here it can cause a
leak on test failure.

I don't know what is the best approach to solve the issue. Having
equivalent test macros that do "goto fail" instead of "return -1" would
help here. I mean something like:
  RTE_TEST_ASSERT_GOTO_*(cond, label, fmt, ...)

What do you think?
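
Something like this, maybe (a rough sketch of the idea only; the name and
the message format are not settled):

	#define RTE_TEST_ASSERT_GOTO(cond, label, msg, ...) do {           \
		if (!(cond)) {                                             \
			printf("Test assert %s line %d failed: " msg "\n", \
			       __func__, __LINE__, ##__VA_ARGS__);         \
			goto label;                                        \
		}                                                          \
	} while (0)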

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15 13:19         ` Olivier Matz
@ 2021-10-15 13:27           ` Dmitry Kozlyuk
  2021-10-15 13:43             ` Olivier Matz
  0 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15 13:27 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, Andrew Rybchenko, Matan Azrad

> [...]
> > +static int
> > +test_mempool_flag_non_io_set_when_no_iova_contig_set(void)
> > +{
> > +     struct rte_mempool *mp;
> > +     int ret;
> > +
> > +     mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
> > +                                   MEMPOOL_ELT_SIZE, 0, 0,
> > +                                   SOCKET_ID_ANY,
> MEMPOOL_F_NO_IOVA_CONTIG);
> > +     RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
> > +                              rte_strerror(rte_errno));
> > +     rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(), NULL);
> > +     ret = rte_mempool_populate_default(mp);
> > +     RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
> > +                     rte_strerror(rte_errno));
> > +     RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
> > +                     "NON_IO flag is not set when NO_IOVA_CONTIG is
> set");
> > +     rte_mempool_free(mp);
> > +     return 0;
> > +}
> 
> One comment that also applies to the previous patch. Using
> RTE_TEST_ASSERT_*() is convenient to test a condition, display an error
> message and return on error in one operation. But here it can cause a
> leak on test failure.
> 
> I don't know what is the best approach to solve the issue. Having
> equivalent test macros that do "goto fail" instead of "return -1" would
> help here. I mean something like:
>   RTE_TEST_ASSERT_GOTO_*(cond, label, fmt, ...)
> 
> What do you think?

This can work with existing macros:

	#define TEST_TRACE_FAILURE(...) goto fail

Because of "trace" in the name it looks a bit like a hijacking.
Probably the macro should be named TEST_HANDLE_FAILURE
to suggest broader usage than just tracing,
but for now this looks like the neatest way.
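
E.g. wrapped around a test body roughly like this (sketch, with the full
RTE_TEST_TRACE_FAILURE name):

	#pragma push_macro("RTE_TEST_TRACE_FAILURE")
	#undef RTE_TEST_TRACE_FAILURE
	#define RTE_TEST_TRACE_FAILURE(...) do { goto fail; } while (0)

		/* ... test body with RTE_TEST_ASSERT_*() checks ... */
		return TEST_SUCCESS;
	fail:
		rte_mempool_free(mp);
		return TEST_FAILED;

	#pragma pop_macro("RTE_TEST_TRACE_FAILURE")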

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks
  2021-10-15 13:07           ` Dmitry Kozlyuk
@ 2021-10-15 13:40             ` Olivier Matz
  0 siblings, 0 replies; 82+ messages in thread
From: Olivier Matz @ 2021-10-15 13:40 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Andrew Rybchenko, Matan Azrad, Ray Kinsella, Anatoly Burakov

On Fri, Oct 15, 2021 at 01:07:42PM +0000, Dmitry Kozlyuk wrote:
[...]
> > > +static void
> > > +mempool_event_callback_invoke(enum rte_mempool_event event,
> > > +                           struct rte_mempool *mp)
> > > +{
> > > +     struct mempool_callback_list *list;
> > > +     struct rte_tailq_entry *te;
> > > +     void *tmp_te;
> > > +
> > > +     rte_mcfg_tailq_read_lock();
> > > +     list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
> > > +     RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
> > > +             struct mempool_callback *cb = te->data;
> > > +             rte_mcfg_tailq_read_unlock();
> > > +             cb->func(event, mp, cb->user_data);
> > > +             rte_mcfg_tailq_read_lock();
> > 
> > I think it is dangerous to unlock the list before invoking the callback.
> > During that time, another thread can remove the next mempool callback, and
> > the next iteration will access a freed element, causing undefined
> > behavior.
> > 
> > Is it a problem to keep the lock held during the callback invocation?
> > 
> > I see that you have a test for this, and that you wrote a comment in the
> > documentation:
> > 
> >  * rte_mempool_event_callback_register() may be called from within the
> > callback,
> >  * but the callbacks registered this way will not be invoked for the same
> > event.
> >  * rte_mempool_event_callback_unregister() may only be safely called
> >  * to remove the running callback.
> > 
> > But is there a use-case for this?
> > If not, I'd tend to say that we can document that it is not allowed to
> > create, free or list mempools, or to register a callback, from the callback.
> 
> There is no use-case, but I'd argue for releasing the lock.
> This lock is taken by rte_xxx_create() functions in many libraries,
> so the restriction is wider and, worse, it is not strictly limited.

Yes... I honestly don't understand why every library uses the same
lock rte_mcfg_tailq if the only code that accesses the list is in the
library itself. Maybe I'm missing something.

I have the impression that having only one mempool lock for all usages in
mempool would be simpler and more specific. It would allow keeping the lock held
while invoking the callbacks without blocking access to the other libs, and
would also solve the problem described below.



> > [...]
> > > +int
> > > +rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
> > > +                                   void *user_data)
> > > +{
> > > +     struct mempool_callback_list *list;
> > > +     struct rte_tailq_entry *te = NULL;
> > > +     struct mempool_callback *cb;
> > > +     int ret;
> > > +
> > > +     if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
> > > +             rte_errno = EPERM;
> > > +             return -1;
> > > +     }
> > 
> > The help of the register function says
> >  * Callbacks will be invoked in the process that creates the mempool.
> 
> BTW, this is another bug, it should be "populates", not "creates".
> 
> > So registration is allowed from a primary or secondary process. Can't a
> > secondary process destroy the callback it has loaded?
> > 
> > > +
> > > +     rte_mcfg_mempool_read_lock();
> > > +     rte_mcfg_tailq_write_lock();
> > 
> > I don't understand why there are 2 locks here.
> > 
> > After looking at the code, I think the locking model is already
> > incorrect in current mempool code:
> > 
> >    rte_mcfg_tailq_write_lock() is used in create and free to protect the
> >      access to the mempool tailq
> > 
> >    rte_mcfg_mempool_write_lock() is used in create(), to protect from
> >      concurrent creation (with same name for instance), but I doubt it
> >      is absolutely needed, because memzone_reserve is already protected.
> > 
> >    rte_mcfg_mempool_read_lock() is used in dump functions, but to me
> >      it should use rte_mcfg_tailq_read_lock() instead.
> >      Currently, doing a dump and a free concurrently can cause a crash
> >      because they are not using the same lock.
> > 
> > In your case, I suggest to use only one lock to protect the callback
> > list. I think it can be rte_mcfg_tailq_*_lock().
> 
> Thanks, I will double-check the locking.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15 13:27           ` Dmitry Kozlyuk
@ 2021-10-15 13:43             ` Olivier Matz
  2021-10-19 13:08               ` Dmitry Kozlyuk
  0 siblings, 1 reply; 82+ messages in thread
From: Olivier Matz @ 2021-10-15 13:43 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Andrew Rybchenko, Matan Azrad

On Fri, Oct 15, 2021 at 01:27:59PM +0000, Dmitry Kozlyuk wrote:
> > [...]
> > > +static int
> > > +test_mempool_flag_non_io_set_when_no_iova_contig_set(void)
> > > +{
> > > +     struct rte_mempool *mp;
> > > +     int ret;
> > > +
> > > +     mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
> > > +                                   MEMPOOL_ELT_SIZE, 0, 0,
> > > +                                   SOCKET_ID_ANY,
> > MEMPOOL_F_NO_IOVA_CONTIG);
> > > +     RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
> > > +                              rte_strerror(rte_errno));
> > > +     rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(), NULL);
> > > +     ret = rte_mempool_populate_default(mp);
> > > +     RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
> > > +                     rte_strerror(rte_errno));
> > > +     RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
> > > +                     "NON_IO flag is not set when NO_IOVA_CONTIG is
> > set");
> > > +     rte_mempool_free(mp);
> > > +     return 0;
> > > +}
> > 
> > One comment that also applies to the previous patch. Using
> > RTE_TEST_ASSERT_*() is convenient to test a condition, display an error
> > message and return on error in one operation. But here it can cause a
> > leak on test failure.
> > 
> > I don't know what is the best approach to solve the issue. Having
> > equivalent test macros that do "goto fail" instead of "return -1" would
> > help here. I mean something like:
> >   RTE_TEST_ASSERT_GOTO_*(cond, label, fmt, ...)
> > 
> > What do you think?
> 
> This can work with existing macros:
> 
> 	#define TEST_TRACE_FAILURE(...) goto fail
> 
> Because of "trace" in the name it looks a bit like a hijacking.
> Probably the macro should be named TEST_HANDLE_FAILURE
> to suggest broader usage than just tracing,
> but for now this looks like the neatest way.

That would work for me.
What about introducing another macro for this usage, that would
be "return -1" by default and that can be overridden?
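
Something in this direction, maybe (a sketch; the macro name is only an
example):

	/* default in the test header, overridable by a test file */
	#ifndef RTE_TEST_FAILURE_ACTION
	#define RTE_TEST_FAILURE_ACTION() return -1
	#endif

RTE_TEST_ASSERT() would then end with RTE_TEST_FAILURE_ACTION() instead of a
hard-coded "return -1", and a test that needs cleanup could redefine it to
"goto fail".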

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v5 0/4] net/mlx5: implicit mempool registration
  2021-10-13 11:01     ` [dpdk-dev] [PATCH v4 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                         ` (3 preceding siblings ...)
  2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
@ 2021-10-15 16:02       ` Dmitry Kozlyuk
  2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 1/4] mempool: add event callbacks Dmitry Kozlyuk
                           ` (4 more replies)
  4 siblings, 5 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15 16:02 UTC (permalink / raw)
  To: dev

MLX5 hardware has its internal IOMMU where PMD registers the memory.
On the data path, PMD translates VA into a key consumed by the device
IOMMU.  It is impractical for the PMD to register all allocated memory
because of increased lookup cost both in HW and SW.  Most often mbuf
memory comes from mempools, so if PMD tracks them, it can almost always
have mbuf memory registered before an mbuf hits the PMD. This patchset
adds such tracking in the PMD and internal API to support it.

Please see [1] for the discussion of the patch 2/4
and how it can be useful outside of the MLX5 PMD.

[1]: http://inbox.dpdk.org/dev/CH0PR12MB509112FADB778AB28AF3771DB9F99@CH0PR12MB5091.namprd12.prod.outlook.com/

v5:
    1. Change non-IO flag inference + various fixes (Andrew).
    2. Fix callback unregistration from secondary processes (Olivier).
    3. Support non-IO flag in proc-info dump (David).
    4. Fix the usage of locks (Olivier).
    5. Avoid resource leaks in unit test (Olivier).
v4: (Andrew)
    1. Improve mempool event callbacks unit tests and documentation.
    2. Make MEMPOOL_F_NON_IO internal and automatically inferred.
       Add unit tests for the inference logic.
v3: Improve wording and naming; fix typos (Thomas).
v2 (internal review and testing):
    1. Change tracked mempool event from being created (CREATE) to being
       fully populated (READY), which is the state PMD is interested in.
    2. Unit test the new mempool callback API.
    3. Remove bogus "error" messages in normal conditions.
    4. Fixes in PMD.


Dmitry Kozlyuk (4):
  mempool: add event callbacks
  mempool: add non-IO flag
  common/mlx5: add mempool registration facilities
  net/mlx5: support mempool registration

 app/proc-info/main.c                   |   4 +-
 app/test/test_mempool.c                | 360 +++++++++++++++
 doc/guides/nics/mlx5.rst               |  13 +
 doc/guides/rel_notes/release_21_11.rst |   9 +
 drivers/common/mlx5/mlx5_common_mp.c   |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h   |  14 +
 drivers/common/mlx5/mlx5_common_mr.c   | 580 +++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h   |  17 +
 drivers/common/mlx5/version.map        |   5 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 ++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++
 drivers/net/mlx5/mlx5.h                |  10 +
 drivers/net/mlx5/mlx5_mr.c             | 120 ++---
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 +-
 drivers/net/mlx5/mlx5_rxq.c            |  13 +
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++-
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 lib/mempool/rte_mempool.c              | 134 ++++++
 lib/mempool/rte_mempool.h              |  64 +++
 lib/mempool/version.map                |   8 +
 22 files changed, 1585 insertions(+), 117 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v5 1/4] mempool: add event callbacks
  2021-10-15 16:02       ` [dpdk-dev] [PATCH v5 0/4] net/mlx5: implicit " Dmitry Kozlyuk
@ 2021-10-15 16:02         ` Dmitry Kozlyuk
  2021-10-20  9:29           ` Kinsella, Ray
  2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 2/4] mempool: add non-IO flag Dmitry Kozlyuk
                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15 16:02 UTC (permalink / raw)
  To: dev
  Cc: David Marchand, Matan Azrad, Andrew Rybchenko, Olivier Matz,
	Ray Kinsella

Data path performance can benefit if the PMD knows which memory it will
need to handle in advance, before the first mbuf is sent to the PMD.
It is impractical, however, to consider all allocated memory for this
purpose. Most often mbuf memory comes from mempools that can come and
go. PMD can enumerate existing mempools on device start, but it also
needs to track creation and destruction of mempools after the forwarding
starts but before an mbuf from the new mempool is sent to the device.

Add an API to register callback for mempool life cycle events:
* rte_mempool_event_callback_register()
* rte_mempool_event_callback_unregister()
Currently tracked events are:
* RTE_MEMPOOL_EVENT_READY (after populating a mempool)
* RTE_MEMPOOL_EVENT_DESTROY (before freeing a mempool)
Provide a unit test for the new API.
The new API is internal, because it is primarily demanded by PMDs that
may need to deal with any mempools and do not control their creation,
while an application, on the other hand, knows which mempools it creates
and doesn't care about internal mempools PMDs might create.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/test/test_mempool.c   | 248 ++++++++++++++++++++++++++++++++++++++
 lib/mempool/rte_mempool.c | 124 +++++++++++++++++++
 lib/mempool/rte_mempool.h |  62 ++++++++++
 lib/mempool/version.map   |   8 ++
 4 files changed, 442 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index 66bc8d86b7..c39c83256e 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -14,6 +14,7 @@
 #include <rte_common.h>
 #include <rte_log.h>
 #include <rte_debug.h>
+#include <rte_errno.h>
 #include <rte_memory.h>
 #include <rte_launch.h>
 #include <rte_cycles.h>
@@ -489,6 +490,245 @@ test_mp_mem_init(struct rte_mempool *mp,
 	data->ret = 0;
 }
 
+struct test_mempool_events_data {
+	struct rte_mempool *mp;
+	enum rte_mempool_event event;
+	bool invoked;
+};
+
+static void
+test_mempool_events_cb(enum rte_mempool_event event,
+		       struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_data *data = user_data;
+
+	data->mp = mp;
+	data->event = event;
+	data->invoked = true;
+}
+
+static int
+test_mempool_events(int (*populate)(struct rte_mempool *mp))
+{
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { goto fail; } while (0)
+
+	static const size_t CB_NUM = 3;
+	static const size_t MP_NUM = 2;
+
+	struct test_mempool_events_data data[CB_NUM];
+	struct rte_mempool *mp[MP_NUM], *freed;
+	char name[RTE_MEMPOOL_NAMESIZE];
+	size_t i, j;
+	int ret;
+
+	memset(mp, 0, sizeof(mp));
+	for (i = 0; i < CB_NUM; i++) {
+		ret = rte_mempool_event_callback_register
+				(test_mempool_events_cb, &data[i]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to register the callback %zu: %s",
+				      i, rte_strerror(rte_errno));
+	}
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb, mp);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback");
+	/* NULL argument has no special meaning in this API. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    NULL);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback with NULL argument");
+
+	/* Create mempool 0 that will be observed by all callbacks. */
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty0");
+	mp[0] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[0], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	for (j = 0; j < CB_NUM; j++)
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, false,
+				      "Callback %zu invoked on %s mempool creation",
+				      j, name);
+
+	rte_mempool_set_ops_byname(mp[0], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[0]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(rte_errno));
+	for (j = 0; j < CB_NUM; j++) {
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					"Callback %zu not invoked on mempool %s population",
+					j, name);
+		RTE_TEST_ASSERT_EQUAL(data[j].event,
+					RTE_MEMPOOL_EVENT_READY,
+					"Wrong callback invoked, expected READY");
+		RTE_TEST_ASSERT_EQUAL(data[j].mp, mp[0],
+					"Callback %zu invoked for a wrong mempool instead of %s",
+					j, name);
+	}
+
+	/* Check that unregistered callback 0 observes no events. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    &data[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister callback 0: %s",
+			      rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty1");
+	mp[1] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[1], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp[1], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[1]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[1]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+			      "Unregistered callback 0 invoked on %s mempool population",
+			      name);
+
+	for (i = 0; i < MP_NUM; i++) {
+		memset(&data, 0, sizeof(data));
+		sprintf(name, "empty%zu", i);
+		rte_mempool_free(mp[i]);
+		/*
+		 * Save pointer to check that it was passed to the callback,
+		 * but put NULL into the array in case cleanup is called early.
+		 */
+		freed = mp[i];
+		mp[i] = NULL;
+		for (j = 1; j < CB_NUM; j++) {
+			RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					      "Callback %zu not invoked on mempool %s destruction",
+					      j, name);
+			RTE_TEST_ASSERT_EQUAL(data[j].event,
+					      RTE_MEMPOOL_EVENT_DESTROY,
+					      "Wrong callback invoked, expected DESTROY");
+			RTE_TEST_ASSERT_EQUAL(data[j].mp, freed,
+					      "Callback %zu invoked for a wrong mempool instead of %s",
+					      j, name);
+		}
+		RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+				      "Unregistered callback 0 invoked on %s mempool destruction",
+				      name);
+	}
+
+	for (j = 1; j < CB_NUM; j++) {
+		ret = rte_mempool_event_callback_unregister
+					(test_mempool_events_cb, &data[j]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister the callback %zu: %s",
+				      j, rte_strerror(rte_errno));
+	}
+	return TEST_SUCCESS;
+
+fail:
+	for (j = 0; j < CB_NUM; j++)
+		rte_mempool_event_callback_unregister
+					(test_mempool_events_cb, &data[j]);
+	for (i = 0; i < MP_NUM; i++)
+		rte_mempool_free(mp[i]);
+	return TEST_FAILED;
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+}
+
+struct test_mempool_events_safety_data {
+	bool invoked;
+	int (*api_func)(rte_mempool_event_callback *func, void *user_data);
+	rte_mempool_event_callback *cb_func;
+	void *cb_user_data;
+	int ret;
+};
+
+static void
+test_mempool_events_safety_cb(enum rte_mempool_event event,
+			      struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_safety_data *data = user_data;
+
+	RTE_SET_USED(event);
+	RTE_SET_USED(mp);
+	data->invoked = true;
+	data->ret = data->api_func(data->cb_func, data->cb_user_data);
+}
+
+static int
+test_mempool_events_safety(void)
+{
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { \
+		ret = TEST_FAILED; \
+		goto exit; \
+	} while (0)
+
+	struct test_mempool_events_data data;
+	struct test_mempool_events_safety_data sdata[2];
+	struct rte_mempool *mp;
+	size_t i;
+	int ret;
+
+	/* removes itself */
+	sdata[0].api_func = rte_mempool_event_callback_unregister;
+	sdata[0].cb_func = test_mempool_events_safety_cb;
+	sdata[0].cb_user_data = &sdata[0];
+	sdata[0].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[0]);
+	/* inserts a callback after itself */
+	sdata[1].api_func = rte_mempool_event_callback_register;
+	sdata[1].cb_func = test_mempool_events_cb;
+	sdata[1].cb_user_data = &data;
+	sdata[1].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[1]);
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(rte_errno));
+
+	RTE_TEST_ASSERT_EQUAL(sdata[0].ret, 0, "Callback failed to unregister itself: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, 0, "Failed to insert a new callback: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data.invoked, false,
+			      "Inserted callback is invoked on mempool population");
+
+	memset(&data, 0, sizeof(data));
+	sdata[0].invoked = false;
+	rte_mempool_free(mp);
+	mp = NULL;
+	RTE_TEST_ASSERT_EQUAL(sdata[0].invoked, false,
+			      "Callback that unregistered itself was called");
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, -EEXIST,
+			      "New callback inserted twice");
+	RTE_TEST_ASSERT_EQUAL(data.invoked, true,
+			      "Inserted callback is not invoked on mempool destruction");
+
+	rte_mempool_event_callback_unregister(test_mempool_events_cb, &data);
+	for (i = 0; i < RTE_DIM(sdata); i++)
+		rte_mempool_event_callback_unregister
+				(test_mempool_events_safety_cb, &sdata[i]);
+	ret = TEST_SUCCESS;
+
+exit:
+	/* cleanup, don't care which callbacks are already removed */
+	rte_mempool_event_callback_unregister(test_mempool_events_cb, &data);
+	for (i = 0; i < RTE_DIM(sdata); i++)
+		rte_mempool_event_callback_unregister
+				(test_mempool_events_safety_cb, &sdata[i]);
+	/* in case of failure before the planned destruction */
+	rte_mempool_free(mp);
+	return ret;
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+}
+
 static int
 test_mempool(void)
 {
@@ -666,6 +906,14 @@ test_mempool(void)
 	if (test_mempool_basic(default_pool, 1) < 0)
 		GOTO_ERR(ret, err);
 
+	/* test mempool event callbacks */
+	if (test_mempool_events(rte_mempool_populate_default) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events(rte_mempool_populate_anon) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events_safety() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 607419ccaf..8810d08ab5 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -42,6 +42,18 @@ static struct rte_tailq_elem rte_mempool_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_mempool_tailq)
 
+TAILQ_HEAD(mempool_callback_list, rte_tailq_entry);
+
+static struct rte_tailq_elem callback_tailq = {
+	.name = "RTE_MEMPOOL_CALLBACK",
+};
+EAL_REGISTER_TAILQ(callback_tailq)
+
+/* Invoke all registered mempool event callbacks. */
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp);
+
 #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
 #define CALC_CACHE_FLUSHTHRESH(c)	\
 	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
@@ -360,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* Report the mempool as ready only when fully populated. */
+	if (mp->populated_size >= mp->size)
+		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
+
 	rte_mempool_trace_populate_iova(mp, vaddr, iova, len, free_cb, opaque);
 	return i;
 
@@ -722,6 +738,7 @@ rte_mempool_free(struct rte_mempool *mp)
 	}
 	rte_mcfg_tailq_write_unlock();
 
+	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
 	rte_mempool_trace_free(mp);
 	rte_mempool_free_memchunks(mp);
 	rte_mempool_ops_free(mp);
@@ -1356,3 +1373,110 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
 
 	rte_mcfg_mempool_read_unlock();
 }
+
+struct mempool_callback_data {
+	rte_mempool_event_callback *func;
+	void *user_data;
+};
+
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te;
+	void *tmp_te;
+
+	rte_mcfg_tailq_read_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback_data *cb = te->data;
+		rte_mcfg_tailq_read_unlock();
+		cb->func(event, mp, cb->user_data);
+		rte_mcfg_tailq_read_lock();
+	}
+	rte_mcfg_tailq_read_unlock();
+}
+
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback_data *cb;
+	void *tmp_te;
+	int ret;
+
+	if (func == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	rte_mcfg_tailq_write_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		cb = te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			ret = -EEXIST;
+			goto exit;
+		}
+	}
+
+	te = rte_zmalloc("mempool_cb_tail_entry", sizeof(*te), 0);
+	if (te == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback tailq entry!\n");
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb = rte_malloc("mempool_cb_data", sizeof(*cb), 0);
+	if (cb == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback!\n");
+		rte_free(te);
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb->func = func;
+	cb->user_data = user_data;
+	te->data = cb;
+	TAILQ_INSERT_TAIL(list, te, next);
+	ret = 0;
+
+exit:
+	rte_mcfg_tailq_write_unlock();
+	rte_errno = -ret;
+	return ret;
+}
+
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback_data *cb;
+	int ret = -ENOENT;
+
+	rte_mcfg_tailq_write_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH(te, list, next) {
+		cb = te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			TAILQ_REMOVE(list, te, next);
+			ret = 0;
+			break;
+		}
+	}
+	rte_mcfg_tailq_write_unlock();
+
+	if (ret == 0) {
+		rte_free(te);
+		rte_free(cb);
+	}
+	rte_errno = -ret;
+	return ret;
+}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 88bcbc51ef..3285626712 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1769,6 +1769,68 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
 int
 rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
 
+/**
+ * @internal
+ * Mempool event type.
+ */
+enum rte_mempool_event {
+	/** Occurs after a mempool is fully populated. */
+	RTE_MEMPOOL_EVENT_READY = 0,
+	/** Occurs before the destruction of a mempool begins. */
+	RTE_MEMPOOL_EVENT_DESTROY = 1,
+};
+
+/**
+ * @internal
+ * Mempool event callback.
+ *
+ * rte_mempool_event_callback_register() may be called from within the callback,
+ * but the callbacks registered this way will not be invoked for the same event.
+ * rte_mempool_event_callback_unregister() may only be safely called
+ * to remove the running callback.
+ */
+typedef void (rte_mempool_event_callback)(
+		enum rte_mempool_event event,
+		struct rte_mempool *mp,
+		void *user_data);
+
+/**
+ * @internal
+ * Register a callback function invoked on mempool life cycle event.
+ * The function will be invoked in the process
+ * that performs an action which triggers the callback.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data);
+
+/**
+ * @internal
+ * Unregister a callback added with rte_mempool_event_callback_register().
+ * @p func and @p user_data must exactly match registration parameters.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/mempool/version.map b/lib/mempool/version.map
index 9f77da6fff..1b7d7c5456 100644
--- a/lib/mempool/version.map
+++ b/lib/mempool/version.map
@@ -64,3 +64,11 @@ EXPERIMENTAL {
 	__rte_mempool_trace_ops_free;
 	__rte_mempool_trace_set_ops_byname;
 };
+
+INTERNAL {
+	global:
+
+	# added in 21.11
+	rte_mempool_event_callback_register;
+	rte_mempool_event_callback_unregister;
+};
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v5 2/4] mempool: add non-IO flag
  2021-10-15 16:02       ` [dpdk-dev] [PATCH v5 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-10-15 16:02         ` Dmitry Kozlyuk
  2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15 16:02 UTC (permalink / raw)
  To: dev
  Cc: Matan Azrad, Andrew Rybchenko, Maryam Tahhan, Reshma Pattan,
	Olivier Matz

Mempool is a generic allocator that is not necessarily used
for device IO operations and its memory for DMA.
Add MEMPOOL_F_NON_IO flag to mark such mempools automatically
a) if their objects are not contiguous;
b) if IOVA is not available for any object.
Other components can inspect this flag
in order to optimize their memory management.

Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/proc-info/main.c                   |   4 +-
 app/test/test_mempool.c                | 112 +++++++++++++++++++++++++
 doc/guides/rel_notes/release_21_11.rst |   3 +
 lib/mempool/rte_mempool.c              |  10 +++
 lib/mempool/rte_mempool.h              |   2 +
 5 files changed, 130 insertions(+), 1 deletion(-)

diff --git a/app/proc-info/main.c b/app/proc-info/main.c
index a8e928fa9f..6054cb3d88 100644
--- a/app/proc-info/main.c
+++ b/app/proc-info/main.c
@@ -1296,6 +1296,7 @@ show_mempool(char *name)
 				"\t  -- SP put (%c), SC get (%c)\n"
 				"\t  -- Pool created (%c)\n"
 				"\t  -- No IOVA config (%c)\n",
+				"\t  -- Not used for IO (%c)\n",
 				ptr->name,
 				ptr->socket_id,
 				(flags & MEMPOOL_F_NO_SPREAD) ? 'y' : 'n',
@@ -1303,7 +1304,8 @@ show_mempool(char *name)
 				(flags & MEMPOOL_F_SP_PUT) ? 'y' : 'n',
 				(flags & MEMPOOL_F_SC_GET) ? 'y' : 'n',
 				(flags & MEMPOOL_F_POOL_CREATED) ? 'y' : 'n',
-				(flags & MEMPOOL_F_NO_IOVA_CONTIG) ? 'y' : 'n');
+				(flags & MEMPOOL_F_NO_IOVA_CONTIG) ? 'y' : 'n',
+				(flags & MEMPOOL_F_NON_IO) ? 'y' : 'n');
 			printf("  - Size %u Cache %u element %u\n"
 				"  - header %u trailer %u\n"
 				"  - private data size %u\n",
diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index c39c83256e..caf9c46a29 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -12,6 +12,7 @@
 #include <sys/queue.h>
 
 #include <rte_common.h>
+#include <rte_eal_paging.h>
 #include <rte_log.h>
 #include <rte_debug.h>
 #include <rte_errno.h>
@@ -729,6 +730,109 @@ test_mempool_events_safety(void)
 #pragma pop_macro("RTE_TEST_TRACE_FAILURE")
 }
 
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { \
+		ret = TEST_FAILED; \
+		goto exit; \
+	} while (0)
+
+static int
+test_mempool_flag_non_io_set_when_no_iova_contig_set(void)
+{
+	struct rte_mempool *mp = NULL;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, MEMPOOL_F_NO_IOVA_CONTIG);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(), NULL);
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when NO_IOVA_CONTIG is set");
+	ret = TEST_SUCCESS;
+exit:
+	rte_mempool_free(mp);
+	return ret;
+}
+
+static int
+test_mempool_flag_non_io_unset_when_populated_with_valid_iova(void)
+{
+	const struct rte_memzone *mz;
+	void *virt;
+	rte_iova_t iova;
+	size_t page_size = RTE_PGSIZE_2M;
+	struct rte_mempool *mp;
+	int ret;
+
+	mz = rte_memzone_reserve("test_mempool", 3 * page_size, SOCKET_ID_ANY,
+				 RTE_MEMZONE_IOVA_CONTIG);
+	RTE_TEST_ASSERT_NOT_NULL(mz, "Cannot allocate memory");
+	virt = mz->addr;
+	iova = rte_mem_virt2iova(virt);
+	RTE_TEST_ASSERT_NOT_EQUAL(iova,  RTE_BAD_IOVA, "Cannot get IOVA");
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+
+	ret = rte_mempool_populate_iova(mp, RTE_PTR_ADD(virt, 1 * page_size),
+					RTE_BAD_IOVA, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when mempool is populated with only RTE_BAD_IOVA");
+
+	ret = rte_mempool_populate_iova(mp, virt, iova, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is not unset when mempool is populated with valid IOVA");
+
+	ret = rte_mempool_populate_iova(mp, RTE_PTR_ADD(virt, 2 * page_size),
+					RTE_BAD_IOVA, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is set even when some objects have valid IOVA");
+	ret = TEST_SUCCESS;
+
+exit:
+	rte_mempool_free(mp);
+	rte_memzone_free(mz);
+	return ret;
+}
+
+static int
+test_mempool_flag_non_io_unset_by_default(void)
+{
+	struct rte_mempool *mp;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is set by default");
+	ret = TEST_SUCCESS;
+exit:
+	rte_mempool_free(mp);
+	return ret;
+}
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+
 static int
 test_mempool(void)
 {
@@ -914,6 +1018,14 @@ test_mempool(void)
 	if (test_mempool_events_safety() < 0)
 		GOTO_ERR(ret, err);
 
+	/* test NON_IO flag inference */
+	if (test_mempool_flag_non_io_unset_by_default() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_set_when_no_iova_contig_set() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_unset_when_populated_with_valid_iova() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 4c56cdfeaa..39a8a3d950 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -229,6 +229,9 @@ API Changes
   the crypto/security operation. This field will be used to communicate
   events such as soft expiry with IPsec in lookaside mode.
 
+* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
+  that objects from this pool will not be used for device IO (e.g. DMA).
+
 
 ABI Changes
 -----------
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 8810d08ab5..7d7d97d85d 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -372,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* At least some objects in the pool can now be used for IO. */
+	if (iova != RTE_BAD_IOVA)
+		mp->flags &= ~MEMPOOL_F_NON_IO;
+
 	/* Report the mempool as ready only when fully populated. */
 	if (mp->populated_size >= mp->size)
 		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
@@ -851,6 +855,12 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 		return NULL;
 	}
 
+	/*
+	 * No objects in the pool can be used for IO until it's populated
+	 * with at least some objects with valid IOVA.
+	 */
+	flags |= MEMPOOL_F_NON_IO;
+
 	/* "no cache align" imply "no spread" */
 	if (flags & MEMPOOL_F_NO_CACHE_ALIGN)
 		flags |= MEMPOOL_F_NO_SPREAD;
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 3285626712..408d916a9c 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -257,6 +257,8 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
+/** Internal: no object from the pool can be used for device IO (DMA). */
+#define MEMPOOL_F_NON_IO         0x0040
 
 /**
  * @internal When debug is enabled, store some statistics.
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v5 3/4] common/mlx5: add mempool registration facilities
  2021-10-15 16:02       ` [dpdk-dev] [PATCH v5 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 1/4] mempool: add event callbacks Dmitry Kozlyuk
  2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 2/4] mempool: add non-IO flag Dmitry Kozlyuk
@ 2021-10-15 16:02         ` Dmitry Kozlyuk
  2021-10-20  9:30           ` Kinsella, Ray
  2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
  2021-10-16 20:00         ` [dpdk-dev] [PATCH v6 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15 16:02 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko, Ray Kinsella, Anatoly Burakov

Add an internal API to register mempools, that is, to create memory
regions (MRs) for their memory and store them in a separate database.
The implementation handles multi-process operation, so that class drivers
don't need to. Each protection domain has its own database. Memory regions
can be shared within a database if they represent a single hugepage
covering one or more mempools entirely.

Add an internal API to look up an MR key for an address that belongs
to a known mempool. It is the responsibility of a class driver
to extract the mempool from an mbuf.
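
For illustration, a class driver could combine both APIs roughly as in
the sketch below. All drv_* names are hypothetical placeholders and not
part of this patch; only mlx5_mr_mempool_register() and
mlx5_mr_mempool2mr_bh() and their signatures come from this series.

#include <rte_mbuf.h>
#include <rte_mempool.h>

#include "mlx5_common_mp.h"
#include "mlx5_common_mr.h"

/* Hypothetical per-device context of a class driver. */
struct drv_priv {
	struct mlx5_mr_share_cache share_cache; /* per-PD MR database */
	void *pd;                               /* protection domain */
};

/* Register a mempool: create MRs for its memory in the PD. */
static int
drv_mempool_register(struct drv_priv *priv, struct rte_mempool *mp)
{
	/* mp_id == NULL: calling from the primary process. */
	return mlx5_mr_mempool_register(&priv->share_cache, priv->pd,
					mp, NULL);
}

/* Slow-path lookup of an lkey for an mbuf from a registered mempool. */
static uint32_t
drv_mb2mr(struct drv_priv *priv, struct mlx5_mr_ctrl *mr_ctrl,
	  struct rte_mbuf *mb)
{
	/* Extracting the mempool from the mbuf is the class driver's job. */
	struct rte_mempool *mp = mb->pool;

	return mlx5_mr_mempool2mr_bh(&priv->share_cache, mr_ctrl, mp,
				     (uintptr_t)mb->buf_addr);
}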

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 drivers/common/mlx5/mlx5_common_mp.c |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h |  14 +
 drivers/common/mlx5/mlx5_common_mr.c | 580 +++++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h |  17 +
 drivers/common/mlx5/version.map      |   5 +
 5 files changed, 666 insertions(+)

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
index 673a7c31de..6dfc5535e0 100644
--- a/drivers/common/mlx5/mlx5_common_mp.c
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -54,6 +54,56 @@ mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
 	return ret;
 }
 
+/**
+ * @param mp_id
+ *   ID of the MP process.
+ * @param share_cache
+ *   Shared MR cache.
+ * @param pd
+ *   Protection domain.
+ * @param mempool
+ *   Mempool to register or unregister.
+ * @param reg
+ *   True to register the mempool, False to unregister.
+ */
+int
+mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_arg_mempool_reg *arg = &req->args.mempool_reg;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	enum mlx5_mp_req_type type;
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	type = reg ? MLX5_MP_REQ_MEMPOOL_REGISTER :
+		     MLX5_MP_REQ_MEMPOOL_UNREGISTER;
+	mp_init_msg(mp_id, &mp_req, type);
+	arg->share_cache = share_cache;
+	arg->pd = pd;
+	arg->mempool = mempool;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	mlx5_free(mp_rep.msgs);
+	return ret;
+}
+
 /**
  * Request Verbs queue state modification to the primary process.
  *
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
index 6829141fc7..527bf3cad8 100644
--- a/drivers/common/mlx5/mlx5_common_mp.h
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -14,6 +14,8 @@
 enum mlx5_mp_req_type {
 	MLX5_MP_REQ_VERBS_CMD_FD = 1,
 	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_MEMPOOL_REGISTER,
+	MLX5_MP_REQ_MEMPOOL_UNREGISTER,
 	MLX5_MP_REQ_START_RXTX,
 	MLX5_MP_REQ_STOP_RXTX,
 	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
@@ -33,6 +35,12 @@ struct mlx5_mp_arg_queue_id {
 	uint16_t queue_id; /* DPDK queue ID. */
 };
 
+struct mlx5_mp_arg_mempool_reg {
+	struct mlx5_mr_share_cache *share_cache;
+	void *pd; /* NULL for MLX5_MP_REQ_MEMPOOL_UNREGISTER */
+	struct rte_mempool *mempool;
+};
+
 /* Parameters for IPC. */
 struct mlx5_mp_param {
 	enum mlx5_mp_req_type type;
@@ -41,6 +49,8 @@ struct mlx5_mp_param {
 	RTE_STD_C11
 	union {
 		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_mempool_reg mempool_reg;
+		/* MLX5_MP_REQ_MEMPOOL_(UN)REGISTER */
 		struct mlx5_mp_arg_queue_state_modify state_modify;
 		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
 		struct mlx5_mp_arg_queue_id queue_id;
@@ -91,6 +101,10 @@ void mlx5_mp_uninit_secondary(const char *name);
 __rte_internal
 int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
 __rte_internal
+int mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg);
+__rte_internal
 int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
 				   struct mlx5_mp_arg_queue_state_modify *sm);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
index 98fe8698e2..2e039a4e70 100644
--- a/drivers/common/mlx5/mlx5_common_mr.c
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -2,7 +2,10 @@
  * Copyright 2016 6WIND S.A.
  * Copyright 2020 Mellanox Technologies, Ltd
  */
+#include <stddef.h>
+
 #include <rte_eal_memconfig.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_mempool.h>
 #include <rte_malloc.h>
@@ -21,6 +24,29 @@ struct mr_find_contig_memsegs_data {
 	const struct rte_memseg_list *msl;
 };
 
+/* Virtual memory range. */
+struct mlx5_range {
+	uintptr_t start;
+	uintptr_t end;
+};
+
+/** Memory region for a mempool. */
+struct mlx5_mempool_mr {
+	struct mlx5_pmd_mr pmd_mr;
+	uint32_t refcnt; /**< Number of mempools sharing this MR. */
+};
+
+/* Mempool registration. */
+struct mlx5_mempool_reg {
+	LIST_ENTRY(mlx5_mempool_reg) next;
+	/** Registered mempool, used to designate registrations. */
+	struct rte_mempool *mp;
+	/** Memory regions for the address ranges of the mempool. */
+	struct mlx5_mempool_mr *mrs;
+	/** Number of memory regions. */
+	unsigned int mrs_n;
+};
+
 /**
  * Expand B-tree table to a given size. Can't be called with holding
  * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
@@ -1191,3 +1217,557 @@ mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
 	rte_rwlock_read_unlock(&share_cache->rwlock);
 #endif
 }
+
+static int
+mlx5_range_compare_start(const void *lhs, const void *rhs)
+{
+	const struct mlx5_range *r1 = lhs, *r2 = rhs;
+
+	if (r1->start > r2->start)
+		return 1;
+	else if (r1->start < r2->start)
+		return -1;
+	return 0;
+}
+
+static void
+mlx5_range_from_mempool_chunk(struct rte_mempool *mp, void *opaque,
+			      struct rte_mempool_memhdr *memhdr,
+			      unsigned int idx)
+{
+	struct mlx5_range *ranges = opaque, *range = &ranges[idx];
+	uint64_t page_size = rte_mem_page_size();
+
+	RTE_SET_USED(mp);
+	range->start = RTE_ALIGN_FLOOR((uintptr_t)memhdr->addr, page_size);
+	range->end = RTE_ALIGN_CEIL(range->start + memhdr->len, page_size);
+}
+
+/**
+ * Get VA-contiguous ranges of the mempool memory.
+ * Each range start and end is aligned to the system page size.
+ *
+ * @param[in] mp
+ *   Analyzed mempool.
+ * @param[out] out
+ *   Receives the ranges, caller must release it with free().
+ * @param[out] out_n
+ *   Receives the number of @p out elements.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_get_mempool_ranges(struct rte_mempool *mp, struct mlx5_range **out,
+			unsigned int *out_n)
+{
+	struct mlx5_range *chunks;
+	unsigned int chunks_n = mp->nb_mem_chunks, contig_n, i;
+
+	/* Collect page-aligned memory ranges of the mempool. */
+	chunks = calloc(sizeof(chunks[0]), chunks_n);
+	if (chunks == NULL)
+		return -1;
+	rte_mempool_mem_iter(mp, mlx5_range_from_mempool_chunk, chunks);
+	/* Merge adjacent chunks and place them at the beginning. */
+	qsort(chunks, chunks_n, sizeof(chunks[0]), mlx5_range_compare_start);
+	contig_n = 1;
+	for (i = 1; i < chunks_n; i++)
+		if (chunks[i - 1].end != chunks[i].start) {
+			chunks[contig_n - 1].end = chunks[i - 1].end;
+			chunks[contig_n] = chunks[i];
+			contig_n++;
+		}
+	/* Extend the last contiguous chunk to the end of the mempool. */
+	chunks[contig_n - 1].end = chunks[i - 1].end;
+	*out = chunks;
+	*out_n = contig_n;
+	return 0;
+}
+
+/**
+ * Analyze mempool memory to select memory ranges to register.
+ *
+ * @param[in] mp
+ *   Mempool to analyze.
+ * @param[out] out
+ *   Receives memory ranges to register, aligned to the system page size.
+ *   The caller must release them with free().
+ * @param[out] out_n
+ *   Receives the number of @p out items.
+ * @param[out] share_hugepage
+ *   Receives True if the entire pool resides within a single hugepage.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_mempool_reg_analyze(struct rte_mempool *mp, struct mlx5_range **out,
+			 unsigned int *out_n, bool *share_hugepage)
+{
+	struct mlx5_range *ranges = NULL;
+	unsigned int i, ranges_n = 0;
+	struct rte_memseg_list *msl;
+
+	if (mlx5_get_mempool_ranges(mp, &ranges, &ranges_n) < 0) {
+		DRV_LOG(ERR, "Cannot get address ranges for mempool %s",
+			mp->name);
+		return -1;
+	}
+	/* Check if the hugepage of the pool can be shared. */
+	*share_hugepage = false;
+	msl = rte_mem_virt2memseg_list((void *)ranges[0].start);
+	if (msl != NULL) {
+		uint64_t hugepage_sz = 0;
+
+		/* Check that all ranges are on pages of the same size. */
+		for (i = 0; i < ranges_n; i++) {
+			if (hugepage_sz != 0 && hugepage_sz != msl->page_sz)
+				break;
+			hugepage_sz = msl->page_sz;
+		}
+		if (i == ranges_n) {
+			/*
+			 * If the entire pool is within one hugepage,
+			 * combine all ranges into one of the hugepage size.
+			 */
+			uintptr_t reg_start = ranges[0].start;
+			uintptr_t reg_end = ranges[ranges_n - 1].end;
+			uintptr_t hugepage_start =
+				RTE_ALIGN_FLOOR(reg_start, hugepage_sz);
+			uintptr_t hugepage_end = hugepage_start + hugepage_sz;
+			if (reg_end < hugepage_end) {
+				ranges[0].start = hugepage_start;
+				ranges[0].end = hugepage_end;
+				ranges_n = 1;
+				*share_hugepage = true;
+			}
+		}
+	}
+	*out = ranges;
+	*out_n = ranges_n;
+	return 0;
+}
+
+/** Create a registration object for the mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_create(struct rte_mempool *mp, unsigned int mrs_n)
+{
+	struct mlx5_mempool_reg *mpr = NULL;
+
+	mpr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+			  sizeof(*mpr) + mrs_n * sizeof(mpr->mrs[0]),
+			  RTE_CACHE_LINE_SIZE, SOCKET_ID_ANY);
+	if (mpr == NULL) {
+		DRV_LOG(ERR, "Cannot allocate mempool %s registration object",
+			mp->name);
+		return NULL;
+	}
+	mpr->mp = mp;
+	mpr->mrs = (struct mlx5_mempool_mr *)(mpr + 1);
+	mpr->mrs_n = mrs_n;
+	return mpr;
+}
+
+/**
+ * Destroy a mempool registration object.
+ *
+ * @param standalone
+ *   Whether @p mpr owns its MRs exclusively, i.e. they are not shared.
+ */
+static void
+mlx5_mempool_reg_destroy(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mempool_reg *mpr, bool standalone)
+{
+	if (standalone) {
+		unsigned int i;
+
+		for (i = 0; i < mpr->mrs_n; i++)
+			share_cache->dereg_mr_cb(&mpr->mrs[i].pmd_mr);
+	}
+	mlx5_free(mpr);
+}
+
+/** Find registration object of a mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_lookup(struct mlx5_mr_share_cache *share_cache,
+			struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp)
+			break;
+	return mpr;
+}
+
+/** Increment reference counters of MRs used in the registration. */
+static void
+mlx5_mempool_reg_attach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		__atomic_add_fetch(&mpr->mrs[i].refcnt, 1, __ATOMIC_RELAXED);
+}
+
+/**
+ * Decrement reference counters of MRs used in the registration.
+ *
+ * @return True if no more references to @p mpr MRs exist, False otherwise.
+ */
+static bool
+mlx5_mempool_reg_detach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+	bool ret = false;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		ret |= __atomic_sub_fetch(&mpr->mrs[i].refcnt, 1,
+					  __ATOMIC_RELAXED) == 0;
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache,
+				 void *pd, struct rte_mempool *mp)
+{
+	struct mlx5_range *ranges = NULL;
+	struct mlx5_mempool_reg *mpr, *new_mpr;
+	unsigned int i, ranges_n;
+	bool share_hugepage;
+	int ret = -1;
+
+	/* Early check to avoid unnecessary creation of MRs. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+	if (mlx5_mempool_reg_analyze(mp, &ranges, &ranges_n,
+				     &share_hugepage) < 0) {
+		DRV_LOG(ERR, "Cannot get mempool %s memory ranges", mp->name);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	new_mpr = mlx5_mempool_reg_create(mp, ranges_n);
+	if (new_mpr == NULL) {
+		DRV_LOG(ERR,
+			"Cannot create a registration object for mempool %s in PD %p",
+			mp->name, pd);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	/*
+	 * If the entire mempool fits in a single hugepage, the MR for this
+	 * hugepage can be shared across mempools that also fit in it.
+	 */
+	if (share_hugepage) {
+		rte_rwlock_write_lock(&share_cache->rwlock);
+		LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next) {
+			if (mpr->mrs[0].pmd_mr.addr == (void *)ranges[0].start)
+				break;
+		}
+		if (mpr != NULL) {
+			new_mpr->mrs = mpr->mrs;
+			mlx5_mempool_reg_attach(new_mpr);
+			LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+					 new_mpr, next);
+		}
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		if (mpr != NULL) {
+			DRV_LOG(DEBUG, "Shared MR %#x in PD %p for mempool %s with mempool %s",
+				mpr->mrs[0].pmd_mr.lkey, pd, mp->name,
+				mpr->mp->name);
+			ret = 0;
+			goto exit;
+		}
+	}
+	for (i = 0; i < ranges_n; i++) {
+		struct mlx5_mempool_mr *mr = &new_mpr->mrs[i];
+		const struct mlx5_range *range = &ranges[i];
+		size_t len = range->end - range->start;
+
+		if (share_cache->reg_mr_cb(pd, (void *)range->start, len,
+		    &mr->pmd_mr) < 0) {
+			DRV_LOG(ERR,
+				"Failed to create an MR in PD %p for address range "
+				"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+				pd, range->start, range->end, len, mp->name);
+			break;
+		}
+		DRV_LOG(DEBUG,
+			"Created a new MR %#x in PD %p for address range "
+			"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+			mr->pmd_mr.lkey, pd, range->start, range->end, len,
+			mp->name);
+	}
+	if (i != ranges_n) {
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EINVAL;
+		goto exit;
+	}
+	/* Concurrent registration is not supposed to happen. */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr == NULL) {
+		mlx5_mempool_reg_attach(new_mpr);
+		LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+				 new_mpr, next);
+		ret = 0;
+	}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+exit:
+	free(ranges);
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_secondary(struct mlx5_mr_share_cache *share_cache,
+				   void *pd, struct rte_mempool *mp,
+				   struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, pd, mp, true);
+}
+
+/**
+ * Register the memory of a mempool in the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param pd
+ *   Protection domain object.
+ * @param mp
+ *   Mempool to register.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_register_primary(share_cache, pd, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_register_secondary(share_cache, pd, mp,
+							  mp_id);
+	default:
+		return -1;
+	}
+}
+
+static int
+mlx5_mr_mempool_unregister_primary(struct mlx5_mr_share_cache *share_cache,
+				   struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+	bool standalone = false;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp) {
+			LIST_REMOVE(mpr, next);
+			standalone = mlx5_mempool_reg_detach(mpr);
+			if (standalone)
+				/*
+				 * The unlock operation below provides a memory
+				 * barrier due to its store-release semantics.
+				 */
+				++share_cache->dev_gen;
+			break;
+		}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr == NULL) {
+		rte_errno = ENOENT;
+		return -1;
+	}
+	mlx5_mempool_reg_destroy(share_cache, mpr, standalone);
+	return 0;
+}
+
+static int
+mlx5_mr_mempool_unregister_secondary(struct mlx5_mr_share_cache *share_cache,
+				     struct rte_mempool *mp,
+				     struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, NULL, mp, false);
+}
+
+/**
+ * Unregister the memory of a mempool from the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param mp
+ *   Mempool to unregister.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_unregister_primary(share_cache, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_unregister_secondary(share_cache, mp,
+							    mp_id);
+	default:
+		return -1;
+	}
+}
+
+/**
+ * Look up an MR key by an address in a registered mempool.
+ *
+ * @param mpr
+ *   Mempool registration object.
+ * @param addr
+ *   Address within the mempool.
+ * @param entry
+ *   Bottom-half cache entry to fill.
+ *
+ * @return
+ *   MR key or UINT32_MAX on failure, which can only happen
+ *   if the address is not from within the mempool.
+ */
+static uint32_t
+mlx5_mempool_reg_addr2mr(struct mlx5_mempool_reg *mpr, uintptr_t addr,
+			 struct mr_cache_entry *entry)
+{
+	uint32_t lkey = UINT32_MAX;
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++) {
+		const struct mlx5_pmd_mr *mr = &mpr->mrs[i].pmd_mr;
+		uintptr_t mr_addr = (uintptr_t)mr->addr;
+
+		if (mr_addr <= addr) {
+			lkey = rte_cpu_to_be_32(mr->lkey);
+			entry->start = mr_addr;
+			entry->end = mr_addr + mr->len;
+			entry->lkey = lkey;
+			break;
+		}
+	}
+	return lkey;
+}
+
+/**
+ * Update bottom-half cache from the list of mempool registrations.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param entry
+ *   Pointer to an entry in the bottom-half cache to update
+ *   with the MR lkey looked up.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+static uint32_t
+mlx5_lookup_mempool_regs(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mr_ctrl *mr_ctrl,
+			 struct mr_cache_entry *entry,
+			 struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	struct mlx5_mempool_reg *mpr;
+	uint32_t lkey = UINT32_MAX;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in mempool registrations. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr != NULL)
+		lkey = mlx5_mempool_reg_addr2mr(mpr, addr, entry);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/*
+	 * Update local cache. Even if it fails, return the found entry
+	 * to update top-half cache. Next time, this entry will be found
+	 * in the global cache.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half lookup for the address from the mempool.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+uint32_t
+mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+		      struct mlx5_mr_ctrl *mr_ctrl,
+		      struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		lkey = mlx5_lookup_mempool_regs(share_cache, mr_ctrl, repl,
+						mp, addr);
+		/* Can only fail if the address is not from the mempool. */
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
index 6e465a05e9..685ac98e08 100644
--- a/drivers/common/mlx5/mlx5_common_mr.h
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -13,6 +13,7 @@
 
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_mbuf.h>
 #include <rte_memory.h>
 
 #include "mlx5_glue.h"
@@ -75,6 +76,7 @@ struct mlx5_mr_ctrl {
 } __rte_packed;
 
 LIST_HEAD(mlx5_mr_list, mlx5_mr);
+LIST_HEAD(mlx5_mempool_reg_list, mlx5_mempool_reg);
 
 /* Global per-device MR cache. */
 struct mlx5_mr_share_cache {
@@ -83,6 +85,7 @@ struct mlx5_mr_share_cache {
 	struct mlx5_mr_btree cache; /* Global MR cache table. */
 	struct mlx5_mr_list mr_list; /* Registered MR list. */
 	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+	struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */
 	mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
 	mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 } __rte_packed;
@@ -136,6 +139,10 @@ uint32_t mlx5_mr_addr2mr_bh(void *pd, struct mlx5_mp_id *mp_id,
 			    struct mlx5_mr_ctrl *mr_ctrl,
 			    uintptr_t addr, unsigned int mr_ext_memseg_en);
 __rte_internal
+uint32_t mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+			       struct mlx5_mr_ctrl *mr_ctrl,
+			       struct rte_mempool *mp, uintptr_t addr);
+__rte_internal
 void mlx5_mr_release_cache(struct mlx5_mr_share_cache *mr_cache);
 __rte_internal
 void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
@@ -179,4 +186,14 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr);
 __rte_internal
 void
 mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb);
+
+__rte_internal
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+__rte_internal
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+
 #endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index d3c5040aac..85100d5afb 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -152,4 +152,9 @@ INTERNAL {
 	mlx5_realloc;
 
 	mlx5_translate_port_name; # WINDOWS_NO_EXPORT
+
+	mlx5_mr_mempool_register;
+	mlx5_mr_mempool_unregister;
+	mlx5_mp_req_mempool_reg;
+	mlx5_mr_mempool2mr_bh;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v5 4/4] net/mlx5: support mempool registration
  2021-10-15 16:02       ` [dpdk-dev] [PATCH v5 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                           ` (2 preceding siblings ...)
  2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
@ 2021-10-15 16:02         ` Dmitry Kozlyuk
  2021-10-16 20:00         ` [dpdk-dev] [PATCH v6 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-15 16:02 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko

When the first port in a given protection domain (PD) starts,
install a mempool event callback for this PD and register all existing
mempools for it, creating memory regions (MRs) for their memory.
When the last port in a PD closes,
remove the callback and unregister all mempools for this PD.
This behavior can be switched off with a new devarg: mr_mempool_reg_en.

On the Tx slow path, i.e. when an MR key for the address of the buffer
to send is not in the local cache, first try to retrieve it from
the database of registered mempools. Supported are direct and indirect
mbufs, as well as externally attached ones from the MLX5 MPRQ feature.
Lookup in the database of non-mempool memory is used as the last resort.

Rx mempools are registered regardless of the devarg value.
On the Rx data path only the local cache and the mempool database are used.
If implicit mempool registration is disabled, these mempools
are unregistered at port stop, releasing the MRs.
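
For example, implicit registration could be disabled for one device by
appending the devarg to the EAL allow-list option (the PCI address below
is only illustrative):

    dpdk-testpmd -a 0000:03:00.0,mr_mempool_reg_en=0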

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  13 +++
 doc/guides/rel_notes/release_21_11.rst |   6 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 +++++++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h                |  10 ++
 drivers/net/mlx5/mlx5_mr.c             | 120 +++++--------------
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 ++--
 drivers/net/mlx5/mlx5_rxq.c            |  13 +++
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++++++++++--
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 12 files changed, 347 insertions(+), 116 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index bae73f42d8..106e32e1c4 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1001,6 +1001,19 @@ Driver options
 
   Enabled by default.
 
+- ``mr_mempool_reg_en`` parameter [int]
+
+  A nonzero value enables implicit registration of DMA memory of all mempools
+  except those having ``MEMPOOL_F_NON_IO``. This flag is set automatically
+  for mempools populated with non-contiguous objects or those without IOVA.
+  The effect is that when a packet from a mempool is transmitted,
+  its memory is already registered for DMA in the PMD and no registration
+  will happen on the data path. The tradeoff is extra work on the creation
+  of each mempool and increased HW resource use if some mempools
+  are not used with MLX5 devices.
+
+  Enabled by default.
+
 - ``representor`` parameter [list]
 
   This parameter can be used to instantiate DPDK Ethernet devices from
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 39a8a3d950..f141999a0d 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -159,6 +159,12 @@ New Features
   * Added tests to verify tunnel header verification in IPsec inbound.
   * Added tests to verify inner checksum.
 
+* **Updated Mellanox mlx5 driver.**
+
+  Updated the Mellanox mlx5 driver with new features and improvements, including:
+
+  * Added implicit mempool registration to avoid data path hiccups (opt-out).
+
 
 Removed Items
 -------------
diff --git a/drivers/net/mlx5/linux/mlx5_mp_os.c b/drivers/net/mlx5/linux/mlx5_mp_os.c
index 3a4aa766f8..d2ac375a47 100644
--- a/drivers/net/mlx5/linux/mlx5_mp_os.c
+++ b/drivers/net/mlx5/linux/mlx5_mp_os.c
@@ -20,6 +20,45 @@
 #include "mlx5_tx.h"
 #include "mlx5_utils.h"
 
+/**
+ * Handle a port-agnostic message.
+ *
+ * @return
+ *   0 on success, 1 when message is not port-agnostic, (-1) on error.
+ */
+static int
+mlx5_mp_os_handle_port_agnostic(const struct rte_mp_msg *mp_msg,
+				const void *peer)
+{
+	struct rte_mp_msg mp_res;
+	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
+	const struct mlx5_mp_param *param =
+		(const struct mlx5_mp_param *)mp_msg->param;
+	const struct mlx5_mp_arg_mempool_reg *mpr;
+	struct mlx5_mp_id mp_id;
+
+	switch (param->type) {
+	case MLX5_MP_REQ_MEMPOOL_REGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_register(mpr->share_cache,
+						       mpr->pd, mpr->mempool,
+						       NULL);
+		return rte_mp_reply(&mp_res, peer);
+	case MLX5_MP_REQ_MEMPOOL_UNREGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_unregister(mpr->share_cache,
+							 mpr->mempool, NULL);
+		return rte_mp_reply(&mp_res, peer);
+	default:
+		return 1;
+	}
+	return -1;
+}
+
 int
 mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
@@ -34,6 +73,11 @@ mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/* Port-agnostic messages. */
+	ret = mlx5_mp_os_handle_port_agnostic(mp_msg, peer);
+	if (ret <= 0)
+		return ret;
+	/* Port-specific messages. */
 	if (!rte_eth_dev_is_valid_port(param->port_id)) {
 		rte_errno = ENODEV;
 		DRV_LOG(ERR, "port %u invalid port ID", param->port_id);
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 3746057673..e036ed1435 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1034,8 +1034,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
-		mp_id.port_id = eth_dev->data->port_id;
-		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+		mlx5_mp_id_init(&mp_id, eth_dev->data->port_id);
 		/* Receive command fd from primary process */
 		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
@@ -2133,6 +2132,7 @@ mlx5_os_config_default(struct mlx5_dev_config *config)
 	config->txqs_inline = MLX5_ARG_UNSET;
 	config->vf_nl_en = 1;
 	config->mr_ext_memseg_en = 1;
+	config->mr_mempool_reg_en = 1;
 	config->mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	config->mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	config->dv_esw_en = 1;
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 45ccfe2784..1e1b8b736b 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -181,6 +181,9 @@
 /* Device parameter to configure allow or prevent duplicate rules pattern. */
 #define MLX5_ALLOW_DUPLICATE_PATTERN "allow_duplicate_pattern"
 
+/* Device parameter to configure implicit registration of mempool memory. */
+#define MLX5_MR_MEMPOOL_REG_EN "mr_mempool_reg_en"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1088,6 +1091,141 @@ mlx5_alloc_rxtx_uars(struct mlx5_dev_ctx_shared *sh,
 	return err;
 }
 
+/**
+ * Unregister the mempool from the protection domain.
+ *
+ * @param sh
+ *   Pointer to the device shared context.
+ * @param mp
+ *   Mempool being unregistered.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister(struct mlx5_dev_ctx_shared *sh,
+				       struct rte_mempool *mp)
+{
+	struct mlx5_mp_id mp_id;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	if (mlx5_mr_mempool_unregister(&sh->share_cache, mp, &mp_id) < 0)
+		DRV_LOG(WARNING, "Failed to unregister mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to register mempools
+ * for the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_register_cb(struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+	int ret;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	ret = mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp, &mp_id);
+	if (ret < 0 && rte_errno != EEXIST)
+		DRV_LOG(ERR, "Failed to register existing mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to unregister mempools
+ * from the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister_cb(struct rte_mempool *mp, void *arg)
+{
+	mlx5_dev_ctx_shared_mempool_unregister
+				((struct mlx5_dev_ctx_shared *)arg, mp);
+}
+
+/**
+ * Mempool life cycle callback for Ethernet devices.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   Associated mempool.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_event_cb(enum rte_mempool_event event,
+				     struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+
+	switch (event) {
+	case RTE_MEMPOOL_EVENT_READY:
+		mlx5_mp_id_init(&mp_id, 0);
+		if (mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp,
+					     &mp_id) < 0)
+			DRV_LOG(ERR, "Failed to register new mempool %s for PD %p: %s",
+				mp->name, sh->pd, rte_strerror(rte_errno));
+		break;
+	case RTE_MEMPOOL_EVENT_DESTROY:
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+		break;
+	}
+}
+
+/**
+ * Callback used when implicit mempool registration is disabled
+ * in order to track Rx mempool destruction.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   An Rx mempool registered explicitly when the port is started.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_rx_mempool_event_cb(enum rte_mempool_event event,
+					struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+
+	if (event == RTE_MEMPOOL_EVENT_DESTROY)
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+}
+
+int
+mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = priv->sh;
+	int ret;
+
+	/* Check if we only need to track Rx mempool destruction. */
+	if (!priv->config.mr_mempool_reg_en) {
+		ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+		return ret == 0 || rte_errno == EEXIST ? 0 : ret;
+	}
+	/* Callback for this shared context may be already registered. */
+	ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret != 0 && rte_errno != EEXIST)
+		return ret;
+	/* Register mempools only once for this shared context. */
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_register_cb, sh);
+	return 0;
+}
+
 /**
  * Allocate shared device context. If there is multiport device the
  * master and representors will share this context, if there is single
@@ -1287,6 +1425,8 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 void
 mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 {
+	int ret;
+
 	pthread_mutex_lock(&mlx5_dev_ctx_list_mutex);
 #ifdef RTE_LIBRTE_MLX5_DEBUG
 	/* Check the object presence in the list. */
@@ -1307,6 +1447,15 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	if (--sh->refcnt)
 		goto exit;
+	/* Stop watching for mempool events and unregister all mempools. */
+	ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret < 0 && rte_errno == ENOENT)
+		ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_unregister_cb,
+				 sh);
 	/* Remove from memory callback device list. */
 	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
 	LIST_REMOVE(sh, mem_event_cb);
@@ -1997,6 +2146,8 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
 		config->decap_en = !!tmp;
 	} else if (strcmp(MLX5_ALLOW_DUPLICATE_PATTERN, key) == 0) {
 		config->allow_duplicate_pattern = !!tmp;
+	} else if (strcmp(MLX5_MR_MEMPOOL_REG_EN, key) == 0) {
+		config->mr_mempool_reg_en = !!tmp;
 	} else {
 		DRV_LOG(WARNING, "%s: unknown parameter", key);
 		rte_errno = EINVAL;
@@ -2058,6 +2209,7 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
 		MLX5_SYS_MEM_EN,
 		MLX5_DECAP_EN,
 		MLX5_ALLOW_DUPLICATE_PATTERN,
+		MLX5_MR_MEMPOOL_REG_EN,
 		NULL,
 	};
 	struct rte_kvargs *kvlist;
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 3581414b78..fe533fcc81 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -155,6 +155,13 @@ struct mlx5_flow_dump_ack {
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
+/** Initialize a multi-process ID. */
+static inline void
+mlx5_mp_id_init(struct mlx5_mp_id *mp_id, uint16_t port_id)
+{
+	mp_id->port_id = port_id;
+	strlcpy(mp_id->name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+}
 
 LIST_HEAD(mlx5_dev_list, mlx5_dev_ctx_shared);
 
@@ -270,6 +277,8 @@ struct mlx5_dev_config {
 	unsigned int dv_miss_info:1; /* restore packet after partial hw miss */
 	unsigned int allow_duplicate_pattern:1;
 	/* Allow/Prevent the duplicate rules pattern. */
+	unsigned int mr_mempool_reg_en:1;
+	/* Allow/prevent implicit mempool memory registration. */
 	struct {
 		unsigned int enabled:1; /* Whether MPRQ is enabled. */
 		unsigned int stride_num_n; /* Number of strides. */
@@ -1498,6 +1507,7 @@ struct mlx5_dev_ctx_shared *
 mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 			   const struct mlx5_dev_config *config);
 void mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh);
+int mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev);
 void mlx5_free_table_hash_list(struct mlx5_priv *priv);
 int mlx5_alloc_table_hash_list(struct mlx5_priv *priv);
 void mlx5_set_min_inline(struct mlx5_dev_spawn_data *spawn,
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 44afda731f..55d27b50b9 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -65,30 +65,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Bottom-half of LKey search on Rx.
- *
- * @param rxq
- *   Pointer to Rx queue structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-uint32_t
-mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
-{
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
-	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
-	struct mlx5_priv *priv = rxq_ctrl->priv;
-
-	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, mr_ctrl, addr,
-				  priv->config.mr_ext_memseg_en);
-}
-
 /**
  * Bottom-half of LKey search on Tx.
  *
@@ -128,9 +104,36 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 uint32_t
 mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 {
+	struct mlx5_txq_ctrl *txq_ctrl =
+		container_of(txq, struct mlx5_txq_ctrl, txq);
+	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
+	struct mlx5_priv *priv = txq_ctrl->priv;
 	uintptr_t addr = (uintptr_t)mb->buf_addr;
 	uint32_t lkey;
 
+	if (priv->config.mr_mempool_reg_en) {
+		struct rte_mempool *mp = NULL;
+		struct mlx5_mprq_buf *buf;
+
+		if (!RTE_MBUF_HAS_EXTBUF(mb)) {
+			mp = mlx5_mb2mp(mb);
+		} else if (mb->shinfo->free_cb == mlx5_mprq_buf_free_cb) {
+			/* Recover MPRQ mempool. */
+			buf = mb->shinfo->fcb_opaque;
+			mp = buf->mp;
+		}
+		if (mp != NULL) {
+			lkey = mlx5_mr_mempool2mr_bh(&priv->sh->share_cache,
+						     mr_ctrl, mp, addr);
+			/*
+			 * Lookup can only fail on invalid input, e.g. "addr"
+			 * is not from "mp" or "mp" has MEMPOOL_F_NON_IO set.
+			 */
+			if (lkey != UINT32_MAX)
+				return lkey;
+		}
+		/* Fallback for generic mechanism in corner cases. */
+	}
 	lkey = mlx5_tx_addr2mr_bh(txq, addr);
 	if (lkey == UINT32_MAX && rte_errno == ENXIO) {
 		/* Mempool may have externally allocated memory. */
@@ -392,72 +395,3 @@ mlx5_tx_update_ext_mp(struct mlx5_txq_data *txq, uintptr_t addr,
 	mlx5_mr_update_ext_mp(ETH_DEV(priv), mr_ctrl, mp);
 	return mlx5_tx_addr2mr_bh(txq, addr);
 }
-
-/* Called during rte_mempool_mem_iter() by mlx5_mr_update_mp(). */
-static void
-mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
-		     struct rte_mempool_memhdr *memhdr,
-		     unsigned mem_idx __rte_unused)
-{
-	struct mr_update_mp_data *data = opaque;
-	struct rte_eth_dev *dev = data->dev;
-	struct mlx5_priv *priv = dev->data->dev_private;
-
-	uint32_t lkey;
-
-	/* Stop iteration if failed in the previous walk. */
-	if (data->ret < 0)
-		return;
-	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr,
-				  priv->config.mr_ext_memseg_en);
-	if (lkey == UINT32_MAX)
-		data->ret = -1;
-}
-
-/**
- * Register entire memory chunks in a Mempool.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param mp
- *   Pointer to registering Mempool.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-int
-mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		  struct rte_mempool *mp)
-{
-	struct mr_update_mp_data data = {
-		.dev = dev,
-		.mr_ctrl = mr_ctrl,
-		.ret = 0,
-	};
-	uint32_t flags = rte_pktmbuf_priv_flags(mp);
-
-	if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
-		/*
-		 * The pinned external buffer should be registered for DMA
-		 * operations by application. The mem_list of the pool contains
-		 * the list of chunks with mbuf structures w/o built-in data
-		 * buffers and DMA actually does not happen there, no need
-		 * to create MR for these chunks.
-		 */
-		return 0;
-	}
-	DRV_LOG(DEBUG, "Port %u Rx queue registering mp %s "
-		       "having %u chunks.", dev->data->port_id,
-		       mp->name, mp->nb_mem_chunks);
-	rte_mempool_mem_iter(mp, mlx5_mr_update_mp_cb, &data);
-	if (data.ret < 0 && rte_errno == ENXIO) {
-		/* Mempool may have externally allocated memory. */
-		return mlx5_mr_update_ext_mp(dev, mr_ctrl, mp);
-	}
-	return data.ret;
-}
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 4a7fab6df2..c984e777b5 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -22,7 +22,5 @@
 
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
-int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		      struct rte_mempool *mp);
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 2b7ad3e48b..1b00076fe7 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -275,13 +275,11 @@ uint16_t mlx5_rx_burst_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 				uint16_t pkts_n);
 
-/* mlx5_mr.c */
-
-uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
+static int mlx5_rxq_mprq_enabled(struct mlx5_rxq_data *rxq);
 
 /**
- * Query LKey from a packet buffer for Rx. No need to flush local caches for Rx
- * as mempool is pre-configured and static.
+ * Query LKey from a packet buffer for Rx. No need to flush local caches
+ * as the Rx mempool database entries are valid for the lifetime of the queue.
  *
  * @param rxq
  *   Pointer to Rx queue structure.
@@ -290,11 +288,14 @@ uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
  *
  * @return
  *   Searched LKey on success, UINT32_MAX on no match.
+ *   This function always succeeds on valid input.
  */
 static __rte_always_inline uint32_t
 mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 {
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
+	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct rte_mempool *mp;
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
@@ -302,8 +303,14 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
-	/* Take slower bottom-half (Binary Search) on miss. */
-	return mlx5_rx_addr2mr_bh(rxq, addr);
+	/*
+	 * Slower search in the mempool database on miss.
+	 * During queue creation rxq->sh is not yet set, so we use rxq_ctrl.
+	 */
+	rxq_ctrl = container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	mp = mlx5_rxq_mprq_enabled(rxq) ? rxq->mprq_mp : rxq->mp;
+	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->priv->sh->share_cache,
+				     mr_ctrl, mp, addr);
 }
 
 #define mlx5_rx_mb2mr(rxq, mb) mlx5_rx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index b68443bed5..247f36e5d7 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1162,6 +1162,7 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 	unsigned int strd_sz_n = 0;
 	unsigned int i;
 	unsigned int n_ibv = 0;
+	int ret;
 
 	if (!mlx5_mprq_enabled(dev))
 		return 0;
@@ -1241,6 +1242,16 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
+	ret = mlx5_mr_mempool_register(&priv->sh->share_cache, priv->sh->pd,
+				       mp, &priv->mp_id);
+	if (ret < 0 && rte_errno != EEXIST) {
+		ret = rte_errno;
+		DRV_LOG(ERR, "port %u failed to register a mempool for Multi-Packet RQ",
+			dev->data->port_id);
+		rte_mempool_free(mp);
+		rte_errno = ret;
+		return -rte_errno;
+	}
 	priv->mprq_mp = mp;
 exit:
 	/* Set mempool for each Rx queue. */
@@ -1443,6 +1454,8 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		/* rte_errno is already set. */
 		goto error;
 	}
+	/* Rx queues don't use this pointer, but we want a valid structure. */
+	tmpl->rxq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	tmpl->socket = socket;
 	if (dev->data->dev_conf.intr_conf.rxq)
 		tmpl->irq = 1;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 54173bfacb..3cbf5816a1 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -105,6 +105,59 @@ mlx5_txq_start(struct rte_eth_dev *dev)
 	return -rte_errno;
 }
 
+/**
+ * Translate the chunk address to an MR key in order to put it into the cache.
+ */
+static void
+mlx5_rxq_mempool_register_cb(struct rte_mempool *mp, void *opaque,
+			     struct rte_mempool_memhdr *memhdr,
+			     unsigned int idx)
+{
+	struct mlx5_rxq_data *rxq = opaque;
+
+	RTE_SET_USED(mp);
+	RTE_SET_USED(idx);
+	mlx5_rx_addr2mr(rxq, (uintptr_t)memhdr->addr);
+}
+
+/**
+ * Register Rx queue mempools and fill the Rx queue cache.
+ * This function tolerates repeated mempool registration.
+ *
+ * @param[in] rxq_ctrl
+ *   Rx queue control data.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+static int
+mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	struct mlx5_priv *priv = rxq_ctrl->priv;
+	struct rte_mempool *mp;
+	uint32_t s;
+	int ret = 0;
+
+	mlx5_mr_flush_local_cache(&rxq_ctrl->rxq.mr_ctrl);
+	/* MPRQ mempool is registered on creation, just fill the cache. */
+	if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
+		rte_mempool_mem_iter(rxq_ctrl->rxq.mprq_mp,
+				     mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+		return 0;
+	}
+	for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++) {
+		mp = rxq_ctrl->rxq.rxseg[s].mp;
+		ret = mlx5_mr_mempool_register(&priv->sh->share_cache,
+					       priv->sh->pd, mp, &priv->mp_id);
+		if (ret < 0 && rte_errno != EEXIST)
+			return ret;
+		rte_mempool_mem_iter(mp, mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+	}
+	return 0;
+}
+
 /**
  * Stop traffic on Rx queues.
  *
@@ -152,18 +205,13 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (!rxq_ctrl)
 			continue;
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-			/* Pre-register Rx mempools. */
-			if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
-				mlx5_mr_update_mp(dev, &rxq_ctrl->rxq.mr_ctrl,
-						  rxq_ctrl->rxq.mprq_mp);
-			} else {
-				uint32_t s;
-
-				for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++)
-					mlx5_mr_update_mp
-						(dev, &rxq_ctrl->rxq.mr_ctrl,
-						rxq_ctrl->rxq.rxseg[s].mp);
-			}
+			/*
+			 * Pre-register the mempools. Regardless of whether
+			 * the implicit registration is enabled or not,
+			 * Rx mempool destruction is tracked to free MRs.
+			 */
+			if (mlx5_rxq_mempool_register(rxq_ctrl) < 0)
+				goto error;
 			ret = rxq_alloc_elts(rxq_ctrl);
 			if (ret)
 				goto error;
@@ -1124,6 +1172,11 @@ mlx5_dev_start(struct rte_eth_dev *dev)
 			dev->data->port_id, strerror(rte_errno));
 		goto error;
 	}
+	if (mlx5_dev_ctx_shared_mempool_subscribe(dev) != 0) {
+		DRV_LOG(ERR, "port %u failed to subscribe for mempool life cycle: %s",
+			dev->data->port_id, rte_strerror(rte_errno));
+		goto error;
+	}
 	rte_wmb();
 	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
 	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
diff --git a/drivers/net/mlx5/windows/mlx5_os.c b/drivers/net/mlx5/windows/mlx5_os.c
index 26fa927039..149253d174 100644
--- a/drivers/net/mlx5/windows/mlx5_os.c
+++ b/drivers/net/mlx5/windows/mlx5_os.c
@@ -1116,6 +1116,7 @@ mlx5_os_net_probe(struct rte_device *dev)
 	dev_config.txqs_inline = MLX5_ARG_UNSET;
 	dev_config.vf_nl_en = 0;
 	dev_config.mr_ext_memseg_en = 1;
+	dev_config.mr_mempool_reg_en = 1;
 	dev_config.mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	dev_config.mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	dev_config.dv_esw_en = 0;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v6 0/4] net/mlx5: implicit mempool registration
  2021-10-15 16:02       ` [dpdk-dev] [PATCH v5 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                           ` (3 preceding siblings ...)
  2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
@ 2021-10-16 20:00         ` Dmitry Kozlyuk
  2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 1/4] mempool: add event callbacks Dmitry Kozlyuk
                             ` (4 more replies)
  4 siblings, 5 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-16 20:00 UTC (permalink / raw)
  To: dev

MLX5 hardware has its internal IOMMU where PMD registers the memory.
On the data path, PMD translates VA into a key consumed by the device
IOMMU.  It is impractical for the PMD to register all allocated memory
because of increased lookup cost both in HW and SW.  Most often mbuf
memory comes from mempools, so if PMD tracks them, it can almost always
have mbuf memory registered before an mbuf hits the PMD. This patchset
adds such tracking in the PMD and internal API to support it.

Please see [1] for the discussion of the patch 2/4
and how it can be useful outside of the MLX5 PMD.

[1]: http://inbox.dpdk.org/dev/CH0PR12MB509112FADB778AB28AF3771DB9F99@CH0PR12MB5091.namprd12.prod.outlook.com/

v6:
    Fix compilation issue in proc-info (CI).
v5:
    1. Change non-IO flag inference + various fixes (Andrew).
    2. Fix callback unregistration from secondary processes (Olivier).
    3. Support non-IO flag in proc-dump (David).
    4. Fix the usage of locks (Olivier).
    5. Avoid resource leaks in unit test (Olivier).
v4: (Andrew)
    1. Improve mempool event callbacks unit tests and documentation.
    2. Make MEMPOOL_F_NON_IO internal and automatically inferred.
       Add unit tests for the inference logic.
v3: Improve wording and naming; fix typos (Thomas).
v2 (internal review and testing):
    1. Change tracked mempool event from being created (CREATE) to being
       fully populated (READY), which is the state the PMD is interested in.
    2. Unit test the new mempool callback API.
    3. Remove bogus "error" messages in normal conditions.
    4. Fixes in PMD.

Dmitry Kozlyuk (4):
  mempool: add event callbacks
  mempool: add non-IO flag
  common/mlx5: add mempool registration facilities
  net/mlx5: support mempool registration

 app/proc-info/main.c                   |   6 +-
 app/test/test_mempool.c                | 360 +++++++++++++++
 doc/guides/nics/mlx5.rst               |  13 +
 doc/guides/rel_notes/release_21_11.rst |   9 +
 drivers/common/mlx5/mlx5_common_mp.c   |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h   |  14 +
 drivers/common/mlx5/mlx5_common_mr.c   | 580 +++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h   |  17 +
 drivers/common/mlx5/version.map        |   5 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 ++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++
 drivers/net/mlx5/mlx5.h                |  10 +
 drivers/net/mlx5/mlx5_mr.c             | 120 ++---
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 +-
 drivers/net/mlx5/mlx5_rxq.c            |  13 +
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++-
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 lib/mempool/rte_mempool.c              | 134 ++++++
 lib/mempool/rte_mempool.h              |  64 +++
 lib/mempool/version.map                |   8 +
 22 files changed, 1586 insertions(+), 118 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v6 1/4] mempool: add event callbacks
  2021-10-16 20:00         ` [dpdk-dev] [PATCH v6 0/4] net/mlx5: implicit " Dmitry Kozlyuk
@ 2021-10-16 20:00           ` Dmitry Kozlyuk
  2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 2/4] mempool: add non-IO flag Dmitry Kozlyuk
                             ` (3 subsequent siblings)
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-16 20:00 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Andrew Rybchenko, Olivier Matz, Ray Kinsella

Data path performance can benefit if the PMD knows which memory it will
need to handle in advance, before the first mbuf is sent to the PMD.
It is impractical, however, to consider all allocated memory for this
purpose. Most often mbuf memory comes from mempools that can come and
go. The PMD can enumerate existing mempools on device start, but it also
needs to track mempool creation and destruction after forwarding starts
but before an mbuf from a new mempool is sent to the device.

Add an API to register a callback for mempool life cycle events:
* rte_mempool_event_callback_register()
* rte_mempool_event_callback_unregister()
Currently tracked events are:
* RTE_MEMPOOL_EVENT_READY (after populating a mempool)
* RTE_MEMPOOL_EVENT_DESTROY (before freeing a mempool)
Provide a unit test for the new API.
The new API is internal, because it is primarily demanded by PMDs that
may need to deal with any mempools and do not control their creation,
while an application, on the other hand, knows which mempools it creates
and doesn't care about internal mempools PMDs might create.
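
For illustration only (not part of this patch), a driver-side consumer of
this API could look roughly like the sketch below; the "dummy_*" names and
the private context are hypothetical placeholders.

    /* Hypothetical handler: track mempools whose memory may be used for IO. */
    static void
    dummy_mempool_event_cb(enum rte_mempool_event event,
                           struct rte_mempool *mp, void *user_data)
    {
        struct dummy_priv *priv = user_data; /* placeholder driver context */

        switch (event) {
        case RTE_MEMPOOL_EVENT_READY:
            dummy_register_mempool(priv, mp);   /* placeholder */
            break;
        case RTE_MEMPOOL_EVENT_DESTROY:
            dummy_unregister_mempool(priv, mp); /* placeholder */
            break;
        }
    }

    /* On device start: */
    if (rte_mempool_event_callback_register(dummy_mempool_event_cb, priv) < 0)
        return -rte_errno;
    /* On device close: */
    rte_mempool_event_callback_unregister(dummy_mempool_event_cb, priv);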

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/test/test_mempool.c   | 248 ++++++++++++++++++++++++++++++++++++++
 lib/mempool/rte_mempool.c | 124 +++++++++++++++++++
 lib/mempool/rte_mempool.h |  62 ++++++++++
 lib/mempool/version.map   |   8 ++
 4 files changed, 442 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index 66bc8d86b7..c39c83256e 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -14,6 +14,7 @@
 #include <rte_common.h>
 #include <rte_log.h>
 #include <rte_debug.h>
+#include <rte_errno.h>
 #include <rte_memory.h>
 #include <rte_launch.h>
 #include <rte_cycles.h>
@@ -489,6 +490,245 @@ test_mp_mem_init(struct rte_mempool *mp,
 	data->ret = 0;
 }
 
+struct test_mempool_events_data {
+	struct rte_mempool *mp;
+	enum rte_mempool_event event;
+	bool invoked;
+};
+
+static void
+test_mempool_events_cb(enum rte_mempool_event event,
+		       struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_data *data = user_data;
+
+	data->mp = mp;
+	data->event = event;
+	data->invoked = true;
+}
+
+static int
+test_mempool_events(int (*populate)(struct rte_mempool *mp))
+{
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { goto fail; } while (0)
+
+	static const size_t CB_NUM = 3;
+	static const size_t MP_NUM = 2;
+
+	struct test_mempool_events_data data[CB_NUM];
+	struct rte_mempool *mp[MP_NUM], *freed;
+	char name[RTE_MEMPOOL_NAMESIZE];
+	size_t i, j;
+	int ret;
+
+	memset(mp, 0, sizeof(mp));
+	for (i = 0; i < CB_NUM; i++) {
+		ret = rte_mempool_event_callback_register
+				(test_mempool_events_cb, &data[i]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to register the callback %zu: %s",
+				      i, rte_strerror(rte_errno));
+	}
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb, mp);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback");
+	/* NULL argument has no special meaning in this API. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    NULL);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback with NULL argument");
+
+	/* Create mempool 0 that will be observed by all callbacks. */
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty0");
+	mp[0] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[0], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	for (j = 0; j < CB_NUM; j++)
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, false,
+				      "Callback %zu invoked on %s mempool creation",
+				      j, name);
+
+	rte_mempool_set_ops_byname(mp[0], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[0]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(rte_errno));
+	for (j = 0; j < CB_NUM; j++) {
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					"Callback %zu not invoked on mempool %s population",
+					j, name);
+		RTE_TEST_ASSERT_EQUAL(data[j].event,
+					RTE_MEMPOOL_EVENT_READY,
+					"Wrong callback invoked, expected READY");
+		RTE_TEST_ASSERT_EQUAL(data[j].mp, mp[0],
+					"Callback %zu invoked for a wrong mempool instead of %s",
+					j, name);
+	}
+
+	/* Check that unregistered callback 0 observes no events. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    &data[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister callback 0: %s",
+			      rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty1");
+	mp[1] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[1], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp[1], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[1]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[1]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+			      "Unregistered callback 0 invoked on %s mempool population",
+			      name);
+
+	for (i = 0; i < MP_NUM; i++) {
+		memset(&data, 0, sizeof(data));
+		sprintf(name, "empty%zu", i);
+		rte_mempool_free(mp[i]);
+		/*
+		 * Save pointer to check that it was passed to the callback,
+		 * but put NULL into the array in case cleanup is called early.
+		 */
+		freed = mp[i];
+		mp[i] = NULL;
+		for (j = 1; j < CB_NUM; j++) {
+			RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					      "Callback %zu not invoked on mempool %s destruction",
+					      j, name);
+			RTE_TEST_ASSERT_EQUAL(data[j].event,
+					      RTE_MEMPOOL_EVENT_DESTROY,
+					      "Wrong callback invoked, expected DESTROY");
+			RTE_TEST_ASSERT_EQUAL(data[j].mp, freed,
+					      "Callback %zu invoked for a wrong mempool instead of %s",
+					      j, name);
+		}
+		RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+				      "Unregistered callback 0 invoked on %s mempool destruction",
+				      name);
+	}
+
+	for (j = 1; j < CB_NUM; j++) {
+		ret = rte_mempool_event_callback_unregister
+					(test_mempool_events_cb, &data[j]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister the callback %zu: %s",
+				      j, rte_strerror(rte_errno));
+	}
+	return TEST_SUCCESS;
+
+fail:
+	for (j = 0; j < CB_NUM; j++)
+		rte_mempool_event_callback_unregister
+					(test_mempool_events_cb, &data[j]);
+	for (i = 0; i < MP_NUM; i++)
+		rte_mempool_free(mp[i]);
+	return TEST_FAILED;
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+}
+
+struct test_mempool_events_safety_data {
+	bool invoked;
+	int (*api_func)(rte_mempool_event_callback *func, void *user_data);
+	rte_mempool_event_callback *cb_func;
+	void *cb_user_data;
+	int ret;
+};
+
+static void
+test_mempool_events_safety_cb(enum rte_mempool_event event,
+			      struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_safety_data *data = user_data;
+
+	RTE_SET_USED(event);
+	RTE_SET_USED(mp);
+	data->invoked = true;
+	data->ret = data->api_func(data->cb_func, data->cb_user_data);
+}
+
+static int
+test_mempool_events_safety(void)
+{
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { \
+		ret = TEST_FAILED; \
+		goto exit; \
+	} while (0)
+
+	struct test_mempool_events_data data;
+	struct test_mempool_events_safety_data sdata[2];
+	struct rte_mempool *mp;
+	size_t i;
+	int ret;
+
+	/* removes itself */
+	sdata[0].api_func = rte_mempool_event_callback_unregister;
+	sdata[0].cb_func = test_mempool_events_safety_cb;
+	sdata[0].cb_user_data = &sdata[0];
+	sdata[0].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[0]);
+	/* inserts a callback after itself */
+	sdata[1].api_func = rte_mempool_event_callback_register;
+	sdata[1].cb_func = test_mempool_events_cb;
+	sdata[1].cb_user_data = &data;
+	sdata[1].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[1]);
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(rte_errno));
+
+	RTE_TEST_ASSERT_EQUAL(sdata[0].ret, 0, "Callback failed to unregister itself: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, 0, "Failed to insert a new callback: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data.invoked, false,
+			      "Inserted callback is invoked on mempool population");
+
+	memset(&data, 0, sizeof(data));
+	sdata[0].invoked = false;
+	rte_mempool_free(mp);
+	mp = NULL;
+	RTE_TEST_ASSERT_EQUAL(sdata[0].invoked, false,
+			      "Callback that unregistered itself was called");
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, -EEXIST,
+			      "New callback inserted twice");
+	RTE_TEST_ASSERT_EQUAL(data.invoked, true,
+			      "Inserted callback is not invoked on mempool destruction");
+
+	rte_mempool_event_callback_unregister(test_mempool_events_cb, &data);
+	for (i = 0; i < RTE_DIM(sdata); i++)
+		rte_mempool_event_callback_unregister
+				(test_mempool_events_safety_cb, &sdata[i]);
+	ret = TEST_SUCCESS;
+
+exit:
+	/* cleanup, don't care which callbacks are already removed */
+	rte_mempool_event_callback_unregister(test_mempool_events_cb, &data);
+	for (i = 0; i < RTE_DIM(sdata); i++)
+		rte_mempool_event_callback_unregister
+				(test_mempool_events_safety_cb, &sdata[i]);
+	/* in case of failure before the planned destruction */
+	rte_mempool_free(mp);
+	return ret;
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+}
+
 static int
 test_mempool(void)
 {
@@ -666,6 +906,14 @@ test_mempool(void)
 	if (test_mempool_basic(default_pool, 1) < 0)
 		GOTO_ERR(ret, err);
 
+	/* test mempool event callbacks */
+	if (test_mempool_events(rte_mempool_populate_default) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events(rte_mempool_populate_anon) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events_safety() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 607419ccaf..8810d08ab5 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -42,6 +42,18 @@ static struct rte_tailq_elem rte_mempool_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_mempool_tailq)
 
+TAILQ_HEAD(mempool_callback_list, rte_tailq_entry);
+
+static struct rte_tailq_elem callback_tailq = {
+	.name = "RTE_MEMPOOL_CALLBACK",
+};
+EAL_REGISTER_TAILQ(callback_tailq)
+
+/* Invoke all registered mempool event callbacks. */
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp);
+
 #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
 #define CALC_CACHE_FLUSHTHRESH(c)	\
 	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
@@ -360,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* Report the mempool as ready only when fully populated. */
+	if (mp->populated_size >= mp->size)
+		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
+
 	rte_mempool_trace_populate_iova(mp, vaddr, iova, len, free_cb, opaque);
 	return i;
 
@@ -722,6 +738,7 @@ rte_mempool_free(struct rte_mempool *mp)
 	}
 	rte_mcfg_tailq_write_unlock();
 
+	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
 	rte_mempool_trace_free(mp);
 	rte_mempool_free_memchunks(mp);
 	rte_mempool_ops_free(mp);
@@ -1356,3 +1373,110 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
 
 	rte_mcfg_mempool_read_unlock();
 }
+
+struct mempool_callback_data {
+	rte_mempool_event_callback *func;
+	void *user_data;
+};
+
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te;
+	void *tmp_te;
+
+	rte_mcfg_tailq_read_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback_data *cb = te->data;
+		rte_mcfg_tailq_read_unlock();
+		cb->func(event, mp, cb->user_data);
+		rte_mcfg_tailq_read_lock();
+	}
+	rte_mcfg_tailq_read_unlock();
+}
+
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback_data *cb;
+	void *tmp_te;
+	int ret;
+
+	if (func == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	rte_mcfg_tailq_write_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		cb = te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			ret = -EEXIST;
+			goto exit;
+		}
+	}
+
+	te = rte_zmalloc("mempool_cb_tail_entry", sizeof(*te), 0);
+	if (te == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback tailq entry!\n");
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb = rte_malloc("mempool_cb_data", sizeof(*cb), 0);
+	if (cb == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback!\n");
+		rte_free(te);
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb->func = func;
+	cb->user_data = user_data;
+	te->data = cb;
+	TAILQ_INSERT_TAIL(list, te, next);
+	ret = 0;
+
+exit:
+	rte_mcfg_tailq_write_unlock();
+	rte_errno = -ret;
+	return ret;
+}
+
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback_data *cb;
+	int ret = -ENOENT;
+
+	rte_mcfg_tailq_write_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH(te, list, next) {
+		cb = te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			TAILQ_REMOVE(list, te, next);
+			ret = 0;
+			break;
+		}
+	}
+	rte_mcfg_tailq_write_unlock();
+
+	if (ret == 0) {
+		rte_free(te);
+		rte_free(cb);
+	}
+	rte_errno = -ret;
+	return ret;
+}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 88bcbc51ef..3285626712 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1769,6 +1769,68 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
 int
 rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
 
+/**
+ * @internal
+ * Mempool event type.
+ */
+enum rte_mempool_event {
+	/** Occurs after a mempool is fully populated. */
+	RTE_MEMPOOL_EVENT_READY = 0,
+	/** Occurs before the destruction of a mempool begins. */
+	RTE_MEMPOOL_EVENT_DESTROY = 1,
+};
+
+/**
+ * @internal
+ * Mempool event callback.
+ *
+ * rte_mempool_event_callback_register() may be called from within the callback,
+ * but the callbacks registered this way will not be invoked for the same event.
+ * rte_mempool_event_callback_unregister() may only be safely called
+ * to remove the running callback.
+ */
+typedef void (rte_mempool_event_callback)(
+		enum rte_mempool_event event,
+		struct rte_mempool *mp,
+		void *user_data);
+
+/**
+ * @internal
+ * Register a callback function invoked on mempool life cycle event.
+ * The function will be invoked in the process
+ * that performs an action which triggers the callback.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data);
+
+/**
+ * @internal
+ * Unregister a callback added with rte_mempool_event_callback_register().
+ * @p func and @p user_data must exactly match registration parameters.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/mempool/version.map b/lib/mempool/version.map
index 9f77da6fff..1b7d7c5456 100644
--- a/lib/mempool/version.map
+++ b/lib/mempool/version.map
@@ -64,3 +64,11 @@ EXPERIMENTAL {
 	__rte_mempool_trace_ops_free;
 	__rte_mempool_trace_set_ops_byname;
 };
+
+INTERNAL {
+	global:
+
+	# added in 21.11
+	rte_mempool_event_callback_register;
+	rte_mempool_event_callback_unregister;
+};
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v6 2/4] mempool: add non-IO flag
  2021-10-16 20:00         ` [dpdk-dev] [PATCH v6 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-10-16 20:00           ` Dmitry Kozlyuk
  2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-16 20:00 UTC (permalink / raw)
  To: dev
  Cc: David Marchand, Matan Azrad, Andrew Rybchenko, Maryam Tahhan,
	Reshma Pattan, Olivier Matz

Mempool is a generic allocator that is not necessarily used for device
IO operations, and its memory is not necessarily used for DMA.
Add MEMPOOL_F_NON_IO flag to mark such mempools automatically
a) if their objects are not contiguous;
b) if IOVA is not available for any object.
Other components can inspect this flag
in order to optimize their memory management.

Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html
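
As a minimal illustration (not part of this patch), a component that
registers mempool memory for DMA could skip pools carrying the flag;
"dummy_do_dma_registration" below is a hypothetical placeholder.

    static int
    dummy_register_mempool_dma(struct rte_mempool *mp)
    {
        /* No object of a NON_IO pool is used for device IO (DMA). */
        if (mp->flags & MEMPOOL_F_NON_IO)
            return 0;
        return dummy_do_dma_registration(mp); /* placeholder */
    }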

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/proc-info/main.c                   |   6 +-
 app/test/test_mempool.c                | 112 +++++++++++++++++++++++++
 doc/guides/rel_notes/release_21_11.rst |   3 +
 lib/mempool/rte_mempool.c              |  10 +++
 lib/mempool/rte_mempool.h              |   2 +
 5 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/app/proc-info/main.c b/app/proc-info/main.c
index a8e928fa9f..8ec9cadd79 100644
--- a/app/proc-info/main.c
+++ b/app/proc-info/main.c
@@ -1295,7 +1295,8 @@ show_mempool(char *name)
 				"\t  -- No cache align (%c)\n"
 				"\t  -- SP put (%c), SC get (%c)\n"
 				"\t  -- Pool created (%c)\n"
-				"\t  -- No IOVA config (%c)\n",
+				"\t  -- No IOVA config (%c)\n"
+				"\t  -- Not used for IO (%c)\n",
 				ptr->name,
 				ptr->socket_id,
 				(flags & MEMPOOL_F_NO_SPREAD) ? 'y' : 'n',
@@ -1303,7 +1304,8 @@ show_mempool(char *name)
 				(flags & MEMPOOL_F_SP_PUT) ? 'y' : 'n',
 				(flags & MEMPOOL_F_SC_GET) ? 'y' : 'n',
 				(flags & MEMPOOL_F_POOL_CREATED) ? 'y' : 'n',
-				(flags & MEMPOOL_F_NO_IOVA_CONTIG) ? 'y' : 'n');
+				(flags & MEMPOOL_F_NO_IOVA_CONTIG) ? 'y' : 'n',
+				(flags & MEMPOOL_F_NON_IO) ? 'y' : 'n');
 			printf("  - Size %u Cache %u element %u\n"
 				"  - header %u trailer %u\n"
 				"  - private data size %u\n",
diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index c39c83256e..caf9c46a29 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -12,6 +12,7 @@
 #include <sys/queue.h>
 
 #include <rte_common.h>
+#include <rte_eal_paging.h>
 #include <rte_log.h>
 #include <rte_debug.h>
 #include <rte_errno.h>
@@ -729,6 +730,109 @@ test_mempool_events_safety(void)
 #pragma pop_macro("RTE_TEST_TRACE_FAILURE")
 }
 
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { \
+		ret = TEST_FAILED; \
+		goto exit; \
+	} while (0)
+
+static int
+test_mempool_flag_non_io_set_when_no_iova_contig_set(void)
+{
+	struct rte_mempool *mp = NULL;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, MEMPOOL_F_NO_IOVA_CONTIG);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(), NULL);
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when NO_IOVA_CONTIG is set");
+	ret = TEST_SUCCESS;
+exit:
+	rte_mempool_free(mp);
+	return ret;
+}
+
+static int
+test_mempool_flag_non_io_unset_when_populated_with_valid_iova(void)
+{
+	const struct rte_memzone *mz;
+	void *virt;
+	rte_iova_t iova;
+	size_t page_size = RTE_PGSIZE_2M;
+	struct rte_mempool *mp;
+	int ret;
+
+	mz = rte_memzone_reserve("test_mempool", 3 * page_size, SOCKET_ID_ANY,
+				 RTE_MEMZONE_IOVA_CONTIG);
+	RTE_TEST_ASSERT_NOT_NULL(mz, "Cannot allocate memory");
+	virt = mz->addr;
+	iova = rte_mem_virt2iova(virt);
+	RTE_TEST_ASSERT_NOT_EQUAL(iova,  RTE_BAD_IOVA, "Cannot get IOVA");
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+
+	ret = rte_mempool_populate_iova(mp, RTE_PTR_ADD(virt, 1 * page_size),
+					RTE_BAD_IOVA, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when mempool is populated with only RTE_BAD_IOVA");
+
+	ret = rte_mempool_populate_iova(mp, virt, iova, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is not unset when mempool is populated with valid IOVA");
+
+	ret = rte_mempool_populate_iova(mp, RTE_PTR_ADD(virt, 2 * page_size),
+					RTE_BAD_IOVA, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is set even when some objects have valid IOVA");
+	ret = TEST_SUCCESS;
+
+exit:
+	rte_mempool_free(mp);
+	rte_memzone_free(mz);
+	return ret;
+}
+
+static int
+test_mempool_flag_non_io_unset_by_default(void)
+{
+	struct rte_mempool *mp;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is set by default");
+	ret = TEST_SUCCESS;
+exit:
+	rte_mempool_free(mp);
+	return ret;
+}
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+
 static int
 test_mempool(void)
 {
@@ -914,6 +1018,14 @@ test_mempool(void)
 	if (test_mempool_events_safety() < 0)
 		GOTO_ERR(ret, err);
 
+	/* test NON_IO flag inference */
+	if (test_mempool_flag_non_io_unset_by_default() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_set_when_no_iova_contig_set() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_unset_when_populated_with_valid_iova() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 4c56cdfeaa..39a8a3d950 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -229,6 +229,9 @@ API Changes
   the crypto/security operation. This field will be used to communicate
   events such as soft expiry with IPsec in lookaside mode.
 
+* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
+  that objects from this pool will not be used for device IO (e.g. DMA).
+
 
 ABI Changes
 -----------
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 8810d08ab5..7d7d97d85d 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -372,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* At least some objects in the pool can now be used for IO. */
+	if (iova != RTE_BAD_IOVA)
+		mp->flags &= ~MEMPOOL_F_NON_IO;
+
 	/* Report the mempool as ready only when fully populated. */
 	if (mp->populated_size >= mp->size)
 		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
@@ -851,6 +855,12 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 		return NULL;
 	}
 
+	/*
+	 * No objects in the pool can be used for IO until it's populated
+	 * with at least some objects with valid IOVA.
+	 */
+	flags |= MEMPOOL_F_NON_IO;
+
 	/* "no cache align" imply "no spread" */
 	if (flags & MEMPOOL_F_NO_CACHE_ALIGN)
 		flags |= MEMPOOL_F_NO_SPREAD;
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 3285626712..408d916a9c 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -257,6 +257,8 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
+/** Internal: no object from the pool can be used for device IO (DMA). */
+#define MEMPOOL_F_NON_IO         0x0040
 
 /**
  * @internal When debug is enabled, store some statistics.
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v6 3/4] common/mlx5: add mempool registration facilities
  2021-10-16 20:00         ` [dpdk-dev] [PATCH v6 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 1/4] mempool: add event callbacks Dmitry Kozlyuk
  2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 2/4] mempool: add non-IO flag Dmitry Kozlyuk
@ 2021-10-16 20:00           ` Dmitry Kozlyuk
  2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
  2021-10-18 10:01           ` [dpdk-dev] [PATCH v7 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-16 20:00 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko, Ray Kinsella, Anatoly Burakov

Add an internal API to register mempools, that is, to create memory
regions (MR) for their memory and store them in a separate database.
The implementation handles multi-process support, so that class
drivers don't need to. Each protection domain has its own database.
Memory regions can be shared within a database if they represent
a single hugepage covering one or more mempools entirely.

Add an internal API to look up an MR key for an address that belongs
to a known mempool. It is the responsibility of a class driver
to extract the mempool from an mbuf.
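
For illustration only, a class driver could use the new calls roughly as
in the sketch below; "sh", "priv", "rxq", "mr_ctrl" and "mb" stand for
driver context and are placeholders, not part of this patch.

    /* Control path: register the Rx queue mempool in the PD. */
    if (mlx5_mr_mempool_register(&sh->share_cache, sh->pd, rxq->mp,
                                 &priv->mp_id) < 0 && rte_errno != EEXIST)
        return -rte_errno;

    /* Datapath slow path: resolve the MR key for an mbuf address. */
    lkey = mlx5_mr_mempool2mr_bh(&sh->share_cache, mr_ctrl,
                                 mb->pool, (uintptr_t)mb->buf_addr);
    if (unlikely(lkey == UINT32_MAX)) {
        /* Address not within the mempool: fall back to another lookup. */
    }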

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 drivers/common/mlx5/mlx5_common_mp.c |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h |  14 +
 drivers/common/mlx5/mlx5_common_mr.c | 580 +++++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h |  17 +
 drivers/common/mlx5/version.map      |   5 +
 5 files changed, 666 insertions(+)

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
index 673a7c31de..6dfc5535e0 100644
--- a/drivers/common/mlx5/mlx5_common_mp.c
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -54,6 +54,56 @@ mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
 	return ret;
 }
 
+/**
+ * @param mp_id
+ *   ID of the MP process.
+ * @param share_cache
+ *   Shared MR cache.
+ * @param pd
+ *   Protection domain.
+ * @param mempool
+ *   Mempool to register or unregister.
+ * @param reg
+ *   True to register the mempool, False to unregister.
+ */
+int
+mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_arg_mempool_reg *arg = &req->args.mempool_reg;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	enum mlx5_mp_req_type type;
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	type = reg ? MLX5_MP_REQ_MEMPOOL_REGISTER :
+		     MLX5_MP_REQ_MEMPOOL_UNREGISTER;
+	mp_init_msg(mp_id, &mp_req, type);
+	arg->share_cache = share_cache;
+	arg->pd = pd;
+	arg->mempool = mempool;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	mlx5_free(mp_rep.msgs);
+	return ret;
+}
+
 /**
  * Request Verbs queue state modification to the primary process.
  *
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
index 6829141fc7..527bf3cad8 100644
--- a/drivers/common/mlx5/mlx5_common_mp.h
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -14,6 +14,8 @@
 enum mlx5_mp_req_type {
 	MLX5_MP_REQ_VERBS_CMD_FD = 1,
 	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_MEMPOOL_REGISTER,
+	MLX5_MP_REQ_MEMPOOL_UNREGISTER,
 	MLX5_MP_REQ_START_RXTX,
 	MLX5_MP_REQ_STOP_RXTX,
 	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
@@ -33,6 +35,12 @@ struct mlx5_mp_arg_queue_id {
 	uint16_t queue_id; /* DPDK queue ID. */
 };
 
+struct mlx5_mp_arg_mempool_reg {
+	struct mlx5_mr_share_cache *share_cache;
+	void *pd; /* NULL for MLX5_MP_REQ_MEMPOOL_UNREGISTER */
+	struct rte_mempool *mempool;
+};
+
 /* Parameters for IPC. */
 struct mlx5_mp_param {
 	enum mlx5_mp_req_type type;
@@ -41,6 +49,8 @@ struct mlx5_mp_param {
 	RTE_STD_C11
 	union {
 		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_mempool_reg mempool_reg;
+		/* MLX5_MP_REQ_MEMPOOL_(UN)REGISTER */
 		struct mlx5_mp_arg_queue_state_modify state_modify;
 		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
 		struct mlx5_mp_arg_queue_id queue_id;
@@ -91,6 +101,10 @@ void mlx5_mp_uninit_secondary(const char *name);
 __rte_internal
 int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
 __rte_internal
+int mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg);
+__rte_internal
 int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
 				   struct mlx5_mp_arg_queue_state_modify *sm);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
index 98fe8698e2..2e039a4e70 100644
--- a/drivers/common/mlx5/mlx5_common_mr.c
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -2,7 +2,10 @@
  * Copyright 2016 6WIND S.A.
  * Copyright 2020 Mellanox Technologies, Ltd
  */
+#include <stddef.h>
+
 #include <rte_eal_memconfig.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_mempool.h>
 #include <rte_malloc.h>
@@ -21,6 +24,29 @@ struct mr_find_contig_memsegs_data {
 	const struct rte_memseg_list *msl;
 };
 
+/* Virtual memory range. */
+struct mlx5_range {
+	uintptr_t start;
+	uintptr_t end;
+};
+
+/** Memory region for a mempool. */
+struct mlx5_mempool_mr {
+	struct mlx5_pmd_mr pmd_mr;
+	uint32_t refcnt; /**< Number of mempools sharing this MR. */
+};
+
+/* Mempool registration. */
+struct mlx5_mempool_reg {
+	LIST_ENTRY(mlx5_mempool_reg) next;
+	/** Registered mempool, used to designate registrations. */
+	struct rte_mempool *mp;
+	/** Memory regions for the address ranges of the mempool. */
+	struct mlx5_mempool_mr *mrs;
+	/** Number of memory regions. */
+	unsigned int mrs_n;
+};
+
 /**
  * Expand B-tree table to a given size. Can't be called with holding
  * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
@@ -1191,3 +1217,557 @@ mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
 	rte_rwlock_read_unlock(&share_cache->rwlock);
 #endif
 }
+
+static int
+mlx5_range_compare_start(const void *lhs, const void *rhs)
+{
+	const struct mlx5_range *r1 = lhs, *r2 = rhs;
+
+	if (r1->start > r2->start)
+		return 1;
+	else if (r1->start < r2->start)
+		return -1;
+	return 0;
+}
+
+static void
+mlx5_range_from_mempool_chunk(struct rte_mempool *mp, void *opaque,
+			      struct rte_mempool_memhdr *memhdr,
+			      unsigned int idx)
+{
+	struct mlx5_range *ranges = opaque, *range = &ranges[idx];
+	uint64_t page_size = rte_mem_page_size();
+
+	RTE_SET_USED(mp);
+	range->start = RTE_ALIGN_FLOOR((uintptr_t)memhdr->addr, page_size);
+	range->end = RTE_ALIGN_CEIL(range->start + memhdr->len, page_size);
+}
+
+/**
+ * Get VA-contiguous ranges of the mempool memory.
+ * Each range start and end is aligned to the system page size.
+ *
+ * @param[in] mp
+ *   Analyzed mempool.
+ * @param[out] out
+ *   Receives the ranges, caller must release it with free().
+ * @param[out] out_n
+ *   Receives the number of @p out elements.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_get_mempool_ranges(struct rte_mempool *mp, struct mlx5_range **out,
+			unsigned int *out_n)
+{
+	struct mlx5_range *chunks;
+	unsigned int chunks_n = mp->nb_mem_chunks, contig_n, i;
+
+	/* Collect page-aligned memory ranges of the mempool. */
+	chunks = calloc(sizeof(chunks[0]), chunks_n);
+	if (chunks == NULL)
+		return -1;
+	rte_mempool_mem_iter(mp, mlx5_range_from_mempool_chunk, chunks);
+	/* Merge adjacent chunks and place them at the beginning. */
+	qsort(chunks, chunks_n, sizeof(chunks[0]), mlx5_range_compare_start);
+	contig_n = 1;
+	for (i = 1; i < chunks_n; i++)
+		if (chunks[i - 1].end != chunks[i].start) {
+			chunks[contig_n - 1].end = chunks[i - 1].end;
+			chunks[contig_n] = chunks[i];
+			contig_n++;
+		}
+	/* Extend the last contiguous chunk to the end of the mempool. */
+	chunks[contig_n - 1].end = chunks[i - 1].end;
+	*out = chunks;
+	*out_n = contig_n;
+	return 0;
+}
+
+/**
+ * Analyze mempool memory to select memory ranges to register.
+ *
+ * @param[in] mp
+ *   Mempool to analyze.
+ * @param[out] out
+ *   Receives memory ranges to register, aligned to the system page size.
+ *   The caller must release them with free().
+ * @param[out] out_n
+ *   Receives the number of @p out items.
+ * @param[out] share_hugepage
+ *   Receives True if the entire pool resides within a single hugepage.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_mempool_reg_analyze(struct rte_mempool *mp, struct mlx5_range **out,
+			 unsigned int *out_n, bool *share_hugepage)
+{
+	struct mlx5_range *ranges = NULL;
+	unsigned int i, ranges_n = 0;
+	struct rte_memseg_list *msl;
+
+	if (mlx5_get_mempool_ranges(mp, &ranges, &ranges_n) < 0) {
+		DRV_LOG(ERR, "Cannot get address ranges for mempool %s",
+			mp->name);
+		return -1;
+	}
+	/* Check if the hugepage of the pool can be shared. */
+	*share_hugepage = false;
+	msl = rte_mem_virt2memseg_list((void *)ranges[0].start);
+	if (msl != NULL) {
+		uint64_t hugepage_sz = 0;
+
+		/* Check that all ranges are on pages of the same size. */
+		for (i = 0; i < ranges_n; i++) {
+			if (hugepage_sz != 0 && hugepage_sz != msl->page_sz)
+				break;
+			hugepage_sz = msl->page_sz;
+		}
+		if (i == ranges_n) {
+			/*
+			 * If the entire pool is within one hugepage,
+			 * combine all ranges into one of the hugepage size.
+			 */
+			uintptr_t reg_start = ranges[0].start;
+			uintptr_t reg_end = ranges[ranges_n - 1].end;
+			uintptr_t hugepage_start =
+				RTE_ALIGN_FLOOR(reg_start, hugepage_sz);
+			uintptr_t hugepage_end = hugepage_start + hugepage_sz;
+			if (reg_end < hugepage_end) {
+				ranges[0].start = hugepage_start;
+				ranges[0].end = hugepage_end;
+				ranges_n = 1;
+				*share_hugepage = true;
+			}
+		}
+	}
+	*out = ranges;
+	*out_n = ranges_n;
+	return 0;
+}
+
+/** Create a registration object for the mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_create(struct rte_mempool *mp, unsigned int mrs_n)
+{
+	struct mlx5_mempool_reg *mpr = NULL;
+
+	mpr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+			  sizeof(*mpr) + mrs_n * sizeof(mpr->mrs[0]),
+			  RTE_CACHE_LINE_SIZE, SOCKET_ID_ANY);
+	if (mpr == NULL) {
+		DRV_LOG(ERR, "Cannot allocate mempool %s registration object",
+			mp->name);
+		return NULL;
+	}
+	mpr->mp = mp;
+	mpr->mrs = (struct mlx5_mempool_mr *)(mpr + 1);
+	mpr->mrs_n = mrs_n;
+	return mpr;
+}
+
+/**
+ * Destroy a mempool registration object.
+ *
+ * @param standalone
+ *   Whether @p mpr owns its MRs exclusively, i.e. they are not shared.
+ */
+static void
+mlx5_mempool_reg_destroy(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mempool_reg *mpr, bool standalone)
+{
+	if (standalone) {
+		unsigned int i;
+
+		for (i = 0; i < mpr->mrs_n; i++)
+			share_cache->dereg_mr_cb(&mpr->mrs[i].pmd_mr);
+	}
+	mlx5_free(mpr);
+}
+
+/** Find registration object of a mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_lookup(struct mlx5_mr_share_cache *share_cache,
+			struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp)
+			break;
+	return mpr;
+}
+
+/** Increment reference counters of MRs used in the registration. */
+static void
+mlx5_mempool_reg_attach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		__atomic_add_fetch(&mpr->mrs[i].refcnt, 1, __ATOMIC_RELAXED);
+}
+
+/**
+ * Decrement reference counters of MRs used in the registration.
+ *
+ * @return True if no more references to @p mpr MRs exist, False otherwise.
+ */
+static bool
+mlx5_mempool_reg_detach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+	bool ret = false;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		ret |= __atomic_sub_fetch(&mpr->mrs[i].refcnt, 1,
+					  __ATOMIC_RELAXED) == 0;
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache,
+				 void *pd, struct rte_mempool *mp)
+{
+	struct mlx5_range *ranges = NULL;
+	struct mlx5_mempool_reg *mpr, *new_mpr;
+	unsigned int i, ranges_n;
+	bool share_hugepage;
+	int ret = -1;
+
+	/* Early check to avoid unnecessary creation of MRs. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+	if (mlx5_mempool_reg_analyze(mp, &ranges, &ranges_n,
+				     &share_hugepage) < 0) {
+		DRV_LOG(ERR, "Cannot get mempool %s memory ranges", mp->name);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	new_mpr = mlx5_mempool_reg_create(mp, ranges_n);
+	if (new_mpr == NULL) {
+		DRV_LOG(ERR,
+			"Cannot create a registration object for mempool %s in PD %p",
+			mp->name, pd);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	/*
+	 * If the entire mempool fits in a single hugepage, the MR for this
+	 * hugepage can be shared across mempools that also fit in it.
+	 */
+	if (share_hugepage) {
+		rte_rwlock_write_lock(&share_cache->rwlock);
+		LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next) {
+			if (mpr->mrs[0].pmd_mr.addr == (void *)ranges[0].start)
+				break;
+		}
+		if (mpr != NULL) {
+			new_mpr->mrs = mpr->mrs;
+			mlx5_mempool_reg_attach(new_mpr);
+			LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+					 new_mpr, next);
+		}
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		if (mpr != NULL) {
+			DRV_LOG(DEBUG, "Shared MR %#x in PD %p for mempool %s with mempool %s",
+				mpr->mrs[0].pmd_mr.lkey, pd, mp->name,
+				mpr->mp->name);
+			ret = 0;
+			goto exit;
+		}
+	}
+	for (i = 0; i < ranges_n; i++) {
+		struct mlx5_mempool_mr *mr = &new_mpr->mrs[i];
+		const struct mlx5_range *range = &ranges[i];
+		size_t len = range->end - range->start;
+
+		if (share_cache->reg_mr_cb(pd, (void *)range->start, len,
+		    &mr->pmd_mr) < 0) {
+			DRV_LOG(ERR,
+				"Failed to create an MR in PD %p for address range "
+				"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+				pd, range->start, range->end, len, mp->name);
+			break;
+		}
+		DRV_LOG(DEBUG,
+			"Created a new MR %#x in PD %p for address range "
+			"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+			mr->pmd_mr.lkey, pd, range->start, range->end, len,
+			mp->name);
+	}
+	if (i != ranges_n) {
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EINVAL;
+		goto exit;
+	}
+	/* Concurrent registration is not supposed to happen. */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr == NULL) {
+		mlx5_mempool_reg_attach(new_mpr);
+		LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+				 new_mpr, next);
+		ret = 0;
+	}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+exit:
+	free(ranges);
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_secondary(struct mlx5_mr_share_cache *share_cache,
+				   void *pd, struct rte_mempool *mp,
+				   struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, pd, mp, true);
+}
+
+/**
+ * Register the memory of a mempool in the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param pd
+ *   Protection domain object.
+ * @param mp
+ *   Mempool to register.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_register_primary(share_cache, pd, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_register_secondary(share_cache, pd, mp,
+							  mp_id);
+	default:
+		return -1;
+	}
+}
+
+static int
+mlx5_mr_mempool_unregister_primary(struct mlx5_mr_share_cache *share_cache,
+				   struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+	bool standalone = false;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp) {
+			LIST_REMOVE(mpr, next);
+			standalone = mlx5_mempool_reg_detach(mpr);
+			if (standalone)
+				/*
+				 * The unlock operation below provides a memory
+				 * barrier due to its store-release semantics.
+				 */
+				++share_cache->dev_gen;
+			break;
+		}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr == NULL) {
+		rte_errno = ENOENT;
+		return -1;
+	}
+	mlx5_mempool_reg_destroy(share_cache, mpr, standalone);
+	return 0;
+}
+
+static int
+mlx5_mr_mempool_unregister_secondary(struct mlx5_mr_share_cache *share_cache,
+				     struct rte_mempool *mp,
+				     struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, NULL, mp, false);
+}
+
+/**
+ * Unregister the memory of a mempool from the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param mp
+ *   Mempool to unregister.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_unregister_primary(share_cache, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_unregister_secondary(share_cache, mp,
+							    mp_id);
+	default:
+		return -1;
+	}
+}
+
+/**
+ * Look up an MR key by an address in a registered mempool.
+ *
+ * @param mpr
+ *   Mempool registration object.
+ * @param addr
+ *   Address within the mempool.
+ * @param entry
+ *   Bottom-half cache entry to fill.
+ *
+ * @return
+ *   MR key or UINT32_MAX on failure, which can only happen
+ *   if the address is not from within the mempool.
+ */
+static uint32_t
+mlx5_mempool_reg_addr2mr(struct mlx5_mempool_reg *mpr, uintptr_t addr,
+			 struct mr_cache_entry *entry)
+{
+	uint32_t lkey = UINT32_MAX;
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++) {
+		const struct mlx5_pmd_mr *mr = &mpr->mrs[i].pmd_mr;
+		uintptr_t mr_addr = (uintptr_t)mr->addr;
+
+		if (mr_addr <= addr) {
+			lkey = rte_cpu_to_be_32(mr->lkey);
+			entry->start = mr_addr;
+			entry->end = mr_addr + mr->len;
+			entry->lkey = lkey;
+			break;
+		}
+	}
+	return lkey;
+}
+
+/**
+ * Update bottom-half cache from the list of mempool registrations.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param entry
+ *   Pointer to an entry in the bottom-half cache to update
+ *   with the MR lkey looked up.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+static uint32_t
+mlx5_lookup_mempool_regs(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mr_ctrl *mr_ctrl,
+			 struct mr_cache_entry *entry,
+			 struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	struct mlx5_mempool_reg *mpr;
+	uint32_t lkey = UINT32_MAX;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in mempool registrations. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr != NULL)
+		lkey = mlx5_mempool_reg_addr2mr(mpr, addr, entry);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/*
+	 * Update local cache. Even if it fails, return the found entry
+	 * to update top-half cache. Next time, this entry will be found
+	 * in the global cache.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half lookup for the address from the mempool.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+uint32_t
+mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+		      struct mlx5_mr_ctrl *mr_ctrl,
+		      struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		lkey = mlx5_lookup_mempool_regs(share_cache, mr_ctrl, repl,
+						mp, addr);
+		/* Can only fail if the address is not from the mempool. */
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
index 6e465a05e9..685ac98e08 100644
--- a/drivers/common/mlx5/mlx5_common_mr.h
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -13,6 +13,7 @@
 
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_mbuf.h>
 #include <rte_memory.h>
 
 #include "mlx5_glue.h"
@@ -75,6 +76,7 @@ struct mlx5_mr_ctrl {
 } __rte_packed;
 
 LIST_HEAD(mlx5_mr_list, mlx5_mr);
+LIST_HEAD(mlx5_mempool_reg_list, mlx5_mempool_reg);
 
 /* Global per-device MR cache. */
 struct mlx5_mr_share_cache {
@@ -83,6 +85,7 @@ struct mlx5_mr_share_cache {
 	struct mlx5_mr_btree cache; /* Global MR cache table. */
 	struct mlx5_mr_list mr_list; /* Registered MR list. */
 	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+	struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */
 	mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
 	mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 } __rte_packed;
@@ -136,6 +139,10 @@ uint32_t mlx5_mr_addr2mr_bh(void *pd, struct mlx5_mp_id *mp_id,
 			    struct mlx5_mr_ctrl *mr_ctrl,
 			    uintptr_t addr, unsigned int mr_ext_memseg_en);
 __rte_internal
+uint32_t mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+			       struct mlx5_mr_ctrl *mr_ctrl,
+			       struct rte_mempool *mp, uintptr_t addr);
+__rte_internal
 void mlx5_mr_release_cache(struct mlx5_mr_share_cache *mr_cache);
 __rte_internal
 void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
@@ -179,4 +186,14 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr);
 __rte_internal
 void
 mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb);
+
+__rte_internal
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+__rte_internal
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+
 #endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index d3c5040aac..85100d5afb 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -152,4 +152,9 @@ INTERNAL {
 	mlx5_realloc;
 
 	mlx5_translate_port_name; # WINDOWS_NO_EXPORT
+
+	mlx5_mr_mempool_register;
+	mlx5_mr_mempool_unregister;
+	mlx5_mp_req_mempool_reg;
+	mlx5_mr_mempool2mr_bh;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v6 4/4] net/mlx5: support mempool registration
  2021-10-16 20:00         ` [dpdk-dev] [PATCH v6 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                             ` (2 preceding siblings ...)
  2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
@ 2021-10-16 20:00           ` Dmitry Kozlyuk
  2021-10-18 10:01           ` [dpdk-dev] [PATCH v7 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-16 20:00 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko

When the first port in a given protection domain (PD) starts,
install a mempool event callback for this PD and register all existing
memory regions (MR) for it. When the last port in a PD closes,
remove the callback and unregister all mempools for this PD.
This behavior can be switched off with a new devarg: mr_mempool_reg_en.

On the TX slow path, i.e. when an MR key for the address of the buffer
to send is not in the local cache, first try to retrieve it from
the database of registered mempools. Supported are direct and indirect
mbufs, as well as externally-attached ones from the MLX5 MPRQ feature.
Lookup in the database of non-mempool memory is used as the last resort.

RX mempools are registered regardless of the devarg value.
On the RX data path only the local cache and the mempool database are used.
If implicit mempool registration is disabled, these mempools
are unregistered at port stop, releasing the MRs.
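
For example, implicit registration could be disabled per device from the
application command line via the new devarg (the PCI address below is a
placeholder):

    dpdk-testpmd -a <PCI_BDF>,mr_mempool_reg_en=0 -- -i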

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  13 +++
 doc/guides/rel_notes/release_21_11.rst |   6 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 +++++++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h                |  10 ++
 drivers/net/mlx5/mlx5_mr.c             | 120 +++++--------------
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 ++--
 drivers/net/mlx5/mlx5_rxq.c            |  13 +++
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++++++++++--
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 12 files changed, 347 insertions(+), 116 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index bae73f42d8..106e32e1c4 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1001,6 +1001,19 @@ Driver options
 
   Enabled by default.
 
+- ``mr_mempool_reg_en`` parameter [int]
+
+  A nonzero value enables implicit registration of DMA memory of all mempools
+  except those having ``MEMPOOL_F_NON_IO``. This flag is set automatically
+  for mempools populated with non-contiguous objects or those without IOVA.
+  The effect is that when a packet from a mempool is transmitted,
+  its memory is already registered for DMA in the PMD and no registration
+  will happen on the data path. The tradeoff is extra work on the creation
+  of each mempool and increased HW resource use if some mempools
+  are not used with MLX5 devices.
+
+  Enabled by default.
+
 - ``representor`` parameter [list]
 
   This parameter can be used to instantiate DPDK Ethernet devices from
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 39a8a3d950..f141999a0d 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -159,6 +159,12 @@ New Features
   * Added tests to verify tunnel header verification in IPsec inbound.
   * Added tests to verify inner checksum.
 
+* **Updated Mellanox mlx5 driver.**
+
+  Updated the Mellanox mlx5 driver with new features and improvements, including:
+
+  * Added implicit mempool registration to avoid data path hiccups (opt-out).
+
 
 Removed Items
 -------------
diff --git a/drivers/net/mlx5/linux/mlx5_mp_os.c b/drivers/net/mlx5/linux/mlx5_mp_os.c
index 3a4aa766f8..d2ac375a47 100644
--- a/drivers/net/mlx5/linux/mlx5_mp_os.c
+++ b/drivers/net/mlx5/linux/mlx5_mp_os.c
@@ -20,6 +20,45 @@
 #include "mlx5_tx.h"
 #include "mlx5_utils.h"
 
+/**
+ * Handle a port-agnostic message.
+ *
+ * @return
+ *   0 on success, 1 when message is not port-agnostic, (-1) on error.
+ */
+static int
+mlx5_mp_os_handle_port_agnostic(const struct rte_mp_msg *mp_msg,
+				const void *peer)
+{
+	struct rte_mp_msg mp_res;
+	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
+	const struct mlx5_mp_param *param =
+		(const struct mlx5_mp_param *)mp_msg->param;
+	const struct mlx5_mp_arg_mempool_reg *mpr;
+	struct mlx5_mp_id mp_id;
+
+	switch (param->type) {
+	case MLX5_MP_REQ_MEMPOOL_REGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_register(mpr->share_cache,
+						       mpr->pd, mpr->mempool,
+						       NULL);
+		return rte_mp_reply(&mp_res, peer);
+	case MLX5_MP_REQ_MEMPOOL_UNREGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_unregister(mpr->share_cache,
+							 mpr->mempool, NULL);
+		return rte_mp_reply(&mp_res, peer);
+	default:
+		return 1;
+	}
+	return -1;
+}
+
 int
 mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
@@ -34,6 +73,11 @@ mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/* Port-agnostic messages. */
+	ret = mlx5_mp_os_handle_port_agnostic(mp_msg, peer);
+	if (ret <= 0)
+		return ret;
+	/* Port-specific messages. */
 	if (!rte_eth_dev_is_valid_port(param->port_id)) {
 		rte_errno = ENODEV;
 		DRV_LOG(ERR, "port %u invalid port ID", param->port_id);
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 3746057673..e036ed1435 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1034,8 +1034,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
-		mp_id.port_id = eth_dev->data->port_id;
-		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+		mlx5_mp_id_init(&mp_id, eth_dev->data->port_id);
 		/* Receive command fd from primary process */
 		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
@@ -2133,6 +2132,7 @@ mlx5_os_config_default(struct mlx5_dev_config *config)
 	config->txqs_inline = MLX5_ARG_UNSET;
 	config->vf_nl_en = 1;
 	config->mr_ext_memseg_en = 1;
+	config->mr_mempool_reg_en = 1;
 	config->mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	config->mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	config->dv_esw_en = 1;
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 45ccfe2784..1e1b8b736b 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -181,6 +181,9 @@
 /* Device parameter to configure allow or prevent duplicate rules pattern. */
 #define MLX5_ALLOW_DUPLICATE_PATTERN "allow_duplicate_pattern"
 
+/* Device parameter to configure implicit registration of mempool memory. */
+#define MLX5_MR_MEMPOOL_REG_EN "mr_mempool_reg_en"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1088,6 +1091,141 @@ mlx5_alloc_rxtx_uars(struct mlx5_dev_ctx_shared *sh,
 	return err;
 }
 
+/**
+ * Unregister the mempool from the protection domain.
+ *
+ * @param sh
+ *   Pointer to the device shared context.
+ * @param mp
+ *   Mempool being unregistered.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister(struct mlx5_dev_ctx_shared *sh,
+				       struct rte_mempool *mp)
+{
+	struct mlx5_mp_id mp_id;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	if (mlx5_mr_mempool_unregister(&sh->share_cache, mp, &mp_id) < 0)
+		DRV_LOG(WARNING, "Failed to unregister mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to register mempools
+ * for the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_register_cb(struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+	int ret;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	ret = mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp, &mp_id);
+	if (ret < 0 && rte_errno != EEXIST)
+		DRV_LOG(ERR, "Failed to register existing mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to unregister mempools
+ * from the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister_cb(struct rte_mempool *mp, void *arg)
+{
+	mlx5_dev_ctx_shared_mempool_unregister
+				((struct mlx5_dev_ctx_shared *)arg, mp);
+}
+
+/**
+ * Mempool life cycle callback for Ethernet devices.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   Associated mempool.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_event_cb(enum rte_mempool_event event,
+				     struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+
+	switch (event) {
+	case RTE_MEMPOOL_EVENT_READY:
+		mlx5_mp_id_init(&mp_id, 0);
+		if (mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp,
+					     &mp_id) < 0)
+			DRV_LOG(ERR, "Failed to register new mempool %s for PD %p: %s",
+				mp->name, sh->pd, rte_strerror(rte_errno));
+		break;
+	case RTE_MEMPOOL_EVENT_DESTROY:
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+		break;
+	}
+}
+
+/**
+ * Callback used when implicit mempool registration is disabled
+ * in order to track Rx mempool destruction.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   An Rx mempool registered explicitly when the port is started.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_rx_mempool_event_cb(enum rte_mempool_event event,
+					struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+
+	if (event == RTE_MEMPOOL_EVENT_DESTROY)
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+}
+
+int
+mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = priv->sh;
+	int ret;
+
+	/* Check if we only need to track Rx mempool destruction. */
+	if (!priv->config.mr_mempool_reg_en) {
+		ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+		return ret == 0 || rte_errno == EEXIST ? 0 : ret;
+	}
+	/* Callback for this shared context may already be registered. */
+	ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret != 0 && rte_errno != EEXIST)
+		return ret;
+	/* Register mempools only once for this shared context. */
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_register_cb, sh);
+	return 0;
+}
+
 /**
  * Allocate shared device context. If there is multiport device the
  * master and representors will share this context, if there is single
@@ -1287,6 +1425,8 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 void
 mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 {
+	int ret;
+
 	pthread_mutex_lock(&mlx5_dev_ctx_list_mutex);
 #ifdef RTE_LIBRTE_MLX5_DEBUG
 	/* Check the object presence in the list. */
@@ -1307,6 +1447,15 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	if (--sh->refcnt)
 		goto exit;
+	/* Stop watching for mempool events and unregister all mempools. */
+	ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret < 0 && rte_errno == ENOENT)
+		ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_unregister_cb,
+				 sh);
 	/* Remove from memory callback device list. */
 	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
 	LIST_REMOVE(sh, mem_event_cb);
@@ -1997,6 +2146,8 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
 		config->decap_en = !!tmp;
 	} else if (strcmp(MLX5_ALLOW_DUPLICATE_PATTERN, key) == 0) {
 		config->allow_duplicate_pattern = !!tmp;
+	} else if (strcmp(MLX5_MR_MEMPOOL_REG_EN, key) == 0) {
+		config->mr_mempool_reg_en = !!tmp;
 	} else {
 		DRV_LOG(WARNING, "%s: unknown parameter", key);
 		rte_errno = EINVAL;
@@ -2058,6 +2209,7 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
 		MLX5_SYS_MEM_EN,
 		MLX5_DECAP_EN,
 		MLX5_ALLOW_DUPLICATE_PATTERN,
+		MLX5_MR_MEMPOOL_REG_EN,
 		NULL,
 	};
 	struct rte_kvargs *kvlist;
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 3581414b78..fe533fcc81 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -155,6 +155,13 @@ struct mlx5_flow_dump_ack {
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
+/** Initialize a multi-process ID. */
+static inline void
+mlx5_mp_id_init(struct mlx5_mp_id *mp_id, uint16_t port_id)
+{
+	mp_id->port_id = port_id;
+	strlcpy(mp_id->name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+}
 
 LIST_HEAD(mlx5_dev_list, mlx5_dev_ctx_shared);
 
@@ -270,6 +277,8 @@ struct mlx5_dev_config {
 	unsigned int dv_miss_info:1; /* restore packet after partial hw miss */
 	unsigned int allow_duplicate_pattern:1;
 	/* Allow/Prevent the duplicate rules pattern. */
+	unsigned int mr_mempool_reg_en:1;
+	/* Allow/prevent implicit mempool memory registration. */
 	struct {
 		unsigned int enabled:1; /* Whether MPRQ is enabled. */
 		unsigned int stride_num_n; /* Number of strides. */
@@ -1498,6 +1507,7 @@ struct mlx5_dev_ctx_shared *
 mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 			   const struct mlx5_dev_config *config);
 void mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh);
+int mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev);
 void mlx5_free_table_hash_list(struct mlx5_priv *priv);
 int mlx5_alloc_table_hash_list(struct mlx5_priv *priv);
 void mlx5_set_min_inline(struct mlx5_dev_spawn_data *spawn,
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 44afda731f..55d27b50b9 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -65,30 +65,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Bottom-half of LKey search on Rx.
- *
- * @param rxq
- *   Pointer to Rx queue structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-uint32_t
-mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
-{
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
-	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
-	struct mlx5_priv *priv = rxq_ctrl->priv;
-
-	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, mr_ctrl, addr,
-				  priv->config.mr_ext_memseg_en);
-}
-
 /**
  * Bottom-half of LKey search on Tx.
  *
@@ -128,9 +104,36 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 uint32_t
 mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 {
+	struct mlx5_txq_ctrl *txq_ctrl =
+		container_of(txq, struct mlx5_txq_ctrl, txq);
+	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
+	struct mlx5_priv *priv = txq_ctrl->priv;
 	uintptr_t addr = (uintptr_t)mb->buf_addr;
 	uint32_t lkey;
 
+	if (priv->config.mr_mempool_reg_en) {
+		struct rte_mempool *mp = NULL;
+		struct mlx5_mprq_buf *buf;
+
+		if (!RTE_MBUF_HAS_EXTBUF(mb)) {
+			mp = mlx5_mb2mp(mb);
+		} else if (mb->shinfo->free_cb == mlx5_mprq_buf_free_cb) {
+			/* Recover MPRQ mempool. */
+			buf = mb->shinfo->fcb_opaque;
+			mp = buf->mp;
+		}
+		if (mp != NULL) {
+			lkey = mlx5_mr_mempool2mr_bh(&priv->sh->share_cache,
+						     mr_ctrl, mp, addr);
+			/*
+			 * Lookup can only fail on invalid input, e.g. "addr"
+			 * is not from "mp" or "mp" has MEMPOOL_F_NON_IO set.
+			 */
+			if (lkey != UINT32_MAX)
+				return lkey;
+		}
+		/* Fall back to the generic mechanism in corner cases. */
+	}
 	lkey = mlx5_tx_addr2mr_bh(txq, addr);
 	if (lkey == UINT32_MAX && rte_errno == ENXIO) {
 		/* Mempool may have externally allocated memory. */
@@ -392,72 +395,3 @@ mlx5_tx_update_ext_mp(struct mlx5_txq_data *txq, uintptr_t addr,
 	mlx5_mr_update_ext_mp(ETH_DEV(priv), mr_ctrl, mp);
 	return mlx5_tx_addr2mr_bh(txq, addr);
 }
-
-/* Called during rte_mempool_mem_iter() by mlx5_mr_update_mp(). */
-static void
-mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
-		     struct rte_mempool_memhdr *memhdr,
-		     unsigned mem_idx __rte_unused)
-{
-	struct mr_update_mp_data *data = opaque;
-	struct rte_eth_dev *dev = data->dev;
-	struct mlx5_priv *priv = dev->data->dev_private;
-
-	uint32_t lkey;
-
-	/* Stop iteration if failed in the previous walk. */
-	if (data->ret < 0)
-		return;
-	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr,
-				  priv->config.mr_ext_memseg_en);
-	if (lkey == UINT32_MAX)
-		data->ret = -1;
-}
-
-/**
- * Register entire memory chunks in a Mempool.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param mp
- *   Pointer to registering Mempool.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-int
-mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		  struct rte_mempool *mp)
-{
-	struct mr_update_mp_data data = {
-		.dev = dev,
-		.mr_ctrl = mr_ctrl,
-		.ret = 0,
-	};
-	uint32_t flags = rte_pktmbuf_priv_flags(mp);
-
-	if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
-		/*
-		 * The pinned external buffer should be registered for DMA
-		 * operations by application. The mem_list of the pool contains
-		 * the list of chunks with mbuf structures w/o built-in data
-		 * buffers and DMA actually does not happen there, no need
-		 * to create MR for these chunks.
-		 */
-		return 0;
-	}
-	DRV_LOG(DEBUG, "Port %u Rx queue registering mp %s "
-		       "having %u chunks.", dev->data->port_id,
-		       mp->name, mp->nb_mem_chunks);
-	rte_mempool_mem_iter(mp, mlx5_mr_update_mp_cb, &data);
-	if (data.ret < 0 && rte_errno == ENXIO) {
-		/* Mempool may have externally allocated memory. */
-		return mlx5_mr_update_ext_mp(dev, mr_ctrl, mp);
-	}
-	return data.ret;
-}
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 4a7fab6df2..c984e777b5 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -22,7 +22,5 @@
 
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
-int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		      struct rte_mempool *mp);
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 2b7ad3e48b..1b00076fe7 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -275,13 +275,11 @@ uint16_t mlx5_rx_burst_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 				uint16_t pkts_n);
 
-/* mlx5_mr.c */
-
-uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
+static int mlx5_rxq_mprq_enabled(struct mlx5_rxq_data *rxq);
 
 /**
- * Query LKey from a packet buffer for Rx. No need to flush local caches for Rx
- * as mempool is pre-configured and static.
+ * Query LKey from a packet buffer for Rx. No need to flush local caches
+ * as the Rx mempool database entries are valid for the lifetime of the queue.
  *
  * @param rxq
  *   Pointer to Rx queue structure.
@@ -290,11 +288,14 @@ uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
  *
  * @return
  *   Searched LKey on success, UINT32_MAX on no match.
+ *   This function always succeeds on valid input.
  */
 static __rte_always_inline uint32_t
 mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 {
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
+	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct rte_mempool *mp;
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
@@ -302,8 +303,14 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
-	/* Take slower bottom-half (Binary Search) on miss. */
-	return mlx5_rx_addr2mr_bh(rxq, addr);
+	/*
+	 * Slower search in the mempool database on miss.
+	 * During queue creation rxq->sh is not yet set, so we use rxq_ctrl.
+	 */
+	rxq_ctrl = container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	mp = mlx5_rxq_mprq_enabled(rxq) ? rxq->mprq_mp : rxq->mp;
+	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->priv->sh->share_cache,
+				     mr_ctrl, mp, addr);
 }
 
 #define mlx5_rx_mb2mr(rxq, mb) mlx5_rx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index b68443bed5..247f36e5d7 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1162,6 +1162,7 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 	unsigned int strd_sz_n = 0;
 	unsigned int i;
 	unsigned int n_ibv = 0;
+	int ret;
 
 	if (!mlx5_mprq_enabled(dev))
 		return 0;
@@ -1241,6 +1242,16 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
+	ret = mlx5_mr_mempool_register(&priv->sh->share_cache, priv->sh->pd,
+				       mp, &priv->mp_id);
+	if (ret < 0 && rte_errno != EEXIST) {
+		ret = rte_errno;
+		DRV_LOG(ERR, "port %u failed to register a mempool for Multi-Packet RQ",
+			dev->data->port_id);
+		rte_mempool_free(mp);
+		rte_errno = ret;
+		return -rte_errno;
+	}
 	priv->mprq_mp = mp;
 exit:
 	/* Set mempool for each Rx queue. */
@@ -1443,6 +1454,8 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		/* rte_errno is already set. */
 		goto error;
 	}
+	/* Rx queues don't use this pointer, but we want a valid structure. */
+	tmpl->rxq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	tmpl->socket = socket;
 	if (dev->data->dev_conf.intr_conf.rxq)
 		tmpl->irq = 1;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 54173bfacb..3cbf5816a1 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -105,6 +105,59 @@ mlx5_txq_start(struct rte_eth_dev *dev)
 	return -rte_errno;
 }
 
+/**
+ * Translate the chunk address to an MR key in order to put it into the cache.
+ */
+static void
+mlx5_rxq_mempool_register_cb(struct rte_mempool *mp, void *opaque,
+			     struct rte_mempool_memhdr *memhdr,
+			     unsigned int idx)
+{
+	struct mlx5_rxq_data *rxq = opaque;
+
+	RTE_SET_USED(mp);
+	RTE_SET_USED(idx);
+	mlx5_rx_addr2mr(rxq, (uintptr_t)memhdr->addr);
+}
+
+/**
+ * Register Rx queue mempools and fill the Rx queue cache.
+ * This function tolerates repeated mempool registration.
+ *
+ * @param[in] rxq_ctrl
+ *   Rx queue control data.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+static int
+mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	struct mlx5_priv *priv = rxq_ctrl->priv;
+	struct rte_mempool *mp;
+	uint32_t s;
+	int ret = 0;
+
+	mlx5_mr_flush_local_cache(&rxq_ctrl->rxq.mr_ctrl);
+	/* MPRQ mempool is registered on creation, just fill the cache. */
+	if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
+		rte_mempool_mem_iter(rxq_ctrl->rxq.mprq_mp,
+				     mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+		return 0;
+	}
+	for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++) {
+		mp = rxq_ctrl->rxq.rxseg[s].mp;
+		ret = mlx5_mr_mempool_register(&priv->sh->share_cache,
+					       priv->sh->pd, mp, &priv->mp_id);
+		if (ret < 0 && rte_errno != EEXIST)
+			return ret;
+		rte_mempool_mem_iter(mp, mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+	}
+	return 0;
+}
+
 /**
  * Stop traffic on Rx queues.
  *
@@ -152,18 +205,13 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (!rxq_ctrl)
 			continue;
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-			/* Pre-register Rx mempools. */
-			if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
-				mlx5_mr_update_mp(dev, &rxq_ctrl->rxq.mr_ctrl,
-						  rxq_ctrl->rxq.mprq_mp);
-			} else {
-				uint32_t s;
-
-				for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++)
-					mlx5_mr_update_mp
-						(dev, &rxq_ctrl->rxq.mr_ctrl,
-						rxq_ctrl->rxq.rxseg[s].mp);
-			}
+			/*
+			 * Pre-register the mempools. Regardless of whether
+			 * the implicit registration is enabled or not,
+			 * Rx mempool destruction is tracked to free MRs.
+			 */
+			if (mlx5_rxq_mempool_register(rxq_ctrl) < 0)
+				goto error;
 			ret = rxq_alloc_elts(rxq_ctrl);
 			if (ret)
 				goto error;
@@ -1124,6 +1172,11 @@ mlx5_dev_start(struct rte_eth_dev *dev)
 			dev->data->port_id, strerror(rte_errno));
 		goto error;
 	}
+	if (mlx5_dev_ctx_shared_mempool_subscribe(dev) != 0) {
+		DRV_LOG(ERR, "port %u failed to subscribe for mempool life cycle: %s",
+			dev->data->port_id, rte_strerror(rte_errno));
+		goto error;
+	}
 	rte_wmb();
 	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
 	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
diff --git a/drivers/net/mlx5/windows/mlx5_os.c b/drivers/net/mlx5/windows/mlx5_os.c
index 26fa927039..149253d174 100644
--- a/drivers/net/mlx5/windows/mlx5_os.c
+++ b/drivers/net/mlx5/windows/mlx5_os.c
@@ -1116,6 +1116,7 @@ mlx5_os_net_probe(struct rte_device *dev)
 	dev_config.txqs_inline = MLX5_ARG_UNSET;
 	dev_config.vf_nl_en = 0;
 	dev_config.mr_ext_memseg_en = 1;
+	dev_config.mr_mempool_reg_en = 1;
 	dev_config.mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	dev_config.mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	dev_config.dv_esw_en = 0;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v7 0/4] net/mlx5: implicit mempool registration
  2021-10-16 20:00         ` [dpdk-dev] [PATCH v6 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                             ` (3 preceding siblings ...)
  2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
@ 2021-10-18 10:01           ` Dmitry Kozlyuk
  2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 1/4] mempool: add event callbacks Dmitry Kozlyuk
                               ` (4 more replies)
  4 siblings, 5 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 10:01 UTC (permalink / raw)
  To: dev

MLX5 hardware has its internal IOMMU where PMD registers the memory.
On the data path, PMD translates VA into a key consumed by the device
IOMMU.  It is impractical for the PMD to register all allocated memory
because of increased lookup cost both in HW and SW.  Most often mbuf
memory comes from mempools, so if PMD tracks them, it can almost always
have mbuf memory registered before an mbuf hits the PMD. This patchset
adds such tracking in the PMD and internal API to support it.

Please see [1] for the discussion of the patch 2/4
and how it can be useful outside of the MLX5 PMD.

[1]: http://inbox.dpdk.org/dev/CH0PR12MB509112FADB778AB28AF3771DB9F99@CH0PR12MB5091.namprd12.prod.outlook.com/

v7 (internal CI):
    1. Fix unit test compilation issues with GCC.
    2. Keep rte_mempool_event description non-internal: Doxygen treats
       it as not documented otherwise, "doc" target fails.
v6:
    Fix compilation issue in proc-info (CI).
v5:
    1. Change non-IO flag inference + various fixes (Andrew).
    2. Fix callback unregistration from secondary processes (Olivier).
    3. Support non-IO flag in proc-dump (David).
    4. Fix the usage of locks (Olivier).
    5. Avoid resource leaks in unit test (Olivier).
v4: (Andrew)
    1. Improve mempool event callbacks unit tests and documentation.
    2. Make MEMPOOL_F_NON_IO internal and automatically inferred.
       Add unit tests for the inference logic.
v3: Improve wording and naming; fix typos (Thomas).
v2 (internal review and testing):
    1. Change tracked mempool event from being created (CREATE) to being
       fully populated (READY), which is the state PMD is interested in.
    2. Unit test the new mempool callback API.
    3. Remove bogus "error" messages in normal conditions.
    4. Fixes in PMD.

Dmitry Kozlyuk (4):
  mempool: add event callbacks
  mempool: add non-IO flag
  common/mlx5: add mempool registration facilities
  net/mlx5: support mempool registration

 app/proc-info/main.c                   |   6 +-
 app/test/test_mempool.c                | 360 +++++++++++++++
 doc/guides/nics/mlx5.rst               |  13 +
 doc/guides/rel_notes/release_21_11.rst |   9 +
 drivers/common/mlx5/mlx5_common_mp.c   |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h   |  14 +
 drivers/common/mlx5/mlx5_common_mr.c   | 580 +++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h   |  17 +
 drivers/common/mlx5/version.map        |   5 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 ++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++
 drivers/net/mlx5/mlx5.h                |  10 +
 drivers/net/mlx5/mlx5_mr.c             | 120 ++---
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 +-
 drivers/net/mlx5/mlx5_rxq.c            |  13 +
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++-
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 lib/mempool/rte_mempool.c              | 134 ++++++
 lib/mempool/rte_mempool.h              |  64 +++
 lib/mempool/version.map                |   8 +
 22 files changed, 1586 insertions(+), 118 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v7 1/4] mempool: add event callbacks
  2021-10-18 10:01           ` [dpdk-dev] [PATCH v7 0/4] net/mlx5: implicit " Dmitry Kozlyuk
@ 2021-10-18 10:01             ` Dmitry Kozlyuk
  2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 2/4] mempool: add non-IO flag Dmitry Kozlyuk
                               ` (3 subsequent siblings)
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 10:01 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Andrew Rybchenko, Olivier Matz, Ray Kinsella

Data path performance can benefit if the PMD knows which memory it will
need to handle in advance, before the first mbuf is sent to the PMD.
It is impractical, however, to consider all allocated memory for this
purpose. Most often mbuf memory comes from mempools that can come and
go. PMD can enumerate existing mempools on device start, but it also
needs to track creation and destruction of mempools after the forwarding
starts but before an mbuf from the new mempool is sent to the device.

Add an API to register callbacks for mempool life cycle events:
* rte_mempool_event_callback_register()
* rte_mempool_event_callback_unregister()
Currently tracked events are:
* RTE_MEMPOOL_EVENT_READY (after populating a mempool)
* RTE_MEMPOOL_EVENT_DESTROY (before freeing a mempool)
Provide a unit test for the new API.
The new API is internal, because it is primarily demanded by PMDs that
may need to deal with any mempool and do not control mempool creation,
while an application, on the other hand, knows which mempools it creates
and does not care about internal mempools that PMDs might create.
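
As a rough illustration (not part of this patch), a class driver could
consume the API as sketched below. The callback signature, event values
and rte_mempool_event_callback_register() are the ones added here, while
my_dev, my_dev_register_pool() and my_dev_unregister_pool() are
hypothetical placeholders for driver-specific state and logic:

#include <rte_errno.h>
#include <rte_mempool.h>

struct my_dev;	/* hypothetical driver context, not part of the patch */
void my_dev_register_pool(struct my_dev *dev, struct rte_mempool *mp);
void my_dev_unregister_pool(struct my_dev *dev, struct rte_mempool *mp);

static void
my_dev_mempool_event(enum rte_mempool_event event, struct rte_mempool *mp,
		     void *user_data)
{
	struct my_dev *dev = user_data;

	if (event == RTE_MEMPOOL_EVENT_READY)
		my_dev_register_pool(dev, mp);	/* pool fully populated */
	else if (event == RTE_MEMPOOL_EVENT_DESTROY)
		my_dev_unregister_pool(dev, mp);	/* pool about to be freed */
}

static int
my_dev_subscribe(struct my_dev *dev)
{
	int ret;

	/* EEXIST only means this (func, user_data) pair is already set. */
	ret = rte_mempool_event_callback_register(my_dev_mempool_event, dev);
	return (ret == 0 || rte_errno == EEXIST) ? 0 : ret;
}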

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/test/test_mempool.c   | 248 ++++++++++++++++++++++++++++++++++++++
 lib/mempool/rte_mempool.c | 124 +++++++++++++++++++
 lib/mempool/rte_mempool.h |  62 ++++++++++
 lib/mempool/version.map   |   8 ++
 4 files changed, 442 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index 66bc8d86b7..c39c83256e 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -14,6 +14,7 @@
 #include <rte_common.h>
 #include <rte_log.h>
 #include <rte_debug.h>
+#include <rte_errno.h>
 #include <rte_memory.h>
 #include <rte_launch.h>
 #include <rte_cycles.h>
@@ -489,6 +490,245 @@ test_mp_mem_init(struct rte_mempool *mp,
 	data->ret = 0;
 }
 
+struct test_mempool_events_data {
+	struct rte_mempool *mp;
+	enum rte_mempool_event event;
+	bool invoked;
+};
+
+static void
+test_mempool_events_cb(enum rte_mempool_event event,
+		       struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_data *data = user_data;
+
+	data->mp = mp;
+	data->event = event;
+	data->invoked = true;
+}
+
+static int
+test_mempool_events(int (*populate)(struct rte_mempool *mp))
+{
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { goto fail; } while (0)
+
+	static const size_t CB_NUM = 3;
+	static const size_t MP_NUM = 2;
+
+	struct test_mempool_events_data data[CB_NUM];
+	struct rte_mempool *mp[MP_NUM], *freed;
+	char name[RTE_MEMPOOL_NAMESIZE];
+	size_t i, j;
+	int ret;
+
+	memset(mp, 0, sizeof(mp));
+	for (i = 0; i < CB_NUM; i++) {
+		ret = rte_mempool_event_callback_register
+				(test_mempool_events_cb, &data[i]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to register the callback %zu: %s",
+				      i, rte_strerror(rte_errno));
+	}
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb, mp);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback");
+	/* NULL argument has no special meaning in this API. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    NULL);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback with NULL argument");
+
+	/* Create mempool 0 that will be observed by all callbacks. */
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty0");
+	mp[0] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[0], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	for (j = 0; j < CB_NUM; j++)
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, false,
+				      "Callback %zu invoked on %s mempool creation",
+				      j, name);
+
+	rte_mempool_set_ops_byname(mp[0], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[0]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(rte_errno));
+	for (j = 0; j < CB_NUM; j++) {
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					"Callback %zu not invoked on mempool %s population",
+					j, name);
+		RTE_TEST_ASSERT_EQUAL(data[j].event,
+					RTE_MEMPOOL_EVENT_READY,
+					"Wrong callback invoked, expected READY");
+		RTE_TEST_ASSERT_EQUAL(data[j].mp, mp[0],
+					"Callback %zu invoked for a wrong mempool instead of %s",
+					j, name);
+	}
+
+	/* Check that unregistered callback 0 observes no events. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    &data[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister callback 0: %s",
+			      rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty1");
+	mp[1] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[1], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp[1], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[1]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[1]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+			      "Unregistered callback 0 invoked on %s mempool population",
+			      name);
+
+	for (i = 0; i < MP_NUM; i++) {
+		memset(&data, 0, sizeof(data));
+		sprintf(name, "empty%zu", i);
+		rte_mempool_free(mp[i]);
+		/*
+		 * Save pointer to check that it was passed to the callback,
+		 * but put NULL into the array in case cleanup is called early.
+		 */
+		freed = mp[i];
+		mp[i] = NULL;
+		for (j = 1; j < CB_NUM; j++) {
+			RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					      "Callback %zu not invoked on mempool %s destruction",
+					      j, name);
+			RTE_TEST_ASSERT_EQUAL(data[j].event,
+					      RTE_MEMPOOL_EVENT_DESTROY,
+					      "Wrong callback invoked, expected DESTROY");
+			RTE_TEST_ASSERT_EQUAL(data[j].mp, freed,
+					      "Callback %zu invoked for a wrong mempool instead of %s",
+					      j, name);
+		}
+		RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+				      "Unregistered callback 0 invoked on %s mempool destruction",
+				      name);
+	}
+
+	for (j = 1; j < CB_NUM; j++) {
+		ret = rte_mempool_event_callback_unregister
+					(test_mempool_events_cb, &data[j]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister the callback %zu: %s",
+				      j, rte_strerror(rte_errno));
+	}
+	return TEST_SUCCESS;
+
+fail:
+	for (j = 0; j < CB_NUM; j++)
+		rte_mempool_event_callback_unregister
+					(test_mempool_events_cb, &data[j]);
+	for (i = 0; i < MP_NUM; i++)
+		rte_mempool_free(mp[i]);
+	return TEST_FAILED;
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+}
+
+struct test_mempool_events_safety_data {
+	bool invoked;
+	int (*api_func)(rte_mempool_event_callback *func, void *user_data);
+	rte_mempool_event_callback *cb_func;
+	void *cb_user_data;
+	int ret;
+};
+
+static void
+test_mempool_events_safety_cb(enum rte_mempool_event event,
+			      struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_safety_data *data = user_data;
+
+	RTE_SET_USED(event);
+	RTE_SET_USED(mp);
+	data->invoked = true;
+	data->ret = data->api_func(data->cb_func, data->cb_user_data);
+}
+
+static int
+test_mempool_events_safety(void)
+{
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { \
+		ret = TEST_FAILED; \
+		goto exit; \
+	} while (0)
+
+	struct test_mempool_events_data data;
+	struct test_mempool_events_safety_data sdata[2];
+	struct rte_mempool *mp;
+	size_t i;
+	int ret;
+
+	/* removes itself */
+	sdata[0].api_func = rte_mempool_event_callback_unregister;
+	sdata[0].cb_func = test_mempool_events_safety_cb;
+	sdata[0].cb_user_data = &sdata[0];
+	sdata[0].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[0]);
+	/* inserts a callback after itself */
+	sdata[1].api_func = rte_mempool_event_callback_register;
+	sdata[1].cb_func = test_mempool_events_cb;
+	sdata[1].cb_user_data = &data;
+	sdata[1].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[1]);
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(rte_errno));
+
+	RTE_TEST_ASSERT_EQUAL(sdata[0].ret, 0, "Callback failed to unregister itself: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, 0, "Failed to insert a new callback: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data.invoked, false,
+			      "Inserted callback is invoked on mempool population");
+
+	memset(&data, 0, sizeof(data));
+	sdata[0].invoked = false;
+	rte_mempool_free(mp);
+	mp = NULL;
+	RTE_TEST_ASSERT_EQUAL(sdata[0].invoked, false,
+			      "Callback that unregistered itself was called");
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, -EEXIST,
+			      "New callback inserted twice");
+	RTE_TEST_ASSERT_EQUAL(data.invoked, true,
+			      "Inserted callback is not invoked on mempool destruction");
+
+	rte_mempool_event_callback_unregister(test_mempool_events_cb, &data);
+	for (i = 0; i < RTE_DIM(sdata); i++)
+		rte_mempool_event_callback_unregister
+				(test_mempool_events_safety_cb, &sdata[i]);
+	ret = TEST_SUCCESS;
+
+exit:
+	/* cleanup, don't care which callbacks are already removed */
+	rte_mempool_event_callback_unregister(test_mempool_events_cb, &data);
+	for (i = 0; i < RTE_DIM(sdata); i++)
+		rte_mempool_event_callback_unregister
+				(test_mempool_events_safety_cb, &sdata[i]);
+	/* in case of failure before the planned destruction */
+	rte_mempool_free(mp);
+	return ret;
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+}
+
 static int
 test_mempool(void)
 {
@@ -666,6 +906,14 @@ test_mempool(void)
 	if (test_mempool_basic(default_pool, 1) < 0)
 		GOTO_ERR(ret, err);
 
+	/* test mempool event callbacks */
+	if (test_mempool_events(rte_mempool_populate_default) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events(rte_mempool_populate_anon) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events_safety() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 607419ccaf..8810d08ab5 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -42,6 +42,18 @@ static struct rte_tailq_elem rte_mempool_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_mempool_tailq)
 
+TAILQ_HEAD(mempool_callback_list, rte_tailq_entry);
+
+static struct rte_tailq_elem callback_tailq = {
+	.name = "RTE_MEMPOOL_CALLBACK",
+};
+EAL_REGISTER_TAILQ(callback_tailq)
+
+/* Invoke all registered mempool event callbacks. */
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp);
+
 #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
 #define CALC_CACHE_FLUSHTHRESH(c)	\
 	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
@@ -360,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* Report the mempool as ready only when fully populated. */
+	if (mp->populated_size >= mp->size)
+		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
+
 	rte_mempool_trace_populate_iova(mp, vaddr, iova, len, free_cb, opaque);
 	return i;
 
@@ -722,6 +738,7 @@ rte_mempool_free(struct rte_mempool *mp)
 	}
 	rte_mcfg_tailq_write_unlock();
 
+	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
 	rte_mempool_trace_free(mp);
 	rte_mempool_free_memchunks(mp);
 	rte_mempool_ops_free(mp);
@@ -1356,3 +1373,110 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
 
 	rte_mcfg_mempool_read_unlock();
 }
+
+struct mempool_callback_data {
+	rte_mempool_event_callback *func;
+	void *user_data;
+};
+
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te;
+	void *tmp_te;
+
+	rte_mcfg_tailq_read_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback_data *cb = te->data;
+		rte_mcfg_tailq_read_unlock();
+		cb->func(event, mp, cb->user_data);
+		rte_mcfg_tailq_read_lock();
+	}
+	rte_mcfg_tailq_read_unlock();
+}
+
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback_data *cb;
+	void *tmp_te;
+	int ret;
+
+	if (func == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	rte_mcfg_tailq_write_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		cb = te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			ret = -EEXIST;
+			goto exit;
+		}
+	}
+
+	te = rte_zmalloc("mempool_cb_tail_entry", sizeof(*te), 0);
+	if (te == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback tailq entry!\n");
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb = rte_malloc("mempool_cb_data", sizeof(*cb), 0);
+	if (cb == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback!\n");
+		rte_free(te);
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb->func = func;
+	cb->user_data = user_data;
+	te->data = cb;
+	TAILQ_INSERT_TAIL(list, te, next);
+	ret = 0;
+
+exit:
+	rte_mcfg_tailq_write_unlock();
+	rte_errno = -ret;
+	return ret;
+}
+
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback_data *cb;
+	int ret = -ENOENT;
+
+	rte_mcfg_tailq_write_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH(te, list, next) {
+		cb = te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			TAILQ_REMOVE(list, te, next);
+			ret = 0;
+			break;
+		}
+	}
+	rte_mcfg_tailq_write_unlock();
+
+	if (ret == 0) {
+		rte_free(te);
+		rte_free(cb);
+	}
+	rte_errno = -ret;
+	return ret;
+}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 88bcbc51ef..5799d4a705 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1769,6 +1769,68 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
 int
 rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
 
+/**
+ * Mempool event type.
+ * @internal
+ */
+enum rte_mempool_event {
+	/** Occurs after a mempool is fully populated. */
+	RTE_MEMPOOL_EVENT_READY = 0,
+	/** Occurs before the destruction of a mempool begins. */
+	RTE_MEMPOOL_EVENT_DESTROY = 1,
+};
+
+/**
+ * @internal
+ * Mempool event callback.
+ *
+ * rte_mempool_event_callback_register() may be called from within the callback,
+ * but the callbacks registered this way will not be invoked for the same event.
+ * rte_mempool_event_callback_unregister() may only be safely called
+ * to remove the running callback.
+ */
+typedef void (rte_mempool_event_callback)(
+		enum rte_mempool_event event,
+		struct rte_mempool *mp,
+		void *user_data);
+
+/**
+ * @internal
+ * Register a callback function invoked on mempool life cycle event.
+ * The function will be invoked in the process
+ * that performs an action which triggers the callback.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data);
+
+/**
+ * @internal
+ * Unregister a callback added with rte_mempool_event_callback_register().
+ * @p func and @p user_data must exactly match registration parameters.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/mempool/version.map b/lib/mempool/version.map
index 9f77da6fff..1b7d7c5456 100644
--- a/lib/mempool/version.map
+++ b/lib/mempool/version.map
@@ -64,3 +64,11 @@ EXPERIMENTAL {
 	__rte_mempool_trace_ops_free;
 	__rte_mempool_trace_set_ops_byname;
 };
+
+INTERNAL {
+	global:
+
+	# added in 21.11
+	rte_mempool_event_callback_register;
+	rte_mempool_event_callback_unregister;
+};
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v7 2/4] mempool: add non-IO flag
  2021-10-18 10:01           ` [dpdk-dev] [PATCH v7 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-10-18 10:01             ` Dmitry Kozlyuk
  2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
                               ` (2 subsequent siblings)
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 10:01 UTC (permalink / raw)
  To: dev
  Cc: David Marchand, Matan Azrad, Andrew Rybchenko, Maryam Tahhan,
	Reshma Pattan, Olivier Matz

Mempool is a generic allocator that is not necessarily used
for device IO operations, and its memory is not necessarily used for DMA.
Add the MEMPOOL_F_NON_IO flag to mark such mempools automatically:
a) if their objects are not contiguous;
b) if IOVA is not available for any object.
Other components can inspect this flag
in order to optimize their memory management.

Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html
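
For illustration only (not part of the patch), a component that manages
DMA mappings could use the flag as in this minimal sketch; MEMPOOL_F_NON_IO
is the flag added here, pool_needs_dma_setup() is a hypothetical helper:

#include <stdbool.h>

#include <rte_mempool.h>

static bool
pool_needs_dma_setup(const struct rte_mempool *mp)
{
	/* Objects of a non-IO pool are never handed to a device. */
	return (mp->flags & MEMPOOL_F_NON_IO) == 0;
}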

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/proc-info/main.c                   |   6 +-
 app/test/test_mempool.c                | 112 +++++++++++++++++++++++++
 doc/guides/rel_notes/release_21_11.rst |   3 +
 lib/mempool/rte_mempool.c              |  10 +++
 lib/mempool/rte_mempool.h              |   2 +
 5 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/app/proc-info/main.c b/app/proc-info/main.c
index a8e928fa9f..8ec9cadd79 100644
--- a/app/proc-info/main.c
+++ b/app/proc-info/main.c
@@ -1295,7 +1295,8 @@ show_mempool(char *name)
 				"\t  -- No cache align (%c)\n"
 				"\t  -- SP put (%c), SC get (%c)\n"
 				"\t  -- Pool created (%c)\n"
-				"\t  -- No IOVA config (%c)\n",
+				"\t  -- No IOVA config (%c)\n"
+				"\t  -- Not used for IO (%c)\n",
 				ptr->name,
 				ptr->socket_id,
 				(flags & MEMPOOL_F_NO_SPREAD) ? 'y' : 'n',
@@ -1303,7 +1304,8 @@ show_mempool(char *name)
 				(flags & MEMPOOL_F_SP_PUT) ? 'y' : 'n',
 				(flags & MEMPOOL_F_SC_GET) ? 'y' : 'n',
 				(flags & MEMPOOL_F_POOL_CREATED) ? 'y' : 'n',
-				(flags & MEMPOOL_F_NO_IOVA_CONTIG) ? 'y' : 'n');
+				(flags & MEMPOOL_F_NO_IOVA_CONTIG) ? 'y' : 'n',
+				(flags & MEMPOOL_F_NON_IO) ? 'y' : 'n');
 			printf("  - Size %u Cache %u element %u\n"
 				"  - header %u trailer %u\n"
 				"  - private data size %u\n",
diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index c39c83256e..9136e17374 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -12,6 +12,7 @@
 #include <sys/queue.h>
 
 #include <rte_common.h>
+#include <rte_eal_paging.h>
 #include <rte_log.h>
 #include <rte_debug.h>
 #include <rte_errno.h>
@@ -729,6 +730,109 @@ test_mempool_events_safety(void)
 #pragma pop_macro("RTE_TEST_TRACE_FAILURE")
 }
 
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { \
+		ret = TEST_FAILED; \
+		goto exit; \
+	} while (0)
+
+static int
+test_mempool_flag_non_io_set_when_no_iova_contig_set(void)
+{
+	struct rte_mempool *mp = NULL;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, MEMPOOL_F_NO_IOVA_CONTIG);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(), NULL);
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when NO_IOVA_CONTIG is set");
+	ret = TEST_SUCCESS;
+exit:
+	rte_mempool_free(mp);
+	return ret;
+}
+
+static int
+test_mempool_flag_non_io_unset_when_populated_with_valid_iova(void)
+{
+	const struct rte_memzone *mz;
+	void *virt;
+	rte_iova_t iova;
+	size_t page_size = RTE_PGSIZE_2M;
+	struct rte_mempool *mp = NULL;
+	int ret;
+
+	mz = rte_memzone_reserve("test_mempool", 3 * page_size, SOCKET_ID_ANY,
+				 RTE_MEMZONE_IOVA_CONTIG);
+	RTE_TEST_ASSERT_NOT_NULL(mz, "Cannot allocate memory");
+	virt = mz->addr;
+	iova = rte_mem_virt2iova(virt);
+	RTE_TEST_ASSERT_NOT_EQUAL(iova,  RTE_BAD_IOVA, "Cannot get IOVA");
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+
+	ret = rte_mempool_populate_iova(mp, RTE_PTR_ADD(virt, 1 * page_size),
+					RTE_BAD_IOVA, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when mempool is populated with only RTE_BAD_IOVA");
+
+	ret = rte_mempool_populate_iova(mp, virt, iova, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is not unset when mempool is populated with valid IOVA");
+
+	ret = rte_mempool_populate_iova(mp, RTE_PTR_ADD(virt, 2 * page_size),
+					RTE_BAD_IOVA, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is set even when some objects have valid IOVA");
+	ret = TEST_SUCCESS;
+
+exit:
+	rte_mempool_free(mp);
+	rte_memzone_free(mz);
+	return ret;
+}
+
+static int
+test_mempool_flag_non_io_unset_by_default(void)
+{
+	struct rte_mempool *mp;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is set by default");
+	ret = TEST_SUCCESS;
+exit:
+	rte_mempool_free(mp);
+	return ret;
+}
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+
 static int
 test_mempool(void)
 {
@@ -914,6 +1018,14 @@ test_mempool(void)
 	if (test_mempool_events_safety() < 0)
 		GOTO_ERR(ret, err);
 
+	/* test NON_IO flag inference */
+	if (test_mempool_flag_non_io_unset_by_default() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_set_when_no_iova_contig_set() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_unset_when_populated_with_valid_iova() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 4c56cdfeaa..39a8a3d950 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -229,6 +229,9 @@ API Changes
   the crypto/security operation. This field will be used to communicate
   events such as soft expiry with IPsec in lookaside mode.
 
+* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
+  that objects from this pool will not be used for device IO (e.g. DMA).
+
 
 ABI Changes
 -----------
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 8810d08ab5..7d7d97d85d 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -372,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* At least some objects in the pool can now be used for IO. */
+	if (iova != RTE_BAD_IOVA)
+		mp->flags &= ~MEMPOOL_F_NON_IO;
+
 	/* Report the mempool as ready only when fully populated. */
 	if (mp->populated_size >= mp->size)
 		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
@@ -851,6 +855,12 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 		return NULL;
 	}
 
+	/*
+	 * No objects in the pool can be used for IO until it's populated
+	 * with at least some objects with valid IOVA.
+	 */
+	flags |= MEMPOOL_F_NON_IO;
+
 	/* "no cache align" imply "no spread" */
 	if (flags & MEMPOOL_F_NO_CACHE_ALIGN)
 		flags |= MEMPOOL_F_NO_SPREAD;
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 5799d4a705..b2e20c8855 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -257,6 +257,8 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
+/** Internal: no object from the pool can be used for device IO (DMA). */
+#define MEMPOOL_F_NON_IO         0x0040
 
 /**
  * @internal When debug is enabled, store some statistics.
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v7 3/4] common/mlx5: add mempool registration facilities
  2021-10-18 10:01           ` [dpdk-dev] [PATCH v7 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 1/4] mempool: add event callbacks Dmitry Kozlyuk
  2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 2/4] mempool: add non-IO flag Dmitry Kozlyuk
@ 2021-10-18 10:01             ` Dmitry Kozlyuk
  2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
  2021-10-18 14:40             ` [dpdk-dev] [PATCH v8 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 10:01 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko, Ray Kinsella, Anatoly Burakov

Add an internal API to register mempools, that is, to create memory
regions (MR) for their memory and store them in a separate database.
The implementation handles multi-process coordination, so that class
drivers don't need to. Each protection domain has its own database.
Memory regions can be shared within a database if they represent
a single hugepage covering one or more mempools entirely.

Add an internal API to look up an MR key for an address that belongs
to a known mempool. It is the responsibility of the class driver
to extract the mempool from an mbuf.
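
A minimal sketch of the intended register-then-lookup flow (illustrative
only; mlx5_mr_mempool_register() and mlx5_mr_mempool2mr_bh() are the
functions added below, while the caller-provided pd, mp_id, shared cache
and per-queue MR control structure are assumed to exist in a class driver):

#include <rte_errno.h>
#include <rte_mbuf.h>

#include "mlx5_common_mp.h"
#include "mlx5_common_mr.h"

/* Return the MR key (lkey) for an mbuf whose buffer comes from its pool. */
static uint32_t
example_mb2lkey(struct mlx5_mr_share_cache *share_cache, void *pd,
		struct mlx5_mp_id *mp_id, struct mlx5_mr_ctrl *mr_ctrl,
		struct rte_mbuf *mb)
{
	struct rte_mempool *mp = mb->pool;

	/* Register once per protection domain; EEXIST is not an error. */
	if (mlx5_mr_mempool_register(share_cache, pd, mp, mp_id) < 0 &&
	    rte_errno != EEXIST)
		return UINT32_MAX;
	/* Look up the key for an address known to belong to the mempool. */
	return mlx5_mr_mempool2mr_bh(share_cache, mr_ctrl, mp,
				     (uintptr_t)mb->buf_addr);
}

In the driver itself the registration happens on port start or from the
mempool event callback, and the per-queue linear cache is consulted before
falling back to this bottom-half lookup.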

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 drivers/common/mlx5/mlx5_common_mp.c |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h |  14 +
 drivers/common/mlx5/mlx5_common_mr.c | 580 +++++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h |  17 +
 drivers/common/mlx5/version.map      |   5 +
 5 files changed, 666 insertions(+)

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
index 673a7c31de..6dfc5535e0 100644
--- a/drivers/common/mlx5/mlx5_common_mp.c
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -54,6 +54,56 @@ mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
 	return ret;
 }
 
+/**
+ * @param mp_id
+ *   ID of the MP process.
+ * @param share_cache
+ *   Shared MR cache.
+ * @param pd
+ *   Protection domain.
+ * @param mempool
+ *   Mempool to register or unregister.
+ * @param reg
+ *   True to register the mempool, False to unregister.
+ */
+int
+mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_arg_mempool_reg *arg = &req->args.mempool_reg;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	enum mlx5_mp_req_type type;
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	type = reg ? MLX5_MP_REQ_MEMPOOL_REGISTER :
+		     MLX5_MP_REQ_MEMPOOL_UNREGISTER;
+	mp_init_msg(mp_id, &mp_req, type);
+	arg->share_cache = share_cache;
+	arg->pd = pd;
+	arg->mempool = mempool;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	mlx5_free(mp_rep.msgs);
+	return ret;
+}
+
 /**
  * Request Verbs queue state modification to the primary process.
  *
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
index 6829141fc7..527bf3cad8 100644
--- a/drivers/common/mlx5/mlx5_common_mp.h
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -14,6 +14,8 @@
 enum mlx5_mp_req_type {
 	MLX5_MP_REQ_VERBS_CMD_FD = 1,
 	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_MEMPOOL_REGISTER,
+	MLX5_MP_REQ_MEMPOOL_UNREGISTER,
 	MLX5_MP_REQ_START_RXTX,
 	MLX5_MP_REQ_STOP_RXTX,
 	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
@@ -33,6 +35,12 @@ struct mlx5_mp_arg_queue_id {
 	uint16_t queue_id; /* DPDK queue ID. */
 };
 
+struct mlx5_mp_arg_mempool_reg {
+	struct mlx5_mr_share_cache *share_cache;
+	void *pd; /* NULL for MLX5_MP_REQ_MEMPOOL_UNREGISTER */
+	struct rte_mempool *mempool;
+};
+
 /* Parameters for IPC. */
 struct mlx5_mp_param {
 	enum mlx5_mp_req_type type;
@@ -41,6 +49,8 @@ struct mlx5_mp_param {
 	RTE_STD_C11
 	union {
 		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_mempool_reg mempool_reg;
+		/* MLX5_MP_REQ_MEMPOOL_(UN)REGISTER */
 		struct mlx5_mp_arg_queue_state_modify state_modify;
 		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
 		struct mlx5_mp_arg_queue_id queue_id;
@@ -91,6 +101,10 @@ void mlx5_mp_uninit_secondary(const char *name);
 __rte_internal
 int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
 __rte_internal
+int mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg);
+__rte_internal
 int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
 				   struct mlx5_mp_arg_queue_state_modify *sm);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
index 98fe8698e2..2e039a4e70 100644
--- a/drivers/common/mlx5/mlx5_common_mr.c
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -2,7 +2,10 @@
  * Copyright 2016 6WIND S.A.
  * Copyright 2020 Mellanox Technologies, Ltd
  */
+#include <stddef.h>
+
 #include <rte_eal_memconfig.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_mempool.h>
 #include <rte_malloc.h>
@@ -21,6 +24,29 @@ struct mr_find_contig_memsegs_data {
 	const struct rte_memseg_list *msl;
 };
 
+/* Virtual memory range. */
+struct mlx5_range {
+	uintptr_t start;
+	uintptr_t end;
+};
+
+/** Memory region for a mempool. */
+struct mlx5_mempool_mr {
+	struct mlx5_pmd_mr pmd_mr;
+	uint32_t refcnt; /**< Number of mempools sharing this MR. */
+};
+
+/* Mempool registration. */
+struct mlx5_mempool_reg {
+	LIST_ENTRY(mlx5_mempool_reg) next;
+	/** Registered mempool, used to designate registrations. */
+	struct rte_mempool *mp;
+	/** Memory regions for the address ranges of the mempool. */
+	struct mlx5_mempool_mr *mrs;
+	/** Number of memory regions. */
+	unsigned int mrs_n;
+};
+
 /**
  * Expand B-tree table to a given size. Can't be called with holding
  * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
@@ -1191,3 +1217,557 @@ mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
 	rte_rwlock_read_unlock(&share_cache->rwlock);
 #endif
 }
+
+static int
+mlx5_range_compare_start(const void *lhs, const void *rhs)
+{
+	const struct mlx5_range *r1 = lhs, *r2 = rhs;
+
+	if (r1->start > r2->start)
+		return 1;
+	else if (r1->start < r2->start)
+		return -1;
+	return 0;
+}
+
+static void
+mlx5_range_from_mempool_chunk(struct rte_mempool *mp, void *opaque,
+			      struct rte_mempool_memhdr *memhdr,
+			      unsigned int idx)
+{
+	struct mlx5_range *ranges = opaque, *range = &ranges[idx];
+	uint64_t page_size = rte_mem_page_size();
+
+	RTE_SET_USED(mp);
+	range->start = RTE_ALIGN_FLOOR((uintptr_t)memhdr->addr, page_size);
+	range->end = RTE_ALIGN_CEIL(range->start + memhdr->len, page_size);
+}
+
+/**
+ * Get VA-contiguous ranges of the mempool memory.
+ * Each range start and end is aligned to the system page size.
+ *
+ * @param[in] mp
+ *   Analyzed mempool.
+ * @param[out] out
+ *   Receives the ranges, caller must release it with free().
+ * @param[out] out_n
+ *   Receives the number of @p out elements.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_get_mempool_ranges(struct rte_mempool *mp, struct mlx5_range **out,
+			unsigned int *out_n)
+{
+	struct mlx5_range *chunks;
+	unsigned int chunks_n = mp->nb_mem_chunks, contig_n, i;
+
+	/* Collect page-aligned memory ranges of the mempool. */
+	chunks = calloc(sizeof(chunks[0]), chunks_n);
+	if (chunks == NULL)
+		return -1;
+	rte_mempool_mem_iter(mp, mlx5_range_from_mempool_chunk, chunks);
+	/* Merge adjacent chunks and place them at the beginning. */
+	qsort(chunks, chunks_n, sizeof(chunks[0]), mlx5_range_compare_start);
+	contig_n = 1;
+	for (i = 1; i < chunks_n; i++)
+		if (chunks[i - 1].end != chunks[i].start) {
+			chunks[contig_n - 1].end = chunks[i - 1].end;
+			chunks[contig_n] = chunks[i];
+			contig_n++;
+		}
+	/* Extend the last contiguous chunk to the end of the mempool. */
+	chunks[contig_n - 1].end = chunks[i - 1].end;
+	*out = chunks;
+	*out_n = contig_n;
+	return 0;
+}
+
+/**
+ * Analyze mempool memory to select memory ranges to register.
+ *
+ * @param[in] mp
+ *   Mempool to analyze.
+ * @param[out] out
+ *   Receives memory ranges to register, aligned to the system page size.
+ *   The caller must release them with free().
+ * @param[out] out_n
+ *   Receives the number of @p out items.
+ * @param[out] share_hugepage
+ *   Receives True if the entire pool resides within a single hugepage.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_mempool_reg_analyze(struct rte_mempool *mp, struct mlx5_range **out,
+			 unsigned int *out_n, bool *share_hugepage)
+{
+	struct mlx5_range *ranges = NULL;
+	unsigned int i, ranges_n = 0;
+	struct rte_memseg_list *msl;
+
+	if (mlx5_get_mempool_ranges(mp, &ranges, &ranges_n) < 0) {
+		DRV_LOG(ERR, "Cannot get address ranges for mempool %s",
+			mp->name);
+		return -1;
+	}
+	/* Check if the hugepage of the pool can be shared. */
+	*share_hugepage = false;
+	msl = rte_mem_virt2memseg_list((void *)ranges[0].start);
+	if (msl != NULL) {
+		uint64_t hugepage_sz = 0;
+
+		/* Check that all ranges are on pages of the same size. */
+		for (i = 0; i < ranges_n; i++) {
+			if (hugepage_sz != 0 && hugepage_sz != msl->page_sz)
+				break;
+			hugepage_sz = msl->page_sz;
+		}
+		if (i == ranges_n) {
+			/*
+			 * If the entire pool is within one hugepage,
+			 * combine all ranges into one of the hugepage size.
+			 */
+			uintptr_t reg_start = ranges[0].start;
+			uintptr_t reg_end = ranges[ranges_n - 1].end;
+			uintptr_t hugepage_start =
+				RTE_ALIGN_FLOOR(reg_start, hugepage_sz);
+			uintptr_t hugepage_end = hugepage_start + hugepage_sz;
+			if (reg_end < hugepage_end) {
+				ranges[0].start = hugepage_start;
+				ranges[0].end = hugepage_end;
+				ranges_n = 1;
+				*share_hugepage = true;
+			}
+		}
+	}
+	*out = ranges;
+	*out_n = ranges_n;
+	return 0;
+}
+
+/** Create a registration object for the mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_create(struct rte_mempool *mp, unsigned int mrs_n)
+{
+	struct mlx5_mempool_reg *mpr = NULL;
+
+	mpr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+			  sizeof(*mpr) + mrs_n * sizeof(mpr->mrs[0]),
+			  RTE_CACHE_LINE_SIZE, SOCKET_ID_ANY);
+	if (mpr == NULL) {
+		DRV_LOG(ERR, "Cannot allocate mempool %s registration object",
+			mp->name);
+		return NULL;
+	}
+	mpr->mp = mp;
+	mpr->mrs = (struct mlx5_mempool_mr *)(mpr + 1);
+	mpr->mrs_n = mrs_n;
+	return mpr;
+}
+
+/**
+ * Destroy a mempool registration object.
+ *
+ * @param standalone
+ *   Whether @p mpr owns its MRs exclusively, i.e. they are not shared.
+ */
+static void
+mlx5_mempool_reg_destroy(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mempool_reg *mpr, bool standalone)
+{
+	if (standalone) {
+		unsigned int i;
+
+		for (i = 0; i < mpr->mrs_n; i++)
+			share_cache->dereg_mr_cb(&mpr->mrs[i].pmd_mr);
+	}
+	mlx5_free(mpr);
+}
+
+/** Find registration object of a mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_lookup(struct mlx5_mr_share_cache *share_cache,
+			struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp)
+			break;
+	return mpr;
+}
+
+/** Increment reference counters of MRs used in the registration. */
+static void
+mlx5_mempool_reg_attach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		__atomic_add_fetch(&mpr->mrs[i].refcnt, 1, __ATOMIC_RELAXED);
+}
+
+/**
+ * Decrement reference counters of MRs used in the registration.
+ *
+ * @return True if no more references to @p mpr MRs exist, False otherwise.
+ */
+static bool
+mlx5_mempool_reg_detach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+	bool ret = false;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		ret |= __atomic_sub_fetch(&mpr->mrs[i].refcnt, 1,
+					  __ATOMIC_RELAXED) == 0;
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache,
+				 void *pd, struct rte_mempool *mp)
+{
+	struct mlx5_range *ranges = NULL;
+	struct mlx5_mempool_reg *mpr, *new_mpr;
+	unsigned int i, ranges_n;
+	bool share_hugepage;
+	int ret = -1;
+
+	/* Early check to avoid unnecessary creation of MRs. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+	if (mlx5_mempool_reg_analyze(mp, &ranges, &ranges_n,
+				     &share_hugepage) < 0) {
+		DRV_LOG(ERR, "Cannot get mempool %s memory ranges", mp->name);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	new_mpr = mlx5_mempool_reg_create(mp, ranges_n);
+	if (new_mpr == NULL) {
+		DRV_LOG(ERR,
+			"Cannot create a registration object for mempool %s in PD %p",
+			mp->name, pd);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	/*
+	 * If the entire mempool fits in a single hugepage, the MR for this
+	 * hugepage can be shared across mempools that also fit in it.
+	 */
+	if (share_hugepage) {
+		rte_rwlock_write_lock(&share_cache->rwlock);
+		LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next) {
+			if (mpr->mrs[0].pmd_mr.addr == (void *)ranges[0].start)
+				break;
+		}
+		if (mpr != NULL) {
+			new_mpr->mrs = mpr->mrs;
+			mlx5_mempool_reg_attach(new_mpr);
+			LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+					 new_mpr, next);
+		}
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		if (mpr != NULL) {
+			DRV_LOG(DEBUG, "Shared MR %#x in PD %p for mempool %s with mempool %s",
+				mpr->mrs[0].pmd_mr.lkey, pd, mp->name,
+				mpr->mp->name);
+			ret = 0;
+			goto exit;
+		}
+	}
+	for (i = 0; i < ranges_n; i++) {
+		struct mlx5_mempool_mr *mr = &new_mpr->mrs[i];
+		const struct mlx5_range *range = &ranges[i];
+		size_t len = range->end - range->start;
+
+		if (share_cache->reg_mr_cb(pd, (void *)range->start, len,
+		    &mr->pmd_mr) < 0) {
+			DRV_LOG(ERR,
+				"Failed to create an MR in PD %p for address range "
+				"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+				pd, range->start, range->end, len, mp->name);
+			break;
+		}
+		DRV_LOG(DEBUG,
+			"Created a new MR %#x in PD %p for address range "
+			"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+			mr->pmd_mr.lkey, pd, range->start, range->end, len,
+			mp->name);
+	}
+	if (i != ranges_n) {
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EINVAL;
+		goto exit;
+	}
+	/* Concurrent registration is not supposed to happen. */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr == NULL) {
+		mlx5_mempool_reg_attach(new_mpr);
+		LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+				 new_mpr, next);
+		ret = 0;
+	}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+exit:
+	free(ranges);
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_secondary(struct mlx5_mr_share_cache *share_cache,
+				   void *pd, struct rte_mempool *mp,
+				   struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, pd, mp, true);
+}
+
+/**
+ * Register the memory of a mempool in the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param pd
+ *   Protection domain object.
+ * @param mp
+ *   Mempool to register.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_register_primary(share_cache, pd, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_register_secondary(share_cache, pd, mp,
+							  mp_id);
+	default:
+		return -1;
+	}
+}
+
+static int
+mlx5_mr_mempool_unregister_primary(struct mlx5_mr_share_cache *share_cache,
+				   struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+	bool standalone = false;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp) {
+			LIST_REMOVE(mpr, next);
+			standalone = mlx5_mempool_reg_detach(mpr);
+			if (standalone)
+				/*
+				 * The unlock operation below provides a memory
+				 * barrier due to its store-release semantics.
+				 */
+				++share_cache->dev_gen;
+			break;
+		}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr == NULL) {
+		rte_errno = ENOENT;
+		return -1;
+	}
+	mlx5_mempool_reg_destroy(share_cache, mpr, standalone);
+	return 0;
+}
+
+static int
+mlx5_mr_mempool_unregister_secondary(struct mlx5_mr_share_cache *share_cache,
+				     struct rte_mempool *mp,
+				     struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, NULL, mp, false);
+}
+
+/**
+ * Unregister the memory of a mempool from the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param mp
+ *   Mempool to unregister.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_unregister_primary(share_cache, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_unregister_secondary(share_cache, mp,
+							    mp_id);
+	default:
+		return -1;
+	}
+}
+
+/**
+ * Look up an MR key by an address in a registered mempool.
+ *
+ * @param mpr
+ *   Mempool registration object.
+ * @param addr
+ *   Address within the mempool.
+ * @param entry
+ *   Bottom-half cache entry to fill.
+ *
+ * @return
+ *   MR key or UINT32_MAX on failure, which can only happen
+ *   if the address is not from within the mempool.
+ */
+static uint32_t
+mlx5_mempool_reg_addr2mr(struct mlx5_mempool_reg *mpr, uintptr_t addr,
+			 struct mr_cache_entry *entry)
+{
+	uint32_t lkey = UINT32_MAX;
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++) {
+		const struct mlx5_pmd_mr *mr = &mpr->mrs[i].pmd_mr;
+		uintptr_t mr_addr = (uintptr_t)mr->addr;
+
+		if (mr_addr <= addr) {
+			lkey = rte_cpu_to_be_32(mr->lkey);
+			entry->start = mr_addr;
+			entry->end = mr_addr + mr->len;
+			entry->lkey = lkey;
+			break;
+		}
+	}
+	return lkey;
+}
+
+/**
+ * Update bottom-half cache from the list of mempool registrations.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param entry
+ *   Pointer to an entry in the bottom-half cache to update
+ *   with the MR lkey looked up.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+static uint32_t
+mlx5_lookup_mempool_regs(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mr_ctrl *mr_ctrl,
+			 struct mr_cache_entry *entry,
+			 struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	struct mlx5_mempool_reg *mpr;
+	uint32_t lkey = UINT32_MAX;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in mempool registrations. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr != NULL)
+		lkey = mlx5_mempool_reg_addr2mr(mpr, addr, entry);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/*
+	 * Update local cache. Even if it fails, return the found entry
+	 * to update top-half cache. Next time, this entry will be found
+	 * in the global cache.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half lookup for the address from the mempool.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+uint32_t
+mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+		      struct mlx5_mr_ctrl *mr_ctrl,
+		      struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		lkey = mlx5_lookup_mempool_regs(share_cache, mr_ctrl, repl,
+						mp, addr);
+		/* Can only fail if the address is not from the mempool. */
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
index 6e465a05e9..685ac98e08 100644
--- a/drivers/common/mlx5/mlx5_common_mr.h
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -13,6 +13,7 @@
 
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_mbuf.h>
 #include <rte_memory.h>
 
 #include "mlx5_glue.h"
@@ -75,6 +76,7 @@ struct mlx5_mr_ctrl {
 } __rte_packed;
 
 LIST_HEAD(mlx5_mr_list, mlx5_mr);
+LIST_HEAD(mlx5_mempool_reg_list, mlx5_mempool_reg);
 
 /* Global per-device MR cache. */
 struct mlx5_mr_share_cache {
@@ -83,6 +85,7 @@ struct mlx5_mr_share_cache {
 	struct mlx5_mr_btree cache; /* Global MR cache table. */
 	struct mlx5_mr_list mr_list; /* Registered MR list. */
 	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+	struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */
 	mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
 	mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 } __rte_packed;
@@ -136,6 +139,10 @@ uint32_t mlx5_mr_addr2mr_bh(void *pd, struct mlx5_mp_id *mp_id,
 			    struct mlx5_mr_ctrl *mr_ctrl,
 			    uintptr_t addr, unsigned int mr_ext_memseg_en);
 __rte_internal
+uint32_t mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+			       struct mlx5_mr_ctrl *mr_ctrl,
+			       struct rte_mempool *mp, uintptr_t addr);
+__rte_internal
 void mlx5_mr_release_cache(struct mlx5_mr_share_cache *mr_cache);
 __rte_internal
 void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
@@ -179,4 +186,14 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr);
 __rte_internal
 void
 mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb);
+
+__rte_internal
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+__rte_internal
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+
 #endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index d3c5040aac..85100d5afb 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -152,4 +152,9 @@ INTERNAL {
 	mlx5_realloc;
 
 	mlx5_translate_port_name; # WINDOWS_NO_EXPORT
+
+	mlx5_mr_mempool_register;
+	mlx5_mr_mempool_unregister;
+	mlx5_mp_req_mempool_reg;
+	mlx5_mr_mempool2mr_bh;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v7 4/4] net/mlx5: support mempool registration
  2021-10-18 10:01           ` [dpdk-dev] [PATCH v7 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                               ` (2 preceding siblings ...)
  2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
@ 2021-10-18 10:01             ` Dmitry Kozlyuk
  2021-10-18 14:40             ` [dpdk-dev] [PATCH v8 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 10:01 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko

When the first port in a given protection domain (PD) starts,
install a mempool event callback for this PD and register all existing
memory regions (MR) for it. When the last port in a PD closes,
remove the callback and unregister all mempools for this PD.
This behavior can be switched off with a new devarg: mr_mempool_reg_en.

On TX slow path, i.e. when an MR key for the address of the buffer
to send is not in the local cache, first try to retrieve it from
the database of registered mempools. Direct and indirect mbufs are supported,
as well as mbufs with externally attached buffers from the MLX5 MPRQ feature.
Lookup in the database of non-mempool memory is used as the last resort.

RX mempools are registered regardless of the devarg value.
On the RX data path, only the local cache and the mempool database are used.
If implicit mempool registration is disabled, these mempools
are unregistered at port stop, releasing the MRs.
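
For orientation, below is a condensed sketch of the subscription flow this
patch adds (see mlx5_dev_ctx_shared_mempool_subscribe() in the mlx5.c hunk
further down). It is illustrative only: the helper names are made up for the
sketch, it relies on the driver-internal types of this series, assumes the
primary process (so the multi-process ID may be NULL), and omits the
Rx-only tracking mode and most error handling.

/* Register a pre-existing mempool for the PD of the shared context. */
static void
register_existing_cb(struct rte_mempool *mp, void *arg)
{
	struct mlx5_dev_ctx_shared *sh = arg;

	/* A repeated registration returning EEXIST is not an error. */
	(void)mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp, NULL);
}

static int
subscribe_sketch(struct mlx5_dev_ctx_shared *sh)
{
	/* Track READY/DESTROY events of mempools created later. */
	int ret = rte_mempool_event_callback_register
			(mlx5_dev_ctx_shared_mempool_event_cb, sh);

	if (ret != 0 && rte_errno != EEXIST)
		return ret;
	/* First subscription for this context: register existing mempools. */
	if (ret == 0)
		rte_mempool_walk(register_existing_cb, sh);
	return 0;
}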

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  13 +++
 doc/guides/rel_notes/release_21_11.rst |   6 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 +++++++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h                |  10 ++
 drivers/net/mlx5/mlx5_mr.c             | 120 +++++--------------
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 ++--
 drivers/net/mlx5/mlx5_rxq.c            |  13 +++
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++++++++++--
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 12 files changed, 347 insertions(+), 116 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index bae73f42d8..106e32e1c4 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1001,6 +1001,19 @@ Driver options
 
   Enabled by default.
 
+- ``mr_mempool_reg_en`` parameter [int]
+
+  A nonzero value enables implicit registration of DMA memory of all mempools
+  except those having ``MEMPOOL_F_NON_IO``. This flag is set automatically
+  for mempools populated with non-contiguous objects or those without IOVA.
+  The effect is that when a packet from a mempool is transmitted,
+  its memory is already registered for DMA in the PMD and no registration
+  will happen on the data path. The tradeoff is extra work on the creation
+  of each mempool and increased HW resource use if some mempools
+  are not used with MLX5 devices.
+
+  Enabled by default.
+
 - ``representor`` parameter [list]
 
   This parameter can be used to instantiate DPDK Ethernet devices from
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 39a8a3d950..f141999a0d 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -159,6 +159,12 @@ New Features
   * Added tests to verify tunnel header verification in IPsec inbound.
   * Added tests to verify inner checksum.
 
+* **Updated Mellanox mlx5 driver.**
+
+  Updated the Mellanox mlx5 driver with new features and improvements, including:
+
+  * Added implicit mempool registration to avoid data path hiccups (opt-out).
+
 
 Removed Items
 -------------
diff --git a/drivers/net/mlx5/linux/mlx5_mp_os.c b/drivers/net/mlx5/linux/mlx5_mp_os.c
index 3a4aa766f8..d2ac375a47 100644
--- a/drivers/net/mlx5/linux/mlx5_mp_os.c
+++ b/drivers/net/mlx5/linux/mlx5_mp_os.c
@@ -20,6 +20,45 @@
 #include "mlx5_tx.h"
 #include "mlx5_utils.h"
 
+/**
+ * Handle a port-agnostic message.
+ *
+ * @return
+ *   0 on success, 1 when message is not port-agnostic, (-1) on error.
+ */
+static int
+mlx5_mp_os_handle_port_agnostic(const struct rte_mp_msg *mp_msg,
+				const void *peer)
+{
+	struct rte_mp_msg mp_res;
+	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
+	const struct mlx5_mp_param *param =
+		(const struct mlx5_mp_param *)mp_msg->param;
+	const struct mlx5_mp_arg_mempool_reg *mpr;
+	struct mlx5_mp_id mp_id;
+
+	switch (param->type) {
+	case MLX5_MP_REQ_MEMPOOL_REGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_register(mpr->share_cache,
+						       mpr->pd, mpr->mempool,
+						       NULL);
+		return rte_mp_reply(&mp_res, peer);
+	case MLX5_MP_REQ_MEMPOOL_UNREGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_unregister(mpr->share_cache,
+							 mpr->mempool, NULL);
+		return rte_mp_reply(&mp_res, peer);
+	default:
+		return 1;
+	}
+	return -1;
+}
+
 int
 mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
@@ -34,6 +73,11 @@ mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/* Port-agnostic messages. */
+	ret = mlx5_mp_os_handle_port_agnostic(mp_msg, peer);
+	if (ret <= 0)
+		return ret;
+	/* Port-specific messages. */
 	if (!rte_eth_dev_is_valid_port(param->port_id)) {
 		rte_errno = ENODEV;
 		DRV_LOG(ERR, "port %u invalid port ID", param->port_id);
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 3746057673..e036ed1435 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1034,8 +1034,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
-		mp_id.port_id = eth_dev->data->port_id;
-		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+		mlx5_mp_id_init(&mp_id, eth_dev->data->port_id);
 		/* Receive command fd from primary process */
 		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
@@ -2133,6 +2132,7 @@ mlx5_os_config_default(struct mlx5_dev_config *config)
 	config->txqs_inline = MLX5_ARG_UNSET;
 	config->vf_nl_en = 1;
 	config->mr_ext_memseg_en = 1;
+	config->mr_mempool_reg_en = 1;
 	config->mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	config->mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	config->dv_esw_en = 1;
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 45ccfe2784..1e1b8b736b 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -181,6 +181,9 @@
 /* Device parameter to configure allow or prevent duplicate rules pattern. */
 #define MLX5_ALLOW_DUPLICATE_PATTERN "allow_duplicate_pattern"
 
+/* Device parameter to configure implicit registration of mempool memory. */
+#define MLX5_MR_MEMPOOL_REG_EN "mr_mempool_reg_en"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1088,6 +1091,141 @@ mlx5_alloc_rxtx_uars(struct mlx5_dev_ctx_shared *sh,
 	return err;
 }
 
+/**
+ * Unregister the mempool from the protection domain.
+ *
+ * @param sh
+ *   Pointer to the device shared context.
+ * @param mp
+ *   Mempool being unregistered.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister(struct mlx5_dev_ctx_shared *sh,
+				       struct rte_mempool *mp)
+{
+	struct mlx5_mp_id mp_id;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	if (mlx5_mr_mempool_unregister(&sh->share_cache, mp, &mp_id) < 0)
+		DRV_LOG(WARNING, "Failed to unregister mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to register mempools
+ * for the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_register_cb(struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+	int ret;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	ret = mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp, &mp_id);
+	if (ret < 0 && rte_errno != EEXIST)
+		DRV_LOG(ERR, "Failed to register existing mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to unregister mempools
+ * from the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister_cb(struct rte_mempool *mp, void *arg)
+{
+	mlx5_dev_ctx_shared_mempool_unregister
+				((struct mlx5_dev_ctx_shared *)arg, mp);
+}
+
+/**
+ * Mempool life cycle callback for Ethernet devices.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   Associated mempool.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_event_cb(enum rte_mempool_event event,
+				     struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+
+	switch (event) {
+	case RTE_MEMPOOL_EVENT_READY:
+		mlx5_mp_id_init(&mp_id, 0);
+		if (mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp,
+					     &mp_id) < 0)
+			DRV_LOG(ERR, "Failed to register new mempool %s for PD %p: %s",
+				mp->name, sh->pd, rte_strerror(rte_errno));
+		break;
+	case RTE_MEMPOOL_EVENT_DESTROY:
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+		break;
+	}
+}
+
+/**
+ * Callback used when implicit mempool registration is disabled
+ * in order to track Rx mempool destruction.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   An Rx mempool registered explicitly when the port is started.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_rx_mempool_event_cb(enum rte_mempool_event event,
+					struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+
+	if (event == RTE_MEMPOOL_EVENT_DESTROY)
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+}
+
+int
+mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = priv->sh;
+	int ret;
+
+	/* Check if we only need to track Rx mempool destruction. */
+	if (!priv->config.mr_mempool_reg_en) {
+		ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+		return ret == 0 || rte_errno == EEXIST ? 0 : ret;
+	}
+	/* Callback for this shared context may be already registered. */
+	ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret != 0 && rte_errno != EEXIST)
+		return ret;
+	/* Register mempools only once for this shared context. */
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_register_cb, sh);
+	return 0;
+}
+
 /**
  * Allocate shared device context. If there is multiport device the
  * master and representors will share this context, if there is single
@@ -1287,6 +1425,8 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 void
 mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 {
+	int ret;
+
 	pthread_mutex_lock(&mlx5_dev_ctx_list_mutex);
 #ifdef RTE_LIBRTE_MLX5_DEBUG
 	/* Check the object presence in the list. */
@@ -1307,6 +1447,15 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	if (--sh->refcnt)
 		goto exit;
+	/* Stop watching for mempool events and unregister all mempools. */
+	ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret < 0 && rte_errno == ENOENT)
+		ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_unregister_cb,
+				 sh);
 	/* Remove from memory callback device list. */
 	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
 	LIST_REMOVE(sh, mem_event_cb);
@@ -1997,6 +2146,8 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
 		config->decap_en = !!tmp;
 	} else if (strcmp(MLX5_ALLOW_DUPLICATE_PATTERN, key) == 0) {
 		config->allow_duplicate_pattern = !!tmp;
+	} else if (strcmp(MLX5_MR_MEMPOOL_REG_EN, key) == 0) {
+		config->mr_mempool_reg_en = !!tmp;
 	} else {
 		DRV_LOG(WARNING, "%s: unknown parameter", key);
 		rte_errno = EINVAL;
@@ -2058,6 +2209,7 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
 		MLX5_SYS_MEM_EN,
 		MLX5_DECAP_EN,
 		MLX5_ALLOW_DUPLICATE_PATTERN,
+		MLX5_MR_MEMPOOL_REG_EN,
 		NULL,
 	};
 	struct rte_kvargs *kvlist;
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 3581414b78..fe533fcc81 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -155,6 +155,13 @@ struct mlx5_flow_dump_ack {
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
+/** Initialize a multi-process ID. */
+static inline void
+mlx5_mp_id_init(struct mlx5_mp_id *mp_id, uint16_t port_id)
+{
+	mp_id->port_id = port_id;
+	strlcpy(mp_id->name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+}
 
 LIST_HEAD(mlx5_dev_list, mlx5_dev_ctx_shared);
 
@@ -270,6 +277,8 @@ struct mlx5_dev_config {
 	unsigned int dv_miss_info:1; /* restore packet after partial hw miss */
 	unsigned int allow_duplicate_pattern:1;
 	/* Allow/Prevent the duplicate rules pattern. */
+	unsigned int mr_mempool_reg_en:1;
+	/* Allow/prevent implicit mempool memory registration. */
 	struct {
 		unsigned int enabled:1; /* Whether MPRQ is enabled. */
 		unsigned int stride_num_n; /* Number of strides. */
@@ -1498,6 +1507,7 @@ struct mlx5_dev_ctx_shared *
 mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 			   const struct mlx5_dev_config *config);
 void mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh);
+int mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev);
 void mlx5_free_table_hash_list(struct mlx5_priv *priv);
 int mlx5_alloc_table_hash_list(struct mlx5_priv *priv);
 void mlx5_set_min_inline(struct mlx5_dev_spawn_data *spawn,
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 44afda731f..55d27b50b9 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -65,30 +65,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Bottom-half of LKey search on Rx.
- *
- * @param rxq
- *   Pointer to Rx queue structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-uint32_t
-mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
-{
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
-	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
-	struct mlx5_priv *priv = rxq_ctrl->priv;
-
-	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, mr_ctrl, addr,
-				  priv->config.mr_ext_memseg_en);
-}
-
 /**
  * Bottom-half of LKey search on Tx.
  *
@@ -128,9 +104,36 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 uint32_t
 mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 {
+	struct mlx5_txq_ctrl *txq_ctrl =
+		container_of(txq, struct mlx5_txq_ctrl, txq);
+	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
+	struct mlx5_priv *priv = txq_ctrl->priv;
 	uintptr_t addr = (uintptr_t)mb->buf_addr;
 	uint32_t lkey;
 
+	if (priv->config.mr_mempool_reg_en) {
+		struct rte_mempool *mp = NULL;
+		struct mlx5_mprq_buf *buf;
+
+		if (!RTE_MBUF_HAS_EXTBUF(mb)) {
+			mp = mlx5_mb2mp(mb);
+		} else if (mb->shinfo->free_cb == mlx5_mprq_buf_free_cb) {
+			/* Recover MPRQ mempool. */
+			buf = mb->shinfo->fcb_opaque;
+			mp = buf->mp;
+		}
+		if (mp != NULL) {
+			lkey = mlx5_mr_mempool2mr_bh(&priv->sh->share_cache,
+						     mr_ctrl, mp, addr);
+			/*
+			 * Lookup can only fail on invalid input, e.g. "addr"
+			 * is not from "mp" or "mp" has MEMPOOL_F_NON_IO set.
+			 */
+			if (lkey != UINT32_MAX)
+				return lkey;
+		}
+		/* Fallback for generic mechanism in corner cases. */
+	}
 	lkey = mlx5_tx_addr2mr_bh(txq, addr);
 	if (lkey == UINT32_MAX && rte_errno == ENXIO) {
 		/* Mempool may have externally allocated memory. */
@@ -392,72 +395,3 @@ mlx5_tx_update_ext_mp(struct mlx5_txq_data *txq, uintptr_t addr,
 	mlx5_mr_update_ext_mp(ETH_DEV(priv), mr_ctrl, mp);
 	return mlx5_tx_addr2mr_bh(txq, addr);
 }
-
-/* Called during rte_mempool_mem_iter() by mlx5_mr_update_mp(). */
-static void
-mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
-		     struct rte_mempool_memhdr *memhdr,
-		     unsigned mem_idx __rte_unused)
-{
-	struct mr_update_mp_data *data = opaque;
-	struct rte_eth_dev *dev = data->dev;
-	struct mlx5_priv *priv = dev->data->dev_private;
-
-	uint32_t lkey;
-
-	/* Stop iteration if failed in the previous walk. */
-	if (data->ret < 0)
-		return;
-	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr,
-				  priv->config.mr_ext_memseg_en);
-	if (lkey == UINT32_MAX)
-		data->ret = -1;
-}
-
-/**
- * Register entire memory chunks in a Mempool.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param mp
- *   Pointer to registering Mempool.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-int
-mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		  struct rte_mempool *mp)
-{
-	struct mr_update_mp_data data = {
-		.dev = dev,
-		.mr_ctrl = mr_ctrl,
-		.ret = 0,
-	};
-	uint32_t flags = rte_pktmbuf_priv_flags(mp);
-
-	if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
-		/*
-		 * The pinned external buffer should be registered for DMA
-		 * operations by application. The mem_list of the pool contains
-		 * the list of chunks with mbuf structures w/o built-in data
-		 * buffers and DMA actually does not happen there, no need
-		 * to create MR for these chunks.
-		 */
-		return 0;
-	}
-	DRV_LOG(DEBUG, "Port %u Rx queue registering mp %s "
-		       "having %u chunks.", dev->data->port_id,
-		       mp->name, mp->nb_mem_chunks);
-	rte_mempool_mem_iter(mp, mlx5_mr_update_mp_cb, &data);
-	if (data.ret < 0 && rte_errno == ENXIO) {
-		/* Mempool may have externally allocated memory. */
-		return mlx5_mr_update_ext_mp(dev, mr_ctrl, mp);
-	}
-	return data.ret;
-}
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 4a7fab6df2..c984e777b5 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -22,7 +22,5 @@
 
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
-int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		      struct rte_mempool *mp);
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 2b7ad3e48b..1b00076fe7 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -275,13 +275,11 @@ uint16_t mlx5_rx_burst_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 				uint16_t pkts_n);
 
-/* mlx5_mr.c */
-
-uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
+static int mlx5_rxq_mprq_enabled(struct mlx5_rxq_data *rxq);
 
 /**
- * Query LKey from a packet buffer for Rx. No need to flush local caches for Rx
- * as mempool is pre-configured and static.
+ * Query LKey from a packet buffer for Rx. No need to flush local caches
+ * as the Rx mempool database entries are valid for the lifetime of the queue.
  *
  * @param rxq
  *   Pointer to Rx queue structure.
@@ -290,11 +288,14 @@ uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
  *
  * @return
  *   Searched LKey on success, UINT32_MAX on no match.
+ *   This function always succeeds on valid input.
  */
 static __rte_always_inline uint32_t
 mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 {
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
+	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct rte_mempool *mp;
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
@@ -302,8 +303,14 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
-	/* Take slower bottom-half (Binary Search) on miss. */
-	return mlx5_rx_addr2mr_bh(rxq, addr);
+	/*
+	 * Slower search in the mempool database on miss.
+	 * During queue creation rxq->sh is not yet set, so we use rxq_ctrl.
+	 */
+	rxq_ctrl = container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	mp = mlx5_rxq_mprq_enabled(rxq) ? rxq->mprq_mp : rxq->mp;
+	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->priv->sh->share_cache,
+				     mr_ctrl, mp, addr);
 }
 
 #define mlx5_rx_mb2mr(rxq, mb) mlx5_rx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index b68443bed5..247f36e5d7 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1162,6 +1162,7 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 	unsigned int strd_sz_n = 0;
 	unsigned int i;
 	unsigned int n_ibv = 0;
+	int ret;
 
 	if (!mlx5_mprq_enabled(dev))
 		return 0;
@@ -1241,6 +1242,16 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
+	ret = mlx5_mr_mempool_register(&priv->sh->share_cache, priv->sh->pd,
+				       mp, &priv->mp_id);
+	if (ret < 0 && rte_errno != EEXIST) {
+		ret = rte_errno;
+		DRV_LOG(ERR, "port %u failed to register a mempool for Multi-Packet RQ",
+			dev->data->port_id);
+		rte_mempool_free(mp);
+		rte_errno = ret;
+		return -rte_errno;
+	}
 	priv->mprq_mp = mp;
 exit:
 	/* Set mempool for each Rx queue. */
@@ -1443,6 +1454,8 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		/* rte_errno is already set. */
 		goto error;
 	}
+	/* Rx queues don't use this pointer, but we want a valid structure. */
+	tmpl->rxq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	tmpl->socket = socket;
 	if (dev->data->dev_conf.intr_conf.rxq)
 		tmpl->irq = 1;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 54173bfacb..3cbf5816a1 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -105,6 +105,59 @@ mlx5_txq_start(struct rte_eth_dev *dev)
 	return -rte_errno;
 }
 
+/**
+ * Translate the chunk address to an MR key in order to put it into the cache.
+ */
+static void
+mlx5_rxq_mempool_register_cb(struct rte_mempool *mp, void *opaque,
+			     struct rte_mempool_memhdr *memhdr,
+			     unsigned int idx)
+{
+	struct mlx5_rxq_data *rxq = opaque;
+
+	RTE_SET_USED(mp);
+	RTE_SET_USED(idx);
+	mlx5_rx_addr2mr(rxq, (uintptr_t)memhdr->addr);
+}
+
+/**
+ * Register Rx queue mempools and fill the Rx queue cache.
+ * This function tolerates repeated mempool registration.
+ *
+ * @param[in] rxq_ctrl
+ *   Rx queue control data.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+static int
+mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	struct mlx5_priv *priv = rxq_ctrl->priv;
+	struct rte_mempool *mp;
+	uint32_t s;
+	int ret = 0;
+
+	mlx5_mr_flush_local_cache(&rxq_ctrl->rxq.mr_ctrl);
+	/* MPRQ mempool is registered on creation, just fill the cache. */
+	if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
+		rte_mempool_mem_iter(rxq_ctrl->rxq.mprq_mp,
+				     mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+		return 0;
+	}
+	for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++) {
+		mp = rxq_ctrl->rxq.rxseg[s].mp;
+		ret = mlx5_mr_mempool_register(&priv->sh->share_cache,
+					       priv->sh->pd, mp, &priv->mp_id);
+		if (ret < 0 && rte_errno != EEXIST)
+			return ret;
+		rte_mempool_mem_iter(mp, mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+	}
+	return 0;
+}
+
 /**
  * Stop traffic on Rx queues.
  *
@@ -152,18 +205,13 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (!rxq_ctrl)
 			continue;
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-			/* Pre-register Rx mempools. */
-			if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
-				mlx5_mr_update_mp(dev, &rxq_ctrl->rxq.mr_ctrl,
-						  rxq_ctrl->rxq.mprq_mp);
-			} else {
-				uint32_t s;
-
-				for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++)
-					mlx5_mr_update_mp
-						(dev, &rxq_ctrl->rxq.mr_ctrl,
-						rxq_ctrl->rxq.rxseg[s].mp);
-			}
+			/*
+			 * Pre-register the mempools. Regardless of whether
+			 * the implicit registration is enabled or not,
+			 * Rx mempool destruction is tracked to free MRs.
+			 */
+			if (mlx5_rxq_mempool_register(rxq_ctrl) < 0)
+				goto error;
 			ret = rxq_alloc_elts(rxq_ctrl);
 			if (ret)
 				goto error;
@@ -1124,6 +1172,11 @@ mlx5_dev_start(struct rte_eth_dev *dev)
 			dev->data->port_id, strerror(rte_errno));
 		goto error;
 	}
+	if (mlx5_dev_ctx_shared_mempool_subscribe(dev) != 0) {
+		DRV_LOG(ERR, "port %u failed to subscribe for mempool life cycle: %s",
+			dev->data->port_id, rte_strerror(rte_errno));
+		goto error;
+	}
 	rte_wmb();
 	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
 	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
diff --git a/drivers/net/mlx5/windows/mlx5_os.c b/drivers/net/mlx5/windows/mlx5_os.c
index 26fa927039..149253d174 100644
--- a/drivers/net/mlx5/windows/mlx5_os.c
+++ b/drivers/net/mlx5/windows/mlx5_os.c
@@ -1116,6 +1116,7 @@ mlx5_os_net_probe(struct rte_device *dev)
 	dev_config.txqs_inline = MLX5_ARG_UNSET;
 	dev_config.vf_nl_en = 0;
 	dev_config.mr_ext_memseg_en = 1;
+	dev_config.mr_mempool_reg_en = 1;
 	dev_config.mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	dev_config.mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	dev_config.dv_esw_en = 0;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v8 0/4] net/mlx5: implicit mempool registration
  2021-10-18 10:01           ` [dpdk-dev] [PATCH v7 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                               ` (3 preceding siblings ...)
  2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
@ 2021-10-18 14:40             ` Dmitry Kozlyuk
  2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 1/4] mempool: add event callbacks Dmitry Kozlyuk
                                 ` (4 more replies)
  4 siblings, 5 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 14:40 UTC (permalink / raw)
  To: dev

MLX5 hardware has its internal IOMMU where PMD registers the memory.
On the data path, PMD translates VA into a key consumed by the device
IOMMU.  It is impractical for the PMD to register all allocated memory
because of increased lookup cost both in HW and SW.  Most often mbuf
memory comes from mempools, so if PMD tracks them, it can almost always
have mbuf memory registered before an mbuf hits the PMD. This patchset
adds such tracking in the PMD and internal API to support it.

Please see [1] for the discussion of the patch 2/4
and how it can be useful outside of the MLX5 PMD.

[1]: http://inbox.dpdk.org/dev/CH0PR12MB509112FADB778AB28AF3771DB9F99@CH0PR12MB5091.namprd12.prod.outlook.com/

v8:
    Fix mempool_autotest failure on Ubuntu 18.04 (CI).
v7 (internal CI):
    1. Fix unit test compilation issues with GCC.
    2. Keep rte_mempool_event description non-internal: Doxygen treats
       it as not documented otherwise, "doc" target fails.
v6:
    Fix compilation issue in proc-info (CI).
v5:
    1. Change non-IO flag inference + various fixes (Andrew).
    2. Fix callback unregistration from secondary processes (Olivier).
    3. Support non-IO flag in proc-dump (David).
    4. Fix the usage of locks (Olivier).
    5. Avoid resource leaks in unit test (Olivier).
v4: (Andrew)
    1. Improve mempool event callbacks unit tests and documentation.
    2. Make MEMPOOL_F_NON_IO internal and automatically inferred.
       Add unit tests for the inference logic.
v3: Improve wording and naming; fix typos (Thomas).
v2 (internal review and testing):
    1. Change tracked mempool event from being created (CREATE) to being
       fully populated (READY), which is the state PMD is interested in.
    2. Unit test the new mempool callback API.
    3. Remove bogus "error" messages in normal conditions.
    4. Fixes in PMD.


Dmitry Kozlyuk (4):
  mempool: add event callbacks
  mempool: add non-IO flag
  common/mlx5: add mempool registration facilities
  net/mlx5: support mempool registration

 app/proc-info/main.c                   |   6 +-
 app/test/test_mempool.c                | 362 +++++++++++++++
 doc/guides/nics/mlx5.rst               |  13 +
 doc/guides/rel_notes/release_21_11.rst |   9 +
 drivers/common/mlx5/mlx5_common_mp.c   |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h   |  14 +
 drivers/common/mlx5/mlx5_common_mr.c   | 580 +++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h   |  17 +
 drivers/common/mlx5/version.map        |   5 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 ++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++
 drivers/net/mlx5/mlx5.h                |  10 +
 drivers/net/mlx5/mlx5_mr.c             | 120 ++---
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 +-
 drivers/net/mlx5/mlx5_rxq.c            |  13 +
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++-
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 lib/mempool/rte_mempool.c              | 134 ++++++
 lib/mempool/rte_mempool.h              |  64 +++
 lib/mempool/version.map                |   8 +
 22 files changed, 1588 insertions(+), 118 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v8 1/4] mempool: add event callbacks
  2021-10-18 14:40             ` [dpdk-dev] [PATCH v8 0/4] net/mlx5: implicit " Dmitry Kozlyuk
@ 2021-10-18 14:40               ` Dmitry Kozlyuk
  2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 2/4] mempool: add non-IO flag Dmitry Kozlyuk
                                 ` (3 subsequent siblings)
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 14:40 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Andrew Rybchenko, Olivier Matz, Ray Kinsella

Data path performance can benefit if the PMD knows which memory it will
need to handle in advance, before the first mbuf is sent to the PMD.
It is impractical, however, to consider all allocated memory for this
purpose. Most often mbuf memory comes from mempools that can come and
go. PMD can enumerate existing mempools on device start, but it also
needs to track creation and destruction of mempools after the forwarding
starts but before an mbuf from the new mempool is sent to the device.

Add an API to register callback for mempool life cycle events:
* rte_mempool_event_callback_register()
* rte_mempool_event_callback_unregister()
Currently tracked events are:
* RTE_MEMPOOL_EVENT_READY (after populating a mempool)
* RTE_MEMPOOL_EVENT_DESTROY (before freeing a mempool)
Provide a unit test for the new API.
The new API is internal, because it is primarily demanded by PMDs that
may need to deal with any mempools and do not control their creation,
while an application, on the other hand, knows which mempools it creates
and doesn't care about internal mempools PMDs might create.
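
For reference, a minimal usage sketch of the new callback API from a
PMD-style consumer (illustrative only, not part of the patch; the log type,
messages, and function names are placeholders):

#include <rte_common.h>
#include <rte_errno.h>
#include <rte_log.h>
#include <rte_mempool.h>

static void
pool_event_cb(enum rte_mempool_event event, struct rte_mempool *mp,
	      void *user_data)
{
	RTE_SET_USED(user_data);
	if (event == RTE_MEMPOOL_EVENT_READY)
		RTE_LOG(DEBUG, USER1, "mempool %s populated\n", mp->name);
	else if (event == RTE_MEMPOOL_EVENT_DESTROY)
		RTE_LOG(DEBUG, USER1, "mempool %s being freed\n", mp->name);
}

static int
subscribe_pool_events(void)
{
	/* Typically done once at device start; EEXIST means the same
	 * callback/argument pair is already registered.
	 */
	if (rte_mempool_event_callback_register(pool_event_cb, NULL) != 0 &&
	    rte_errno != EEXIST)
		return -rte_errno;
	return 0;
}

The matching rte_mempool_event_callback_unregister(pool_event_cb, NULL)
call would then be made when the device is closed.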

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/test/test_mempool.c   | 248 ++++++++++++++++++++++++++++++++++++++
 lib/mempool/rte_mempool.c | 124 +++++++++++++++++++
 lib/mempool/rte_mempool.h |  62 ++++++++++
 lib/mempool/version.map   |   8 ++
 4 files changed, 442 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index 66bc8d86b7..c39c83256e 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -14,6 +14,7 @@
 #include <rte_common.h>
 #include <rte_log.h>
 #include <rte_debug.h>
+#include <rte_errno.h>
 #include <rte_memory.h>
 #include <rte_launch.h>
 #include <rte_cycles.h>
@@ -489,6 +490,245 @@ test_mp_mem_init(struct rte_mempool *mp,
 	data->ret = 0;
 }
 
+struct test_mempool_events_data {
+	struct rte_mempool *mp;
+	enum rte_mempool_event event;
+	bool invoked;
+};
+
+static void
+test_mempool_events_cb(enum rte_mempool_event event,
+		       struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_data *data = user_data;
+
+	data->mp = mp;
+	data->event = event;
+	data->invoked = true;
+}
+
+static int
+test_mempool_events(int (*populate)(struct rte_mempool *mp))
+{
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { goto fail; } while (0)
+
+	static const size_t CB_NUM = 3;
+	static const size_t MP_NUM = 2;
+
+	struct test_mempool_events_data data[CB_NUM];
+	struct rte_mempool *mp[MP_NUM], *freed;
+	char name[RTE_MEMPOOL_NAMESIZE];
+	size_t i, j;
+	int ret;
+
+	memset(mp, 0, sizeof(mp));
+	for (i = 0; i < CB_NUM; i++) {
+		ret = rte_mempool_event_callback_register
+				(test_mempool_events_cb, &data[i]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to register the callback %zu: %s",
+				      i, rte_strerror(rte_errno));
+	}
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb, mp);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback");
+	/* NULL argument has no special meaning in this API. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    NULL);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback with NULL argument");
+
+	/* Create mempool 0 that will be observed by all callbacks. */
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty0");
+	mp[0] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[0], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	for (j = 0; j < CB_NUM; j++)
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, false,
+				      "Callback %zu invoked on %s mempool creation",
+				      j, name);
+
+	rte_mempool_set_ops_byname(mp[0], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[0]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(rte_errno));
+	for (j = 0; j < CB_NUM; j++) {
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					"Callback %zu not invoked on mempool %s population",
+					j, name);
+		RTE_TEST_ASSERT_EQUAL(data[j].event,
+					RTE_MEMPOOL_EVENT_READY,
+					"Wrong callback invoked, expected READY");
+		RTE_TEST_ASSERT_EQUAL(data[j].mp, mp[0],
+					"Callback %zu invoked for a wrong mempool instead of %s",
+					j, name);
+	}
+
+	/* Check that unregistered callback 0 observes no events. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    &data[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister callback 0: %s",
+			      rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty1");
+	mp[1] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[1], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp[1], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[1]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[1]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+			      "Unregistered callback 0 invoked on %s mempool populaton",
+			      name);
+
+	for (i = 0; i < MP_NUM; i++) {
+		memset(&data, 0, sizeof(data));
+		sprintf(name, "empty%zu", i);
+		rte_mempool_free(mp[i]);
+		/*
+		 * Save pointer to check that it was passed to the callback,
+		 * but put NULL into the array in case cleanup is called early.
+		 */
+		freed = mp[i];
+		mp[i] = NULL;
+		for (j = 1; j < CB_NUM; j++) {
+			RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					      "Callback %zu not invoked on mempool %s destruction",
+					      j, name);
+			RTE_TEST_ASSERT_EQUAL(data[j].event,
+					      RTE_MEMPOOL_EVENT_DESTROY,
+					      "Wrong callback invoked, expected DESTROY");
+			RTE_TEST_ASSERT_EQUAL(data[j].mp, freed,
+					      "Callback %zu invoked for a wrong mempool instead of %s",
+					      j, name);
+		}
+		RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+				      "Unregistered callback 0 invoked on %s mempool destruction",
+				      name);
+	}
+
+	for (j = 1; j < CB_NUM; j++) {
+		ret = rte_mempool_event_callback_unregister
+					(test_mempool_events_cb, &data[j]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister the callback %zu: %s",
+				      j, rte_strerror(rte_errno));
+	}
+	return TEST_SUCCESS;
+
+fail:
+	for (j = 0; j < CB_NUM; j++)
+		rte_mempool_event_callback_unregister
+					(test_mempool_events_cb, &data[j]);
+	for (i = 0; i < MP_NUM; i++)
+		rte_mempool_free(mp[i]);
+	return TEST_FAILED;
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+}
+
+struct test_mempool_events_safety_data {
+	bool invoked;
+	int (*api_func)(rte_mempool_event_callback *func, void *user_data);
+	rte_mempool_event_callback *cb_func;
+	void *cb_user_data;
+	int ret;
+};
+
+static void
+test_mempool_events_safety_cb(enum rte_mempool_event event,
+			      struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_safety_data *data = user_data;
+
+	RTE_SET_USED(event);
+	RTE_SET_USED(mp);
+	data->invoked = true;
+	data->ret = data->api_func(data->cb_func, data->cb_user_data);
+}
+
+static int
+test_mempool_events_safety(void)
+{
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { \
+		ret = TEST_FAILED; \
+		goto exit; \
+	} while (0)
+
+	struct test_mempool_events_data data;
+	struct test_mempool_events_safety_data sdata[2];
+	struct rte_mempool *mp;
+	size_t i;
+	int ret;
+
+	/* removes itself */
+	sdata[0].api_func = rte_mempool_event_callback_unregister;
+	sdata[0].cb_func = test_mempool_events_safety_cb;
+	sdata[0].cb_user_data = &sdata[0];
+	sdata[0].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[0]);
+	/* inserts a callback after itself */
+	sdata[1].api_func = rte_mempool_event_callback_register;
+	sdata[1].cb_func = test_mempool_events_cb;
+	sdata[1].cb_user_data = &data;
+	sdata[1].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[1]);
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(rte_errno));
+
+	RTE_TEST_ASSERT_EQUAL(sdata[0].ret, 0, "Callback failed to unregister itself: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, 0, "Failed to insert a new callback: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data.invoked, false,
+			      "Inserted callback is invoked on mempool population");
+
+	memset(&data, 0, sizeof(data));
+	sdata[0].invoked = false;
+	rte_mempool_free(mp);
+	mp = NULL;
+	RTE_TEST_ASSERT_EQUAL(sdata[0].invoked, false,
+			      "Callback that unregistered itself was called");
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, -EEXIST,
+			      "New callback inserted twice");
+	RTE_TEST_ASSERT_EQUAL(data.invoked, true,
+			      "Inserted callback is not invoked on mempool destruction");
+
+	rte_mempool_event_callback_unregister(test_mempool_events_cb, &data);
+	for (i = 0; i < RTE_DIM(sdata); i++)
+		rte_mempool_event_callback_unregister
+				(test_mempool_events_safety_cb, &sdata[i]);
+	ret = TEST_SUCCESS;
+
+exit:
+	/* cleanup, don't care which callbacks are already removed */
+	rte_mempool_event_callback_unregister(test_mempool_events_cb, &data);
+	for (i = 0; i < RTE_DIM(sdata); i++)
+		rte_mempool_event_callback_unregister
+				(test_mempool_events_safety_cb, &sdata[i]);
+	/* in case of failure before the planned destruction */
+	rte_mempool_free(mp);
+	return ret;
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+}
+
 static int
 test_mempool(void)
 {
@@ -666,6 +906,14 @@ test_mempool(void)
 	if (test_mempool_basic(default_pool, 1) < 0)
 		GOTO_ERR(ret, err);
 
+	/* test mempool event callbacks */
+	if (test_mempool_events(rte_mempool_populate_default) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events(rte_mempool_populate_anon) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events_safety() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 607419ccaf..8810d08ab5 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -42,6 +42,18 @@ static struct rte_tailq_elem rte_mempool_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_mempool_tailq)
 
+TAILQ_HEAD(mempool_callback_list, rte_tailq_entry);
+
+static struct rte_tailq_elem callback_tailq = {
+	.name = "RTE_MEMPOOL_CALLBACK",
+};
+EAL_REGISTER_TAILQ(callback_tailq)
+
+/* Invoke all registered mempool event callbacks. */
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp);
+
 #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
 #define CALC_CACHE_FLUSHTHRESH(c)	\
 	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
@@ -360,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* Report the mempool as ready only when fully populated. */
+	if (mp->populated_size >= mp->size)
+		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
+
 	rte_mempool_trace_populate_iova(mp, vaddr, iova, len, free_cb, opaque);
 	return i;
 
@@ -722,6 +738,7 @@ rte_mempool_free(struct rte_mempool *mp)
 	}
 	rte_mcfg_tailq_write_unlock();
 
+	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
 	rte_mempool_trace_free(mp);
 	rte_mempool_free_memchunks(mp);
 	rte_mempool_ops_free(mp);
@@ -1356,3 +1373,110 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
 
 	rte_mcfg_mempool_read_unlock();
 }
+
+struct mempool_callback_data {
+	rte_mempool_event_callback *func;
+	void *user_data;
+};
+
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te;
+	void *tmp_te;
+
+	rte_mcfg_tailq_read_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback_data *cb = te->data;
+		rte_mcfg_tailq_read_unlock();
+		cb->func(event, mp, cb->user_data);
+		rte_mcfg_tailq_read_lock();
+	}
+	rte_mcfg_tailq_read_unlock();
+}
+
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback_data *cb;
+	void *tmp_te;
+	int ret;
+
+	if (func == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	rte_mcfg_tailq_write_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		cb = te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			ret = -EEXIST;
+			goto exit;
+		}
+	}
+
+	te = rte_zmalloc("mempool_cb_tail_entry", sizeof(*te), 0);
+	if (te == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback tailq entry!\n");
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb = rte_malloc("mempool_cb_data", sizeof(*cb), 0);
+	if (cb == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback!\n");
+		rte_free(te);
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb->func = func;
+	cb->user_data = user_data;
+	te->data = cb;
+	TAILQ_INSERT_TAIL(list, te, next);
+	ret = 0;
+
+exit:
+	rte_mcfg_tailq_write_unlock();
+	rte_errno = -ret;
+	return ret;
+}
+
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback_data *cb;
+	int ret = -ENOENT;
+
+	rte_mcfg_tailq_write_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH(te, list, next) {
+		cb = te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			TAILQ_REMOVE(list, te, next);
+			ret = 0;
+			break;
+		}
+	}
+	rte_mcfg_tailq_write_unlock();
+
+	if (ret == 0) {
+		rte_free(te);
+		rte_free(cb);
+	}
+	rte_errno = -ret;
+	return ret;
+}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 88bcbc51ef..5799d4a705 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1769,6 +1769,68 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
 int
 rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
 
+/**
+ * Mempool event type.
+ * @internal
+ */
+enum rte_mempool_event {
+	/** Occurs after a mempool is fully populated. */
+	RTE_MEMPOOL_EVENT_READY = 0,
+	/** Occurs before the destruction of a mempool begins. */
+	RTE_MEMPOOL_EVENT_DESTROY = 1,
+};
+
+/**
+ * @internal
+ * Mempool event callback.
+ *
+ * rte_mempool_event_callback_register() may be called from within the callback,
+ * but the callbacks registered this way will not be invoked for the same event.
+ * rte_mempool_event_callback_unregister() may only be safely called
+ * to remove the running callback.
+ */
+typedef void (rte_mempool_event_callback)(
+		enum rte_mempool_event event,
+		struct rte_mempool *mp,
+		void *user_data);
+
+/**
+ * @internal
+ * Register a callback function invoked on mempool life cycle events.
+ * The function will be invoked in the process
+ * that performs an action which triggers the callback.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data);
+
+/**
+ * @internal
+ * Unregister a callback added with rte_mempool_event_callback_register().
+ * @p func and @p user_data must exactly match registration parameters.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/mempool/version.map b/lib/mempool/version.map
index 9f77da6fff..1b7d7c5456 100644
--- a/lib/mempool/version.map
+++ b/lib/mempool/version.map
@@ -64,3 +64,11 @@ EXPERIMENTAL {
 	__rte_mempool_trace_ops_free;
 	__rte_mempool_trace_set_ops_byname;
 };
+
+INTERNAL {
+	global:
+
+	# added in 21.11
+	rte_mempool_event_callback_register;
+	rte_mempool_event_callback_unregister;
+};
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v8 2/4] mempool: add non-IO flag
  2021-10-18 14:40             ` [dpdk-dev] [PATCH v8 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-10-18 14:40               ` Dmitry Kozlyuk
  2021-10-29  3:30                 ` Jiang, YuX
  2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
                                 ` (2 subsequent siblings)
  4 siblings, 1 reply; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 14:40 UTC (permalink / raw)
  To: dev
  Cc: David Marchand, Matan Azrad, Andrew Rybchenko, Maryam Tahhan,
	Reshma Pattan, Olivier Matz

Mempool is a generic allocator that is not necessarily used
for device IO operations, so its memory may never be needed for DMA.
Add the MEMPOOL_F_NON_IO flag to mark such mempools automatically:
a) if their objects are not contiguous;
b) if IOVA is not available for any object.
Other components can inspect this flag
in order to optimize their memory management.

Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html
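
For illustration only (not part of this patch), a component that does
per-mempool DMA setup could consult the flag roughly as follows;
dma_map_pool() is a placeholder, not a DPDK API:

    /* Minimal sketch, assuming <rte_mempool.h> is included. */
    static int
    maybe_map_pool_for_dma(struct rte_mempool *mp)
    {
        if (mp->flags & MEMPOOL_F_NON_IO)
            return 0; /* No object from this pool can reach a device. */
        return dma_map_pool(mp); /* driver-specific mapping, placeholder */
    }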

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/proc-info/main.c                   |   6 +-
 app/test/test_mempool.c                | 114 +++++++++++++++++++++++++
 doc/guides/rel_notes/release_21_11.rst |   3 +
 lib/mempool/rte_mempool.c              |  10 +++
 lib/mempool/rte_mempool.h              |   2 +
 5 files changed, 133 insertions(+), 2 deletions(-)

diff --git a/app/proc-info/main.c b/app/proc-info/main.c
index a8e928fa9f..8ec9cadd79 100644
--- a/app/proc-info/main.c
+++ b/app/proc-info/main.c
@@ -1295,7 +1295,8 @@ show_mempool(char *name)
 				"\t  -- No cache align (%c)\n"
 				"\t  -- SP put (%c), SC get (%c)\n"
 				"\t  -- Pool created (%c)\n"
-				"\t  -- No IOVA config (%c)\n",
+				"\t  -- No IOVA config (%c)\n"
+				"\t  -- Not used for IO (%c)\n",
 				ptr->name,
 				ptr->socket_id,
 				(flags & MEMPOOL_F_NO_SPREAD) ? 'y' : 'n',
@@ -1303,7 +1304,8 @@ show_mempool(char *name)
 				(flags & MEMPOOL_F_SP_PUT) ? 'y' : 'n',
 				(flags & MEMPOOL_F_SC_GET) ? 'y' : 'n',
 				(flags & MEMPOOL_F_POOL_CREATED) ? 'y' : 'n',
-				(flags & MEMPOOL_F_NO_IOVA_CONTIG) ? 'y' : 'n');
+				(flags & MEMPOOL_F_NO_IOVA_CONTIG) ? 'y' : 'n',
+				(flags & MEMPOOL_F_NON_IO) ? 'y' : 'n');
 			printf("  - Size %u Cache %u element %u\n"
 				"  - header %u trailer %u\n"
 				"  - private data size %u\n",
diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index c39c83256e..81800b7122 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -12,6 +12,7 @@
 #include <sys/queue.h>
 
 #include <rte_common.h>
+#include <rte_eal_paging.h>
 #include <rte_log.h>
 #include <rte_debug.h>
 #include <rte_errno.h>
@@ -729,6 +730,111 @@ test_mempool_events_safety(void)
 #pragma pop_macro("RTE_TEST_TRACE_FAILURE")
 }
 
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { \
+		ret = TEST_FAILED; \
+		goto exit; \
+	} while (0)
+
+static int
+test_mempool_flag_non_io_set_when_no_iova_contig_set(void)
+{
+	struct rte_mempool *mp = NULL;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, MEMPOOL_F_NO_IOVA_CONTIG);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(), NULL);
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when NO_IOVA_CONTIG is set");
+	ret = TEST_SUCCESS;
+exit:
+	rte_mempool_free(mp);
+	return ret;
+}
+
+static int
+test_mempool_flag_non_io_unset_when_populated_with_valid_iova(void)
+{
+	void *virt = NULL;
+	rte_iova_t iova;
+	size_t page_size = RTE_PGSIZE_2M;
+	struct rte_mempool *mp = NULL;
+	int ret;
+
+	/*
+	 * Since objects from the pool are never used in the test,
+	 * we don't care about contiguous IOVA; on the other hand,
+	 * requiring it could cause spurious test failures.
+	 */
+	virt = rte_malloc("test_mempool", 3 * page_size, page_size);
+	RTE_TEST_ASSERT_NOT_NULL(virt, "Cannot allocate memory");
+	iova = rte_mem_virt2iova(virt);
+	RTE_TEST_ASSERT_NOT_EQUAL(iova, RTE_BAD_IOVA, "Cannot get IOVA");
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+
+	ret = rte_mempool_populate_iova(mp, RTE_PTR_ADD(virt, 1 * page_size),
+					RTE_BAD_IOVA, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when mempool is populated with only RTE_BAD_IOVA");
+
+	ret = rte_mempool_populate_iova(mp, virt, iova, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is not unset when mempool is populated with valid IOVA");
+
+	ret = rte_mempool_populate_iova(mp, RTE_PTR_ADD(virt, 2 * page_size),
+					RTE_BAD_IOVA, page_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is set even when some objects have valid IOVA");
+	ret = TEST_SUCCESS;
+
+exit:
+	rte_mempool_free(mp);
+	rte_free(virt);
+	return ret;
+}
+
+static int
+test_mempool_flag_non_io_unset_by_default(void)
+{
+	struct rte_mempool *mp;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is set by default");
+	ret = TEST_SUCCESS;
+exit:
+	rte_mempool_free(mp);
+	return ret;
+}
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+
 static int
 test_mempool(void)
 {
@@ -914,6 +1020,14 @@ test_mempool(void)
 	if (test_mempool_events_safety() < 0)
 		GOTO_ERR(ret, err);
 
+	/* test NON_IO flag inference */
+	if (test_mempool_flag_non_io_unset_by_default() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_set_when_no_iova_contig_set() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_unset_when_populated_with_valid_iova() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index d5435a64aa..f6bb5adeff 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -237,6 +237,9 @@ API Changes
   the crypto/security operation. This field will be used to communicate
   events such as soft expiry with IPsec in lookaside mode.
 
+* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
+  that objects from this pool will not be used for device IO (e.g. DMA).
+
 
 ABI Changes
 -----------
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 8810d08ab5..7d7d97d85d 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -372,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* At least some objects in the pool can now be used for IO. */
+	if (iova != RTE_BAD_IOVA)
+		mp->flags &= ~MEMPOOL_F_NON_IO;
+
 	/* Report the mempool as ready only when fully populated. */
 	if (mp->populated_size >= mp->size)
 		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
@@ -851,6 +855,12 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 		return NULL;
 	}
 
+	/*
+	 * No objects in the pool can be used for IO until it's populated
+	 * with at least some objects with valid IOVA.
+	 */
+	flags |= MEMPOOL_F_NON_IO;
+
 	/* "no cache align" imply "no spread" */
 	if (flags & MEMPOOL_F_NO_CACHE_ALIGN)
 		flags |= MEMPOOL_F_NO_SPREAD;
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 5799d4a705..b2e20c8855 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -257,6 +257,8 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
+/** Internal: no object from the pool can be used for device IO (DMA). */
+#define MEMPOOL_F_NON_IO         0x0040
 
 /**
  * @internal When debug is enabled, store some statistics.
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v8 3/4] common/mlx5: add mempool registration facilities
  2021-10-18 14:40             ` [dpdk-dev] [PATCH v8 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 1/4] mempool: add event callbacks Dmitry Kozlyuk
  2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 2/4] mempool: add non-IO flag Dmitry Kozlyuk
@ 2021-10-18 14:40               ` Dmitry Kozlyuk
  2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
  2021-10-18 22:43               ` [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 14:40 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko, Ray Kinsella, Anatoly Burakov

Add internal API to register mempools, that is, to create memory
regions (MR) for their memory and store them in a separate database.
The implementation handles multi-process coordination, so that class
drivers don't need to. Each protection domain has its own database.
Memory regions
can be shared within a database if they represent a single hugepage
covering one or more mempools entirely.

Add internal API to look up an MR key for an address that belongs
to a known mempool. It is the responsibility of a class driver
to extract the mempool from an mbuf.
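
As an illustration of the intended call flow (not part of this patch),
a class driver slow path could resolve an lkey for a direct mbuf roughly
as below; the shared MR cache and per-queue MR control come from the
driver's own context, and indirect or externally-attached mbufs need
the mempool recovered differently, as the net/mlx5 patch does:

    /* Sketch only: direct mbufs, share_cache/mr_ctrl set up elsewhere. */
    static uint32_t
    lkey_of_mbuf(struct mlx5_mr_share_cache *share_cache,
                 struct mlx5_mr_ctrl *mr_ctrl, struct rte_mbuf *mb)
    {
        struct rte_mempool *mp = mb->pool;
        uintptr_t addr = (uintptr_t)mb->buf_addr;

        /* Returns UINT32_MAX if addr is not within mp. */
        return mlx5_mr_mempool2mr_bh(share_cache, mr_ctrl, mp, addr);
    }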

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 drivers/common/mlx5/mlx5_common_mp.c |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h |  14 +
 drivers/common/mlx5/mlx5_common_mr.c | 580 +++++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h |  17 +
 drivers/common/mlx5/version.map      |   5 +
 5 files changed, 666 insertions(+)

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
index 673a7c31de..6dfc5535e0 100644
--- a/drivers/common/mlx5/mlx5_common_mp.c
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -54,6 +54,56 @@ mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
 	return ret;
 }
 
+/**
+ * @param mp_id
+ *   ID of the MP process.
+ * @param share_cache
+ *   Shared MR cache.
+ * @param pd
+ *   Protection domain.
+ * @param mempool
+ *   Mempool to register or unregister.
+ * @param reg
+ *   True to register the mempool, False to unregister.
+ */
+int
+mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_arg_mempool_reg *arg = &req->args.mempool_reg;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	enum mlx5_mp_req_type type;
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	type = reg ? MLX5_MP_REQ_MEMPOOL_REGISTER :
+		     MLX5_MP_REQ_MEMPOOL_UNREGISTER;
+	mp_init_msg(mp_id, &mp_req, type);
+	arg->share_cache = share_cache;
+	arg->pd = pd;
+	arg->mempool = mempool;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	mlx5_free(mp_rep.msgs);
+	return ret;
+}
+
 /**
  * Request Verbs queue state modification to the primary process.
  *
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
index 6829141fc7..527bf3cad8 100644
--- a/drivers/common/mlx5/mlx5_common_mp.h
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -14,6 +14,8 @@
 enum mlx5_mp_req_type {
 	MLX5_MP_REQ_VERBS_CMD_FD = 1,
 	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_MEMPOOL_REGISTER,
+	MLX5_MP_REQ_MEMPOOL_UNREGISTER,
 	MLX5_MP_REQ_START_RXTX,
 	MLX5_MP_REQ_STOP_RXTX,
 	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
@@ -33,6 +35,12 @@ struct mlx5_mp_arg_queue_id {
 	uint16_t queue_id; /* DPDK queue ID. */
 };
 
+struct mlx5_mp_arg_mempool_reg {
+	struct mlx5_mr_share_cache *share_cache;
+	void *pd; /* NULL for MLX5_MP_REQ_MEMPOOL_UNREGISTER */
+	struct rte_mempool *mempool;
+};
+
 /* Pameters for IPC. */
 struct mlx5_mp_param {
 	enum mlx5_mp_req_type type;
@@ -41,6 +49,8 @@ struct mlx5_mp_param {
 	RTE_STD_C11
 	union {
 		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_mempool_reg mempool_reg;
+		/* MLX5_MP_REQ_MEMPOOL_(UN)REGISTER */
 		struct mlx5_mp_arg_queue_state_modify state_modify;
 		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
 		struct mlx5_mp_arg_queue_id queue_id;
@@ -91,6 +101,10 @@ void mlx5_mp_uninit_secondary(const char *name);
 __rte_internal
 int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
 __rte_internal
+int mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg);
+__rte_internal
 int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
 				   struct mlx5_mp_arg_queue_state_modify *sm);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
index 98fe8698e2..2e039a4e70 100644
--- a/drivers/common/mlx5/mlx5_common_mr.c
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -2,7 +2,10 @@
  * Copyright 2016 6WIND S.A.
  * Copyright 2020 Mellanox Technologies, Ltd
  */
+#include <stddef.h>
+
 #include <rte_eal_memconfig.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_mempool.h>
 #include <rte_malloc.h>
@@ -21,6 +24,29 @@ struct mr_find_contig_memsegs_data {
 	const struct rte_memseg_list *msl;
 };
 
+/* Virtual memory range. */
+struct mlx5_range {
+	uintptr_t start;
+	uintptr_t end;
+};
+
+/** Memory region for a mempool. */
+struct mlx5_mempool_mr {
+	struct mlx5_pmd_mr pmd_mr;
+	uint32_t refcnt; /**< Number of mempools sharing this MR. */
+};
+
+/* Mempool registration. */
+struct mlx5_mempool_reg {
+	LIST_ENTRY(mlx5_mempool_reg) next;
+	/** Registered mempool, used to designate registrations. */
+	struct rte_mempool *mp;
+	/** Memory regions for the address ranges of the mempool. */
+	struct mlx5_mempool_mr *mrs;
+	/** Number of memory regions. */
+	unsigned int mrs_n;
+};
+
 /**
  * Expand B-tree table to a given size. Can't be called with holding
  * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
@@ -1191,3 +1217,557 @@ mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
 	rte_rwlock_read_unlock(&share_cache->rwlock);
 #endif
 }
+
+static int
+mlx5_range_compare_start(const void *lhs, const void *rhs)
+{
+	const struct mlx5_range *r1 = lhs, *r2 = rhs;
+
+	if (r1->start > r2->start)
+		return 1;
+	else if (r1->start < r2->start)
+		return -1;
+	return 0;
+}
+
+static void
+mlx5_range_from_mempool_chunk(struct rte_mempool *mp, void *opaque,
+			      struct rte_mempool_memhdr *memhdr,
+			      unsigned int idx)
+{
+	struct mlx5_range *ranges = opaque, *range = &ranges[idx];
+	uint64_t page_size = rte_mem_page_size();
+
+	RTE_SET_USED(mp);
+	range->start = RTE_ALIGN_FLOOR((uintptr_t)memhdr->addr, page_size);
+	range->end = RTE_ALIGN_CEIL(range->start + memhdr->len, page_size);
+}
+
+/**
+ * Get VA-contiguous ranges of the mempool memory.
+ * Each range start and end is aligned to the system page size.
+ *
+ * @param[in] mp
+ *   Analyzed mempool.
+ * @param[out] out
+ *   Receives the ranges, caller must release it with free().
+ * @param[out] out_n
+ *   Receives the number of @p out elements.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_get_mempool_ranges(struct rte_mempool *mp, struct mlx5_range **out,
+			unsigned int *out_n)
+{
+	struct mlx5_range *chunks;
+	unsigned int chunks_n = mp->nb_mem_chunks, contig_n, i;
+
+	/* Collect page-aligned memory ranges of the mempool. */
+	chunks = calloc(chunks_n, sizeof(chunks[0]));
+	if (chunks == NULL)
+		return -1;
+	rte_mempool_mem_iter(mp, mlx5_range_from_mempool_chunk, chunks);
+	/* Merge adjacent chunks and place them at the beginning. */
+	qsort(chunks, chunks_n, sizeof(chunks[0]), mlx5_range_compare_start);
+	contig_n = 1;
+	for (i = 1; i < chunks_n; i++)
+		if (chunks[i - 1].end != chunks[i].start) {
+			chunks[contig_n - 1].end = chunks[i - 1].end;
+			chunks[contig_n] = chunks[i];
+			contig_n++;
+		}
+	/* Extend the last contiguous chunk to the end of the mempool. */
+	chunks[contig_n - 1].end = chunks[i - 1].end;
+	*out = chunks;
+	*out_n = contig_n;
+	return 0;
+}
+
+/**
+ * Analyze mempool memory to select memory ranges to register.
+ *
+ * @param[in] mp
+ *   Mempool to analyze.
+ * @param[out] out
+ *   Receives memory ranges to register, aligned to the system page size.
+ *   The caller must release them with free().
+ * @param[out] out_n
+ *   Receives the number of @p out items.
+ * @param[out] share_hugepage
+ *   Receives True if the entire pool resides within a single hugepage.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_mempool_reg_analyze(struct rte_mempool *mp, struct mlx5_range **out,
+			 unsigned int *out_n, bool *share_hugepage)
+{
+	struct mlx5_range *ranges = NULL;
+	unsigned int i, ranges_n = 0;
+	struct rte_memseg_list *msl;
+
+	if (mlx5_get_mempool_ranges(mp, &ranges, &ranges_n) < 0) {
+		DRV_LOG(ERR, "Cannot get address ranges for mempool %s",
+			mp->name);
+		return -1;
+	}
+	/* Check if the hugepage of the pool can be shared. */
+	*share_hugepage = false;
+	msl = rte_mem_virt2memseg_list((void *)ranges[0].start);
+	if (msl != NULL) {
+		uint64_t hugepage_sz = 0;
+
+		/* Check that all ranges are on pages of the same size. */
+		for (i = 0; i < ranges_n; i++) {
+			if (hugepage_sz != 0 && hugepage_sz != msl->page_sz)
+				break;
+			hugepage_sz = msl->page_sz;
+		}
+		if (i == ranges_n) {
+			/*
+			 * If the entire pool is within one hugepage,
+			 * combine all ranges into one of the hugepage size.
+			 */
+			uintptr_t reg_start = ranges[0].start;
+			uintptr_t reg_end = ranges[ranges_n - 1].end;
+			uintptr_t hugepage_start =
+				RTE_ALIGN_FLOOR(reg_start, hugepage_sz);
+			uintptr_t hugepage_end = hugepage_start + hugepage_sz;
+			if (reg_end < hugepage_end) {
+				ranges[0].start = hugepage_start;
+				ranges[0].end = hugepage_end;
+				ranges_n = 1;
+				*share_hugepage = true;
+			}
+		}
+	}
+	*out = ranges;
+	*out_n = ranges_n;
+	return 0;
+}
+
+/** Create a registration object for the mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_create(struct rte_mempool *mp, unsigned int mrs_n)
+{
+	struct mlx5_mempool_reg *mpr = NULL;
+
+	mpr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+			  sizeof(*mpr) + mrs_n * sizeof(mpr->mrs[0]),
+			  RTE_CACHE_LINE_SIZE, SOCKET_ID_ANY);
+	if (mpr == NULL) {
+		DRV_LOG(ERR, "Cannot allocate mempool %s registration object",
+			mp->name);
+		return NULL;
+	}
+	mpr->mp = mp;
+	mpr->mrs = (struct mlx5_mempool_mr *)(mpr + 1);
+	mpr->mrs_n = mrs_n;
+	return mpr;
+}
+
+/**
+ * Destroy a mempool registration object.
+ *
+ * @param standalone
+ *   Whether @p mpr owns its MRs exclusively, i.e. they are not shared.
+ */
+static void
+mlx5_mempool_reg_destroy(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mempool_reg *mpr, bool standalone)
+{
+	if (standalone) {
+		unsigned int i;
+
+		for (i = 0; i < mpr->mrs_n; i++)
+			share_cache->dereg_mr_cb(&mpr->mrs[i].pmd_mr);
+	}
+	mlx5_free(mpr);
+}
+
+/** Find registration object of a mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_lookup(struct mlx5_mr_share_cache *share_cache,
+			struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp)
+			break;
+	return mpr;
+}
+
+/** Increment reference counters of MRs used in the registration. */
+static void
+mlx5_mempool_reg_attach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		__atomic_add_fetch(&mpr->mrs[i].refcnt, 1, __ATOMIC_RELAXED);
+}
+
+/**
+ * Decrement reference counters of MRs used in the registration.
+ *
+ * @return True if no more references to @p mpr MRs exist, False otherwise.
+ */
+static bool
+mlx5_mempool_reg_detach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+	bool ret = false;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		ret |= __atomic_sub_fetch(&mpr->mrs[i].refcnt, 1,
+					  __ATOMIC_RELAXED) == 0;
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache,
+				 void *pd, struct rte_mempool *mp)
+{
+	struct mlx5_range *ranges = NULL;
+	struct mlx5_mempool_reg *mpr, *new_mpr;
+	unsigned int i, ranges_n;
+	bool share_hugepage;
+	int ret = -1;
+
+	/* Early check to avoid unnecessary creation of MRs. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+	if (mlx5_mempool_reg_analyze(mp, &ranges, &ranges_n,
+				     &share_hugepage) < 0) {
+		DRV_LOG(ERR, "Cannot get mempool %s memory ranges", mp->name);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	new_mpr = mlx5_mempool_reg_create(mp, ranges_n);
+	if (new_mpr == NULL) {
+		DRV_LOG(ERR,
+			"Cannot create a registration object for mempool %s in PD %p",
+			mp->name, pd);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	/*
+	 * If the entire mempool fits in a single hugepage, the MR for this
+	 * hugepage can be shared across mempools that also fit in it.
+	 */
+	if (share_hugepage) {
+		rte_rwlock_write_lock(&share_cache->rwlock);
+		LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next) {
+			if (mpr->mrs[0].pmd_mr.addr == (void *)ranges[0].start)
+				break;
+		}
+		if (mpr != NULL) {
+			new_mpr->mrs = mpr->mrs;
+			mlx5_mempool_reg_attach(new_mpr);
+			LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+					 new_mpr, next);
+		}
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		if (mpr != NULL) {
+			DRV_LOG(DEBUG, "Shared MR %#x in PD %p for mempool %s with mempool %s",
+				mpr->mrs[0].pmd_mr.lkey, pd, mp->name,
+				mpr->mp->name);
+			ret = 0;
+			goto exit;
+		}
+	}
+	for (i = 0; i < ranges_n; i++) {
+		struct mlx5_mempool_mr *mr = &new_mpr->mrs[i];
+		const struct mlx5_range *range = &ranges[i];
+		size_t len = range->end - range->start;
+
+		if (share_cache->reg_mr_cb(pd, (void *)range->start, len,
+		    &mr->pmd_mr) < 0) {
+			DRV_LOG(ERR,
+				"Failed to create an MR in PD %p for address range "
+				"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+				pd, range->start, range->end, len, mp->name);
+			break;
+		}
+		DRV_LOG(DEBUG,
+			"Created a new MR %#x in PD %p for address range "
+			"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+			mr->pmd_mr.lkey, pd, range->start, range->end, len,
+			mp->name);
+	}
+	if (i != ranges_n) {
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EINVAL;
+		goto exit;
+	}
+	/* Concurrent registration is not supposed to happen. */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr == NULL) {
+		mlx5_mempool_reg_attach(new_mpr);
+		LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+				 new_mpr, next);
+		ret = 0;
+	}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+exit:
+	free(ranges);
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_secondary(struct mlx5_mr_share_cache *share_cache,
+				   void *pd, struct rte_mempool *mp,
+				   struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, pd, mp, true);
+}
+
+/**
+ * Register the memory of a mempool in the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param pd
+ *   Protection domain object.
+ * @param mp
+ *   Mempool to register.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_register_primary(share_cache, pd, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_register_secondary(share_cache, pd, mp,
+							  mp_id);
+	default:
+		return -1;
+	}
+}
+
+static int
+mlx5_mr_mempool_unregister_primary(struct mlx5_mr_share_cache *share_cache,
+				   struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+	bool standalone = false;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp) {
+			LIST_REMOVE(mpr, next);
+			standalone = mlx5_mempool_reg_detach(mpr);
+			if (standalone)
+				/*
+				 * The unlock operation below provides a memory
+				 * barrier due to its store-release semantics.
+				 */
+				++share_cache->dev_gen;
+			break;
+		}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr == NULL) {
+		rte_errno = ENOENT;
+		return -1;
+	}
+	mlx5_mempool_reg_destroy(share_cache, mpr, standalone);
+	return 0;
+}
+
+static int
+mlx5_mr_mempool_unregister_secondary(struct mlx5_mr_share_cache *share_cache,
+				     struct rte_mempool *mp,
+				     struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, NULL, mp, false);
+}
+
+/**
+ * Unregister the memory of a mempool from the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param mp
+ *   Mempool to unregister.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_unregister_primary(share_cache, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_unregister_secondary(share_cache, mp,
+							    mp_id);
+	default:
+		return -1;
+	}
+}
+
+/**
+ * Look up an MR key by an address in a registered mempool.
+ *
+ * @param mpr
+ *   Mempool registration object.
+ * @param addr
+ *   Address within the mempool.
+ * @param entry
+ *   Bottom-half cache entry to fill.
+ *
+ * @return
+ *   MR key or UINT32_MAX on failure, which can only happen
+ *   if the address is not from within the mempool.
+ */
+static uint32_t
+mlx5_mempool_reg_addr2mr(struct mlx5_mempool_reg *mpr, uintptr_t addr,
+			 struct mr_cache_entry *entry)
+{
+	uint32_t lkey = UINT32_MAX;
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++) {
+		const struct mlx5_pmd_mr *mr = &mpr->mrs[i].pmd_mr;
+		uintptr_t mr_addr = (uintptr_t)mr->addr;
+
+		if (mr_addr <= addr) {
+			lkey = rte_cpu_to_be_32(mr->lkey);
+			entry->start = mr_addr;
+			entry->end = mr_addr + mr->len;
+			entry->lkey = lkey;
+			break;
+		}
+	}
+	return lkey;
+}
+
+/**
+ * Update bottom-half cache from the list of mempool registrations.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param entry
+ *   Pointer to an entry in the bottom-half cache to update
+ *   with the MR lkey looked up.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+static uint32_t
+mlx5_lookup_mempool_regs(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mr_ctrl *mr_ctrl,
+			 struct mr_cache_entry *entry,
+			 struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	struct mlx5_mempool_reg *mpr;
+	uint32_t lkey = UINT32_MAX;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in mempool registrations. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr != NULL)
+		lkey = mlx5_mempool_reg_addr2mr(mpr, addr, entry);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/*
+	 * Update local cache. Even if it fails, return the found entry
+	 * to update top-half cache. Next time, this entry will be found
+	 * in the global cache.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half lookup for the address from the mempool.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+uint32_t
+mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+		      struct mlx5_mr_ctrl *mr_ctrl,
+		      struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		lkey = mlx5_lookup_mempool_regs(share_cache, mr_ctrl, repl,
+						mp, addr);
+		/* Can only fail if the address is not from the mempool. */
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
index 6e465a05e9..685ac98e08 100644
--- a/drivers/common/mlx5/mlx5_common_mr.h
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -13,6 +13,7 @@
 
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_mbuf.h>
 #include <rte_memory.h>
 
 #include "mlx5_glue.h"
@@ -75,6 +76,7 @@ struct mlx5_mr_ctrl {
 } __rte_packed;
 
 LIST_HEAD(mlx5_mr_list, mlx5_mr);
+LIST_HEAD(mlx5_mempool_reg_list, mlx5_mempool_reg);
 
 /* Global per-device MR cache. */
 struct mlx5_mr_share_cache {
@@ -83,6 +85,7 @@ struct mlx5_mr_share_cache {
 	struct mlx5_mr_btree cache; /* Global MR cache table. */
 	struct mlx5_mr_list mr_list; /* Registered MR list. */
 	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+	struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */
 	mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
 	mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 } __rte_packed;
@@ -136,6 +139,10 @@ uint32_t mlx5_mr_addr2mr_bh(void *pd, struct mlx5_mp_id *mp_id,
 			    struct mlx5_mr_ctrl *mr_ctrl,
 			    uintptr_t addr, unsigned int mr_ext_memseg_en);
 __rte_internal
+uint32_t mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+			       struct mlx5_mr_ctrl *mr_ctrl,
+			       struct rte_mempool *mp, uintptr_t addr);
+__rte_internal
 void mlx5_mr_release_cache(struct mlx5_mr_share_cache *mr_cache);
 __rte_internal
 void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
@@ -179,4 +186,14 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr);
 __rte_internal
 void
 mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb);
+
+__rte_internal
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+__rte_internal
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+
 #endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index d3c5040aac..85100d5afb 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -152,4 +152,9 @@ INTERNAL {
 	mlx5_realloc;
 
 	mlx5_translate_port_name; # WINDOWS_NO_EXPORT
+
+	mlx5_mr_mempool_register;
+	mlx5_mr_mempool_unregister;
+	mlx5_mp_req_mempool_reg;
+	mlx5_mr_mempool2mr_bh;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v8 4/4] net/mlx5: support mempool registration
  2021-10-18 14:40             ` [dpdk-dev] [PATCH v8 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                                 ` (2 preceding siblings ...)
  2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
@ 2021-10-18 14:40               ` Dmitry Kozlyuk
  2021-10-18 22:43               ` [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 14:40 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko

When the first port in a given protection domain (PD) starts,
install a mempool event callback for this PD and register all existing
mempools with it, creating memory regions (MR) for their memory.
When the last port in a PD closes,
remove the callback and unregister all mempools for this PD.
This behavior can be switched off with a new devarg: mr_mempool_reg_en.

On the TX slow path, i.e. when an MR key for the address of the buffer
to send is not in the local cache, first try to retrieve it from
the database of registered mempools. Direct and indirect mbufs are
supported, as well as mbufs with buffers externally attached
by the MLX5 MPRQ feature.
Lookup in the database of non-mempool memory is used as the last resort.

RX mempools are registered regardless of the devarg value.
On the RX data path, only the local cache and the mempool database are used.
If implicit mempool registration is disabled, these mempools
are unregistered at port stop, releasing the MRs.
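
For reference, a hedged sketch of how an application could opt out of
implicit registration for a single device through devargs (the PCI
address below is only a placeholder):

    #include <string.h>
    #include <rte_eal.h>
    #include <rte_common.h>

    int
    main(void)
    {
        char *args[] = {
            strdup("app"), /* program name */
            strdup("-a"), strdup("0000:3b:00.0,mr_mempool_reg_en=0"),
        };

        if (rte_eal_init((int)RTE_DIM(args), args) < 0)
            return -1;
        /* ... create mempools, configure and start ports as usual ... */
        rte_eal_cleanup();
        return 0;
    }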

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  13 +++
 doc/guides/rel_notes/release_21_11.rst |   6 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 +++++++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h                |  10 ++
 drivers/net/mlx5/mlx5_mr.c             | 120 +++++--------------
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 ++--
 drivers/net/mlx5/mlx5_rxq.c            |  13 +++
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++++++++++--
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 12 files changed, 347 insertions(+), 116 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index bae73f42d8..106e32e1c4 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1001,6 +1001,19 @@ Driver options
 
   Enabled by default.
 
+- ``mr_mempool_reg_en`` parameter [int]
+
+  A nonzero value enables implicit registration of DMA memory of all mempools
+  except those having ``MEMPOOL_F_NON_IO``. This flag is set automatically
+  for mempools populated with non-contiguous objects or those without IOVA.
+  The effect is that when a packet from a mempool is transmitted,
+  its memory is already registered for DMA in the PMD and no registration
+  will happen on the data path. The tradeoff is extra work on the creation
+  of each mempool and increased HW resource use if some mempools
+  are not used with MLX5 devices.
+
+  Enabled by default.
+
 - ``representor`` parameter [list]
 
   This parameter can be used to instantiate DPDK Ethernet devices from
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index f6bb5adeff..4d3374f7e7 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -167,6 +167,12 @@ New Features
   * Added tests to verify tunnel header verification in IPsec inbound.
   * Added tests to verify inner checksum.
 
+* **Updated Mellanox mlx5 driver.**
+
+  Updated the Mellanox mlx5 driver with new features and improvements, including:
+
+  * Added implicit mempool registration to avoid data path hiccups (opt-out).
+
 
 Removed Items
 -------------
diff --git a/drivers/net/mlx5/linux/mlx5_mp_os.c b/drivers/net/mlx5/linux/mlx5_mp_os.c
index 3a4aa766f8..d2ac375a47 100644
--- a/drivers/net/mlx5/linux/mlx5_mp_os.c
+++ b/drivers/net/mlx5/linux/mlx5_mp_os.c
@@ -20,6 +20,45 @@
 #include "mlx5_tx.h"
 #include "mlx5_utils.h"
 
+/**
+ * Handle a port-agnostic message.
+ *
+ * @return
+ *   0 on success, 1 when message is not port-agnostic, (-1) on error.
+ */
+static int
+mlx5_mp_os_handle_port_agnostic(const struct rte_mp_msg *mp_msg,
+				const void *peer)
+{
+	struct rte_mp_msg mp_res;
+	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
+	const struct mlx5_mp_param *param =
+		(const struct mlx5_mp_param *)mp_msg->param;
+	const struct mlx5_mp_arg_mempool_reg *mpr;
+	struct mlx5_mp_id mp_id;
+
+	switch (param->type) {
+	case MLX5_MP_REQ_MEMPOOL_REGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_register(mpr->share_cache,
+						       mpr->pd, mpr->mempool,
+						       NULL);
+		return rte_mp_reply(&mp_res, peer);
+	case MLX5_MP_REQ_MEMPOOL_UNREGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_unregister(mpr->share_cache,
+							 mpr->mempool, NULL);
+		return rte_mp_reply(&mp_res, peer);
+	default:
+		return 1;
+	}
+	return -1;
+}
+
 int
 mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
@@ -34,6 +73,11 @@ mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/* Port-agnostic messages. */
+	ret = mlx5_mp_os_handle_port_agnostic(mp_msg, peer);
+	if (ret <= 0)
+		return ret;
+	/* Port-specific messages. */
 	if (!rte_eth_dev_is_valid_port(param->port_id)) {
 		rte_errno = ENODEV;
 		DRV_LOG(ERR, "port %u invalid port ID", param->port_id);
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 3746057673..e036ed1435 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1034,8 +1034,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
-		mp_id.port_id = eth_dev->data->port_id;
-		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+		mlx5_mp_id_init(&mp_id, eth_dev->data->port_id);
 		/* Receive command fd from primary process */
 		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
@@ -2133,6 +2132,7 @@ mlx5_os_config_default(struct mlx5_dev_config *config)
 	config->txqs_inline = MLX5_ARG_UNSET;
 	config->vf_nl_en = 1;
 	config->mr_ext_memseg_en = 1;
+	config->mr_mempool_reg_en = 1;
 	config->mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	config->mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	config->dv_esw_en = 1;
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 45ccfe2784..1e1b8b736b 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -181,6 +181,9 @@
 /* Device parameter to configure allow or prevent duplicate rules pattern. */
 #define MLX5_ALLOW_DUPLICATE_PATTERN "allow_duplicate_pattern"
 
+/* Device parameter to configure implicit registration of mempool memory. */
+#define MLX5_MR_MEMPOOL_REG_EN "mr_mempool_reg_en"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1088,6 +1091,141 @@ mlx5_alloc_rxtx_uars(struct mlx5_dev_ctx_shared *sh,
 	return err;
 }
 
+/**
+ * Unregister the mempool from the protection domain.
+ *
+ * @param sh
+ *   Pointer to the device shared context.
+ * @param mp
+ *   Mempool being unregistered.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister(struct mlx5_dev_ctx_shared *sh,
+				       struct rte_mempool *mp)
+{
+	struct mlx5_mp_id mp_id;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	if (mlx5_mr_mempool_unregister(&sh->share_cache, mp, &mp_id) < 0)
+		DRV_LOG(WARNING, "Failed to unregister mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to register mempools
+ * for the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_register_cb(struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+	int ret;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	ret = mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp, &mp_id);
+	if (ret < 0 && rte_errno != EEXIST)
+		DRV_LOG(ERR, "Failed to register existing mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to unregister mempools
+ * from the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister_cb(struct rte_mempool *mp, void *arg)
+{
+	mlx5_dev_ctx_shared_mempool_unregister
+				((struct mlx5_dev_ctx_shared *)arg, mp);
+}
+
+/**
+ * Mempool life cycle callback for Ethernet devices.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   Associated mempool.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_event_cb(enum rte_mempool_event event,
+				     struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+
+	switch (event) {
+	case RTE_MEMPOOL_EVENT_READY:
+		mlx5_mp_id_init(&mp_id, 0);
+		if (mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp,
+					     &mp_id) < 0)
+			DRV_LOG(ERR, "Failed to register new mempool %s for PD %p: %s",
+				mp->name, sh->pd, rte_strerror(rte_errno));
+		break;
+	case RTE_MEMPOOL_EVENT_DESTROY:
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+		break;
+	}
+}
+
+/**
+ * Callback used when implicit mempool registration is disabled
+ * in order to track Rx mempool destruction.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   An Rx mempool registered explicitly when the port is started.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_rx_mempool_event_cb(enum rte_mempool_event event,
+					struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+
+	if (event == RTE_MEMPOOL_EVENT_DESTROY)
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+}
+
+int
+mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = priv->sh;
+	int ret;
+
+	/* Check if we only need to track Rx mempool destruction. */
+	if (!priv->config.mr_mempool_reg_en) {
+		ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+		return ret == 0 || rte_errno == EEXIST ? 0 : ret;
+	}
+	/* Callback for this shared context may be already registered. */
+	ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret != 0 && rte_errno != EEXIST)
+		return ret;
+	/* Register mempools only once for this shared context. */
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_register_cb, sh);
+	return 0;
+}
+
 /**
  * Allocate shared device context. If there is multiport device the
  * master and representors will share this context, if there is single
@@ -1287,6 +1425,8 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 void
 mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 {
+	int ret;
+
 	pthread_mutex_lock(&mlx5_dev_ctx_list_mutex);
 #ifdef RTE_LIBRTE_MLX5_DEBUG
 	/* Check the object presence in the list. */
@@ -1307,6 +1447,15 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	if (--sh->refcnt)
 		goto exit;
+	/* Stop watching for mempool events and unregister all mempools. */
+	ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret < 0 && rte_errno == ENOENT)
+		ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_unregister_cb,
+				 sh);
 	/* Remove from memory callback device list. */
 	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
 	LIST_REMOVE(sh, mem_event_cb);
@@ -1997,6 +2146,8 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
 		config->decap_en = !!tmp;
 	} else if (strcmp(MLX5_ALLOW_DUPLICATE_PATTERN, key) == 0) {
 		config->allow_duplicate_pattern = !!tmp;
+	} else if (strcmp(MLX5_MR_MEMPOOL_REG_EN, key) == 0) {
+		config->mr_mempool_reg_en = !!tmp;
 	} else {
 		DRV_LOG(WARNING, "%s: unknown parameter", key);
 		rte_errno = EINVAL;
@@ -2058,6 +2209,7 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
 		MLX5_SYS_MEM_EN,
 		MLX5_DECAP_EN,
 		MLX5_ALLOW_DUPLICATE_PATTERN,
+		MLX5_MR_MEMPOOL_REG_EN,
 		NULL,
 	};
 	struct rte_kvargs *kvlist;
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 3581414b78..fe533fcc81 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -155,6 +155,13 @@ struct mlx5_flow_dump_ack {
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
+/** Initialize a multi-process ID. */
+static inline void
+mlx5_mp_id_init(struct mlx5_mp_id *mp_id, uint16_t port_id)
+{
+	mp_id->port_id = port_id;
+	strlcpy(mp_id->name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+}
 
 LIST_HEAD(mlx5_dev_list, mlx5_dev_ctx_shared);
 
@@ -270,6 +277,8 @@ struct mlx5_dev_config {
 	unsigned int dv_miss_info:1; /* restore packet after partial hw miss */
 	unsigned int allow_duplicate_pattern:1;
 	/* Allow/Prevent the duplicate rules pattern. */
+	unsigned int mr_mempool_reg_en:1;
+	/* Allow/prevent implicit mempool memory registration. */
 	struct {
 		unsigned int enabled:1; /* Whether MPRQ is enabled. */
 		unsigned int stride_num_n; /* Number of strides. */
@@ -1498,6 +1507,7 @@ struct mlx5_dev_ctx_shared *
 mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 			   const struct mlx5_dev_config *config);
 void mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh);
+int mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev);
 void mlx5_free_table_hash_list(struct mlx5_priv *priv);
 int mlx5_alloc_table_hash_list(struct mlx5_priv *priv);
 void mlx5_set_min_inline(struct mlx5_dev_spawn_data *spawn,
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 44afda731f..55d27b50b9 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -65,30 +65,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Bottom-half of LKey search on Rx.
- *
- * @param rxq
- *   Pointer to Rx queue structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-uint32_t
-mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
-{
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
-	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
-	struct mlx5_priv *priv = rxq_ctrl->priv;
-
-	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, mr_ctrl, addr,
-				  priv->config.mr_ext_memseg_en);
-}
-
 /**
  * Bottom-half of LKey search on Tx.
  *
@@ -128,9 +104,36 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 uint32_t
 mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 {
+	struct mlx5_txq_ctrl *txq_ctrl =
+		container_of(txq, struct mlx5_txq_ctrl, txq);
+	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
+	struct mlx5_priv *priv = txq_ctrl->priv;
 	uintptr_t addr = (uintptr_t)mb->buf_addr;
 	uint32_t lkey;
 
+	if (priv->config.mr_mempool_reg_en) {
+		struct rte_mempool *mp = NULL;
+		struct mlx5_mprq_buf *buf;
+
+		if (!RTE_MBUF_HAS_EXTBUF(mb)) {
+			mp = mlx5_mb2mp(mb);
+		} else if (mb->shinfo->free_cb == mlx5_mprq_buf_free_cb) {
+			/* Recover MPRQ mempool. */
+			buf = mb->shinfo->fcb_opaque;
+			mp = buf->mp;
+		}
+		if (mp != NULL) {
+			lkey = mlx5_mr_mempool2mr_bh(&priv->sh->share_cache,
+						     mr_ctrl, mp, addr);
+			/*
+			 * Lookup can only fail on invalid input, e.g. "addr"
+			 * is not from "mp" or "mp" has MEMPOOL_F_NON_IO set.
+			 */
+			if (lkey != UINT32_MAX)
+				return lkey;
+		}
+		/* Fall back to the generic mechanism in corner cases. */
+	}
 	lkey = mlx5_tx_addr2mr_bh(txq, addr);
 	if (lkey == UINT32_MAX && rte_errno == ENXIO) {
 		/* Mempool may have externally allocated memory. */
@@ -392,72 +395,3 @@ mlx5_tx_update_ext_mp(struct mlx5_txq_data *txq, uintptr_t addr,
 	mlx5_mr_update_ext_mp(ETH_DEV(priv), mr_ctrl, mp);
 	return mlx5_tx_addr2mr_bh(txq, addr);
 }
-
-/* Called during rte_mempool_mem_iter() by mlx5_mr_update_mp(). */
-static void
-mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
-		     struct rte_mempool_memhdr *memhdr,
-		     unsigned mem_idx __rte_unused)
-{
-	struct mr_update_mp_data *data = opaque;
-	struct rte_eth_dev *dev = data->dev;
-	struct mlx5_priv *priv = dev->data->dev_private;
-
-	uint32_t lkey;
-
-	/* Stop iteration if failed in the previous walk. */
-	if (data->ret < 0)
-		return;
-	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr,
-				  priv->config.mr_ext_memseg_en);
-	if (lkey == UINT32_MAX)
-		data->ret = -1;
-}
-
-/**
- * Register entire memory chunks in a Mempool.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param mp
- *   Pointer to registering Mempool.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-int
-mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		  struct rte_mempool *mp)
-{
-	struct mr_update_mp_data data = {
-		.dev = dev,
-		.mr_ctrl = mr_ctrl,
-		.ret = 0,
-	};
-	uint32_t flags = rte_pktmbuf_priv_flags(mp);
-
-	if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
-		/*
-		 * The pinned external buffer should be registered for DMA
-		 * operations by application. The mem_list of the pool contains
-		 * the list of chunks with mbuf structures w/o built-in data
-		 * buffers and DMA actually does not happen there, no need
-		 * to create MR for these chunks.
-		 */
-		return 0;
-	}
-	DRV_LOG(DEBUG, "Port %u Rx queue registering mp %s "
-		       "having %u chunks.", dev->data->port_id,
-		       mp->name, mp->nb_mem_chunks);
-	rte_mempool_mem_iter(mp, mlx5_mr_update_mp_cb, &data);
-	if (data.ret < 0 && rte_errno == ENXIO) {
-		/* Mempool may have externally allocated memory. */
-		return mlx5_mr_update_ext_mp(dev, mr_ctrl, mp);
-	}
-	return data.ret;
-}
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 4a7fab6df2..c984e777b5 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -22,7 +22,5 @@
 
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
-int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		      struct rte_mempool *mp);
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 2b7ad3e48b..1b00076fe7 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -275,13 +275,11 @@ uint16_t mlx5_rx_burst_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 				uint16_t pkts_n);
 
-/* mlx5_mr.c */
-
-uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
+static int mlx5_rxq_mprq_enabled(struct mlx5_rxq_data *rxq);
 
 /**
- * Query LKey from a packet buffer for Rx. No need to flush local caches for Rx
- * as mempool is pre-configured and static.
+ * Query LKey from a packet buffer for Rx. No need to flush local caches
+ * as the Rx mempool database entries are valid for the lifetime of the queue.
  *
  * @param rxq
  *   Pointer to Rx queue structure.
@@ -290,11 +288,14 @@ uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
  *
  * @return
  *   Searched LKey on success, UINT32_MAX on no match.
+ *   This function always succeeds on valid input.
  */
 static __rte_always_inline uint32_t
 mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 {
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
+	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct rte_mempool *mp;
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
@@ -302,8 +303,14 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
-	/* Take slower bottom-half (Binary Search) on miss. */
-	return mlx5_rx_addr2mr_bh(rxq, addr);
+	/*
+	 * Slower search in the mempool database on miss.
+	 * During queue creation rxq->sh is not yet set, so we use rxq_ctrl.
+	 */
+	rxq_ctrl = container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	mp = mlx5_rxq_mprq_enabled(rxq) ? rxq->mprq_mp : rxq->mp;
+	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->priv->sh->share_cache,
+				     mr_ctrl, mp, addr);
 }
 
 #define mlx5_rx_mb2mr(rxq, mb) mlx5_rx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index b68443bed5..247f36e5d7 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1162,6 +1162,7 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 	unsigned int strd_sz_n = 0;
 	unsigned int i;
 	unsigned int n_ibv = 0;
+	int ret;
 
 	if (!mlx5_mprq_enabled(dev))
 		return 0;
@@ -1241,6 +1242,16 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
+	ret = mlx5_mr_mempool_register(&priv->sh->share_cache, priv->sh->pd,
+				       mp, &priv->mp_id);
+	if (ret < 0 && rte_errno != EEXIST) {
+		ret = rte_errno;
+		DRV_LOG(ERR, "port %u failed to register a mempool for Multi-Packet RQ",
+			dev->data->port_id);
+		rte_mempool_free(mp);
+		rte_errno = ret;
+		return -rte_errno;
+	}
 	priv->mprq_mp = mp;
 exit:
 	/* Set mempool for each Rx queue. */
@@ -1443,6 +1454,8 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		/* rte_errno is already set. */
 		goto error;
 	}
+	/* Rx queues don't use this pointer, but we want a valid structure. */
+	tmpl->rxq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	tmpl->socket = socket;
 	if (dev->data->dev_conf.intr_conf.rxq)
 		tmpl->irq = 1;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 54173bfacb..3cbf5816a1 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -105,6 +105,59 @@ mlx5_txq_start(struct rte_eth_dev *dev)
 	return -rte_errno;
 }
 
+/**
+ * Translate the chunk address to an MR key in order to put it into the cache.
+ */
+static void
+mlx5_rxq_mempool_register_cb(struct rte_mempool *mp, void *opaque,
+			     struct rte_mempool_memhdr *memhdr,
+			     unsigned int idx)
+{
+	struct mlx5_rxq_data *rxq = opaque;
+
+	RTE_SET_USED(mp);
+	RTE_SET_USED(idx);
+	mlx5_rx_addr2mr(rxq, (uintptr_t)memhdr->addr);
+}
+
+/**
+ * Register Rx queue mempools and fill the Rx queue cache.
+ * This function tolerates repeated mempool registration.
+ *
+ * @param[in] rxq_ctrl
+ *   Rx queue control data.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+static int
+mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	struct mlx5_priv *priv = rxq_ctrl->priv;
+	struct rte_mempool *mp;
+	uint32_t s;
+	int ret = 0;
+
+	mlx5_mr_flush_local_cache(&rxq_ctrl->rxq.mr_ctrl);
+	/* MPRQ mempool is registered on creation, just fill the cache. */
+	if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
+		rte_mempool_mem_iter(rxq_ctrl->rxq.mprq_mp,
+				     mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+		return 0;
+	}
+	for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++) {
+		mp = rxq_ctrl->rxq.rxseg[s].mp;
+		ret = mlx5_mr_mempool_register(&priv->sh->share_cache,
+					       priv->sh->pd, mp, &priv->mp_id);
+		if (ret < 0 && rte_errno != EEXIST)
+			return ret;
+		rte_mempool_mem_iter(mp, mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+	}
+	return 0;
+}
+
 /**
  * Stop traffic on Rx queues.
  *
@@ -152,18 +205,13 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (!rxq_ctrl)
 			continue;
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-			/* Pre-register Rx mempools. */
-			if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
-				mlx5_mr_update_mp(dev, &rxq_ctrl->rxq.mr_ctrl,
-						  rxq_ctrl->rxq.mprq_mp);
-			} else {
-				uint32_t s;
-
-				for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++)
-					mlx5_mr_update_mp
-						(dev, &rxq_ctrl->rxq.mr_ctrl,
-						rxq_ctrl->rxq.rxseg[s].mp);
-			}
+			/*
+			 * Pre-register the mempools. Regardless of whether
+			 * the implicit registration is enabled or not,
+			 * Rx mempool destruction is tracked to free MRs.
+			 */
+			if (mlx5_rxq_mempool_register(rxq_ctrl) < 0)
+				goto error;
 			ret = rxq_alloc_elts(rxq_ctrl);
 			if (ret)
 				goto error;
@@ -1124,6 +1172,11 @@ mlx5_dev_start(struct rte_eth_dev *dev)
 			dev->data->port_id, strerror(rte_errno));
 		goto error;
 	}
+	if (mlx5_dev_ctx_shared_mempool_subscribe(dev) != 0) {
+		DRV_LOG(ERR, "port %u failed to subscribe for mempool life cycle: %s",
+			dev->data->port_id, rte_strerror(rte_errno));
+		goto error;
+	}
 	rte_wmb();
 	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
 	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
diff --git a/drivers/net/mlx5/windows/mlx5_os.c b/drivers/net/mlx5/windows/mlx5_os.c
index 26fa927039..149253d174 100644
--- a/drivers/net/mlx5/windows/mlx5_os.c
+++ b/drivers/net/mlx5/windows/mlx5_os.c
@@ -1116,6 +1116,7 @@ mlx5_os_net_probe(struct rte_device *dev)
 	dev_config.txqs_inline = MLX5_ARG_UNSET;
 	dev_config.vf_nl_en = 0;
 	dev_config.mr_ext_memseg_en = 1;
+	dev_config.mr_mempool_reg_en = 1;
 	dev_config.mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	dev_config.mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	dev_config.dv_esw_en = 0;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit mempool registration
  2021-10-18 14:40             ` [dpdk-dev] [PATCH v8 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                                 ` (3 preceding siblings ...)
  2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
@ 2021-10-18 22:43               ` Dmitry Kozlyuk
  2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 1/4] mempool: add event callbacks Dmitry Kozlyuk
                                   ` (4 more replies)
  4 siblings, 5 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 22:43 UTC (permalink / raw)
  To: dev

MLX5 hardware has its internal IOMMU where PMD registers the memory.
On the data path, PMD translates VA into a key consumed by the device
IOMMU.  It is impractical for the PMD to register all allocated memory
because of increased lookup cost both in HW and SW.  Most often mbuf
memory comes from mempools, so if PMD tracks them, it can almost always
have mbuf memory registered before an mbuf hits the PMD. This patchset
adds such tracking in the PMD and internal API to support it.

Please see [1] for the discussion of patch 2/4
and how it can be useful outside of the MLX5 PMD.

[1]: http://inbox.dpdk.org/dev/CH0PR12MB509112FADB778AB28AF3771DB9F99@CH0PR12MB5091.namprd12.prod.outlook.com/

v9 (CI):
    Fix another failure of mempool_autotest.
    Use negative result of rte_mempool_populate_iova()
    to report errors in unit tests instead of rte_errno.
v8:
    Fix mempool_autotest failure on Ubuntu 18.04 (CI).
v7 (internal CI):
    1. Fix unit test compilation issues with GCC.
    2. Keep rte_mempool_event description non-internal: Doxygen treats
       it as not documented otherwise, "doc" target fails.
v6:
    Fix compilation issue in proc-info (CI).
v5:
    1. Change non-IO flag inference + various fixes (Andrew).
    2. Fix callback unregistration from secondary processes (Olivier).
    3. Support non-IO flag in proc-dump (David).
    4. Fix the usage of locks (Olivier).
    5. Avoid resource leaks in unit test (Olivier).
v4: (Andrew)
    1. Improve mempool event callbacks unit tests and documentation.
    2. Make MEMPOOL_F_NON_IO internal and automatically inferred.
       Add unit tests for the inference logic.
v3: Improve wording and naming; fix typos (Thomas).
v2 (internal review and testing):
    1. Change tracked mempool event from being created (CREATE) to being
       fully populated (READY), which is the state PMD is interested in.
    2. Unit test the new mempool callback API.
    3. Remove bogus "error" messages in normal conditions.
    4. Fixes in PMD.

Dmitry Kozlyuk (4):
  mempool: add event callbacks
  mempool: add non-IO flag
  common/mlx5: add mempool registration facilities
  net/mlx5: support mempool registration

 app/proc-info/main.c                   |   6 +-
 app/test/test_mempool.c                | 363 ++++++++++++++++
 doc/guides/nics/mlx5.rst               |  13 +
 doc/guides/rel_notes/release_21_11.rst |   9 +
 drivers/common/mlx5/mlx5_common_mp.c   |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h   |  14 +
 drivers/common/mlx5/mlx5_common_mr.c   | 580 +++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h   |  17 +
 drivers/common/mlx5/version.map        |   5 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 ++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++
 drivers/net/mlx5/mlx5.h                |  10 +
 drivers/net/mlx5/mlx5_mr.c             | 120 ++---
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 +-
 drivers/net/mlx5/mlx5_rxq.c            |  13 +
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++-
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 lib/mempool/rte_mempool.c              | 134 ++++++
 lib/mempool/rte_mempool.h              |  64 +++
 lib/mempool/version.map                |   8 +
 22 files changed, 1589 insertions(+), 118 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v9 1/4] mempool: add event callbacks
  2021-10-18 22:43               ` [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit " Dmitry Kozlyuk
@ 2021-10-18 22:43                 ` Dmitry Kozlyuk
  2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 2/4] mempool: add non-IO flag Dmitry Kozlyuk
                                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 22:43 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Andrew Rybchenko, Olivier Matz, Ray Kinsella

Data path performance can benefit if the PMD knows which memory it will
need to handle in advance, before the first mbuf is sent to the PMD.
It is impractical, however, to consider all allocated memory for this
purpose. Most often mbuf memory comes from mempools that can come and
go. PMD can enumerate existing mempools on device start, but it also
needs to track creation and destruction of mempools after the forwarding
starts but before an mbuf from the new mempool is sent to the device.

Add an API to register callback for mempool life cycle events:
* rte_mempool_event_callback_register()
* rte_mempool_event_callback_unregister()
Currently tracked events are:
* RTE_MEMPOOL_EVENT_READY (after populating a mempool)
* RTE_MEMPOOL_EVENT_DESTROY (before freeing a mempool)
Provide a unit test for the new API.
The new API is internal because it is primarily demanded by PMDs,
which may need to deal with any mempool without controlling its
creation, while an application, on the other hand, knows which
mempools it creates and does not care about internal mempools
that PMDs might create.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
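Note for reviewers, not part of the commit message: below is a minimal
sketch of how a class driver could consume the new internal API. The
my_drv_* names and the logging bodies are hypothetical; only the
rte_mempool_* symbols come from this patch.

#include <stdio.h>

#include <rte_errno.h>
#include <rte_mempool.h>

/* Hypothetical driver context. */
struct my_drv_ctx {
	const char *name;
};

/* Matches the rte_mempool_event_callback typedef added by this patch. */
static void
my_drv_mempool_event_cb(enum rte_mempool_event event,
			struct rte_mempool *mp, void *arg)
{
	struct my_drv_ctx *ctx = arg;

	if (event == RTE_MEMPOOL_EVENT_READY)
		printf("%s: mempool %s fully populated\n", ctx->name, mp->name);
	else if (event == RTE_MEMPOOL_EVENT_DESTROY)
		printf("%s: mempool %s about to be freed\n", ctx->name, mp->name);
}

/* Called from a hypothetical device-start path. */
static int
my_drv_subscribe(struct my_drv_ctx *ctx)
{
	int ret;

	ret = rte_mempool_event_callback_register(my_drv_mempool_event_cb, ctx);
	/* Registering the same (func, user_data) pair again reports EEXIST. */
	if (ret != 0 && rte_errno != EEXIST)
		return ret;
	return 0;
}
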
 app/test/test_mempool.c   | 248 ++++++++++++++++++++++++++++++++++++++
 lib/mempool/rte_mempool.c | 124 +++++++++++++++++++
 lib/mempool/rte_mempool.h |  62 ++++++++++
 lib/mempool/version.map   |   8 ++
 4 files changed, 442 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index 66bc8d86b7..5339a4cbd8 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -14,6 +14,7 @@
 #include <rte_common.h>
 #include <rte_log.h>
 #include <rte_debug.h>
+#include <rte_errno.h>
 #include <rte_memory.h>
 #include <rte_launch.h>
 #include <rte_cycles.h>
@@ -489,6 +490,245 @@ test_mp_mem_init(struct rte_mempool *mp,
 	data->ret = 0;
 }
 
+struct test_mempool_events_data {
+	struct rte_mempool *mp;
+	enum rte_mempool_event event;
+	bool invoked;
+};
+
+static void
+test_mempool_events_cb(enum rte_mempool_event event,
+		       struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_data *data = user_data;
+
+	data->mp = mp;
+	data->event = event;
+	data->invoked = true;
+}
+
+static int
+test_mempool_events(int (*populate)(struct rte_mempool *mp))
+{
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { goto fail; } while (0)
+
+	static const size_t CB_NUM = 3;
+	static const size_t MP_NUM = 2;
+
+	struct test_mempool_events_data data[CB_NUM];
+	struct rte_mempool *mp[MP_NUM], *freed;
+	char name[RTE_MEMPOOL_NAMESIZE];
+	size_t i, j;
+	int ret;
+
+	memset(mp, 0, sizeof(mp));
+	for (i = 0; i < CB_NUM; i++) {
+		ret = rte_mempool_event_callback_register
+				(test_mempool_events_cb, &data[i]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to register the callback %zu: %s",
+				      i, rte_strerror(rte_errno));
+	}
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb, mp);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback");
+	/* NULL argument has no special meaning in this API. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    NULL);
+	RTE_TEST_ASSERT_NOT_EQUAL(ret, 0, "Unregistered a non-registered callback with NULL argument");
+
+	/* Create mempool 0 that will be observed by all callbacks. */
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty0");
+	mp[0] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[0], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	for (j = 0; j < CB_NUM; j++)
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, false,
+				      "Callback %zu invoked on %s mempool creation",
+				      j, name);
+
+	rte_mempool_set_ops_byname(mp[0], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[0]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(-ret));
+	for (j = 0; j < CB_NUM; j++) {
+		RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					"Callback %zu not invoked on mempool %s population",
+					j, name);
+		RTE_TEST_ASSERT_EQUAL(data[j].event,
+					RTE_MEMPOOL_EVENT_READY,
+					"Wrong callback invoked, expected READY");
+		RTE_TEST_ASSERT_EQUAL(data[j].mp, mp[0],
+					"Callback %zu invoked for a wrong mempool instead of %s",
+					j, name);
+	}
+
+	/* Check that unregistered callback 0 observes no events. */
+	ret = rte_mempool_event_callback_unregister(test_mempool_events_cb,
+						    &data[0]);
+	RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister callback 0: %s",
+			      rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	strcpy(name, "empty1");
+	mp[1] = rte_mempool_create_empty(name, MEMPOOL_SIZE,
+					 MEMPOOL_ELT_SIZE, 0, 0,
+					 SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp[1], "Cannot create mempool %s: %s",
+				 name, rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp[1], rte_mbuf_best_mempool_ops(), NULL);
+	ret = populate(mp[1]);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp[1]->size, "Failed to populate mempool %s: %s",
+			      name, rte_strerror(-ret));
+	RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+			      "Unregistered callback 0 invoked on %s mempool population",
+			      name);
+
+	for (i = 0; i < MP_NUM; i++) {
+		memset(&data, 0, sizeof(data));
+		sprintf(name, "empty%zu", i);
+		rte_mempool_free(mp[i]);
+		/*
+		 * Save pointer to check that it was passed to the callback,
+		 * but put NULL into the array in case cleanup is called early.
+		 */
+		freed = mp[i];
+		mp[i] = NULL;
+		for (j = 1; j < CB_NUM; j++) {
+			RTE_TEST_ASSERT_EQUAL(data[j].invoked, true,
+					      "Callback %zu not invoked on mempool %s destruction",
+					      j, name);
+			RTE_TEST_ASSERT_EQUAL(data[j].event,
+					      RTE_MEMPOOL_EVENT_DESTROY,
+					      "Wrong callback invoked, expected DESTROY");
+			RTE_TEST_ASSERT_EQUAL(data[j].mp, freed,
+					      "Callback %zu invoked for a wrong mempool instead of %s",
+					      j, name);
+		}
+		RTE_TEST_ASSERT_EQUAL(data[0].invoked, false,
+				      "Unregistered callback 0 invoked on %s mempool destruction",
+				      name);
+	}
+
+	for (j = 1; j < CB_NUM; j++) {
+		ret = rte_mempool_event_callback_unregister
+					(test_mempool_events_cb, &data[j]);
+		RTE_TEST_ASSERT_EQUAL(ret, 0, "Failed to unregister the callback %zu: %s",
+				      j, rte_strerror(rte_errno));
+	}
+	return TEST_SUCCESS;
+
+fail:
+	for (j = 0; j < CB_NUM; j++)
+		rte_mempool_event_callback_unregister
+					(test_mempool_events_cb, &data[j]);
+	for (i = 0; i < MP_NUM; i++)
+		rte_mempool_free(mp[i]);
+	return TEST_FAILED;
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+}
+
+struct test_mempool_events_safety_data {
+	bool invoked;
+	int (*api_func)(rte_mempool_event_callback *func, void *user_data);
+	rte_mempool_event_callback *cb_func;
+	void *cb_user_data;
+	int ret;
+};
+
+static void
+test_mempool_events_safety_cb(enum rte_mempool_event event,
+			      struct rte_mempool *mp, void *user_data)
+{
+	struct test_mempool_events_safety_data *data = user_data;
+
+	RTE_SET_USED(event);
+	RTE_SET_USED(mp);
+	data->invoked = true;
+	data->ret = data->api_func(data->cb_func, data->cb_user_data);
+}
+
+static int
+test_mempool_events_safety(void)
+{
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { \
+		ret = TEST_FAILED; \
+		goto exit; \
+	} while (0)
+
+	struct test_mempool_events_data data;
+	struct test_mempool_events_safety_data sdata[2];
+	struct rte_mempool *mp;
+	size_t i;
+	int ret;
+
+	/* removes itself */
+	sdata[0].api_func = rte_mempool_event_callback_unregister;
+	sdata[0].cb_func = test_mempool_events_safety_cb;
+	sdata[0].cb_user_data = &sdata[0];
+	sdata[0].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[0]);
+	/* inserts a callback after itself */
+	sdata[1].api_func = rte_mempool_event_callback_register;
+	sdata[1].cb_func = test_mempool_events_cb;
+	sdata[1].cb_user_data = &data;
+	sdata[1].ret = -1;
+	rte_mempool_event_callback_register(test_mempool_events_safety_cb,
+					    &sdata[1]);
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	memset(&data, 0, sizeof(data));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(-ret));
+
+	RTE_TEST_ASSERT_EQUAL(sdata[0].ret, 0, "Callback failed to unregister itself: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, 0, "Failed to insert a new callback: %s",
+			      rte_strerror(rte_errno));
+	RTE_TEST_ASSERT_EQUAL(data.invoked, false,
+			      "Inserted callback is invoked on mempool population");
+
+	memset(&data, 0, sizeof(data));
+	sdata[0].invoked = false;
+	rte_mempool_free(mp);
+	mp = NULL;
+	RTE_TEST_ASSERT_EQUAL(sdata[0].invoked, false,
+			      "Callback that unregistered itself was called");
+	RTE_TEST_ASSERT_EQUAL(sdata[1].ret, -EEXIST,
+			      "New callback inserted twice");
+	RTE_TEST_ASSERT_EQUAL(data.invoked, true,
+			      "Inserted callback is not invoked on mempool destruction");
+
+	rte_mempool_event_callback_unregister(test_mempool_events_cb, &data);
+	for (i = 0; i < RTE_DIM(sdata); i++)
+		rte_mempool_event_callback_unregister
+				(test_mempool_events_safety_cb, &sdata[i]);
+	ret = TEST_SUCCESS;
+
+exit:
+	/* cleanup, don't care which callbacks are already removed */
+	rte_mempool_event_callback_unregister(test_mempool_events_cb, &data);
+	for (i = 0; i < RTE_DIM(sdata); i++)
+		rte_mempool_event_callback_unregister
+				(test_mempool_events_safety_cb, &sdata[i]);
+	/* in case of failure before the planned destruction */
+	rte_mempool_free(mp);
+	return ret;
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+}
+
 static int
 test_mempool(void)
 {
@@ -666,6 +906,14 @@ test_mempool(void)
 	if (test_mempool_basic(default_pool, 1) < 0)
 		GOTO_ERR(ret, err);
 
+	/* test mempool event callbacks */
+	if (test_mempool_events(rte_mempool_populate_default) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events(rte_mempool_populate_anon) < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_events_safety() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 607419ccaf..8810d08ab5 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -42,6 +42,18 @@ static struct rte_tailq_elem rte_mempool_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_mempool_tailq)
 
+TAILQ_HEAD(mempool_callback_list, rte_tailq_entry);
+
+static struct rte_tailq_elem callback_tailq = {
+	.name = "RTE_MEMPOOL_CALLBACK",
+};
+EAL_REGISTER_TAILQ(callback_tailq)
+
+/* Invoke all registered mempool event callbacks. */
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp);
+
 #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
 #define CALC_CACHE_FLUSHTHRESH(c)	\
 	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
@@ -360,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* Report the mempool as ready only when fully populated. */
+	if (mp->populated_size >= mp->size)
+		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
+
 	rte_mempool_trace_populate_iova(mp, vaddr, iova, len, free_cb, opaque);
 	return i;
 
@@ -722,6 +738,7 @@ rte_mempool_free(struct rte_mempool *mp)
 	}
 	rte_mcfg_tailq_write_unlock();
 
+	mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_DESTROY, mp);
 	rte_mempool_trace_free(mp);
 	rte_mempool_free_memchunks(mp);
 	rte_mempool_ops_free(mp);
@@ -1356,3 +1373,110 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
 
 	rte_mcfg_mempool_read_unlock();
 }
+
+struct mempool_callback_data {
+	rte_mempool_event_callback *func;
+	void *user_data;
+};
+
+static void
+mempool_event_callback_invoke(enum rte_mempool_event event,
+			      struct rte_mempool *mp)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te;
+	void *tmp_te;
+
+	rte_mcfg_tailq_read_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		struct mempool_callback_data *cb = te->data;
+		rte_mcfg_tailq_read_unlock();
+		cb->func(event, mp, cb->user_data);
+		rte_mcfg_tailq_read_lock();
+	}
+	rte_mcfg_tailq_read_unlock();
+}
+
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback_data *cb;
+	void *tmp_te;
+	int ret;
+
+	if (func == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	rte_mcfg_tailq_write_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	RTE_TAILQ_FOREACH_SAFE(te, list, next, tmp_te) {
+		cb = te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			ret = -EEXIST;
+			goto exit;
+		}
+	}
+
+	te = rte_zmalloc("mempool_cb_tail_entry", sizeof(*te), 0);
+	if (te == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback tailq entry!\n");
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb = rte_malloc("mempool_cb_data", sizeof(*cb), 0);
+	if (cb == NULL) {
+		RTE_LOG(ERR, MEMPOOL,
+			"Cannot allocate event callback!\n");
+		rte_free(te);
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	cb->func = func;
+	cb->user_data = user_data;
+	te->data = cb;
+	TAILQ_INSERT_TAIL(list, te, next);
+	ret = 0;
+
+exit:
+	rte_mcfg_tailq_write_unlock();
+	rte_errno = -ret;
+	return ret;
+}
+
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data)
+{
+	struct mempool_callback_list *list;
+	struct rte_tailq_entry *te = NULL;
+	struct mempool_callback_data *cb;
+	int ret = -ENOENT;
+
+	rte_mcfg_tailq_write_lock();
+	list = RTE_TAILQ_CAST(callback_tailq.head, mempool_callback_list);
+	TAILQ_FOREACH(te, list, next) {
+		cb = te->data;
+		if (cb->func == func && cb->user_data == user_data) {
+			TAILQ_REMOVE(list, te, next);
+			ret = 0;
+			break;
+		}
+	}
+	rte_mcfg_tailq_write_unlock();
+
+	if (ret == 0) {
+		rte_free(te);
+		rte_free(cb);
+	}
+	rte_errno = -ret;
+	return ret;
+}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 88bcbc51ef..5799d4a705 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1769,6 +1769,68 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *arg),
 int
 rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz);
 
+/**
+ * Mempool event type.
+ * @internal
+ */
+enum rte_mempool_event {
+	/** Occurs after a mempool is fully populated. */
+	RTE_MEMPOOL_EVENT_READY = 0,
+	/** Occurs before the destruction of a mempool begins. */
+	RTE_MEMPOOL_EVENT_DESTROY = 1,
+};
+
+/**
+ * @internal
+ * Mempool event callback.
+ *
+ * rte_mempool_event_callback_register() may be called from within the callback,
+ * but the callbacks registered this way will not be invoked for the same event.
+ * rte_mempool_event_callback_unregister() may only be safely called
+ * to remove the running callback.
+ */
+typedef void (rte_mempool_event_callback)(
+		enum rte_mempool_event event,
+		struct rte_mempool *mp,
+		void *user_data);
+
+/**
+ * @internal
+ * Register a callback function invoked on mempool life cycle event.
+ * The function will be invoked in the process
+ * that performs an action which triggers the callback.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_register(rte_mempool_event_callback *func,
+				    void *user_data);
+
+/**
+ * @internal
+ * Unregister a callback added with rte_mempool_event_callback_register().
+ * @p func and @p user_data must exactly match registration parameters.
+ *
+ * @param func
+ *   Callback function.
+ * @param user_data
+ *   User data.
+ *
+ * @return
+ *   0 on success, negative on failure and rte_errno is set.
+ */
+__rte_internal
+int
+rte_mempool_event_callback_unregister(rte_mempool_event_callback *func,
+				      void *user_data);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/mempool/version.map b/lib/mempool/version.map
index 9f77da6fff..1b7d7c5456 100644
--- a/lib/mempool/version.map
+++ b/lib/mempool/version.map
@@ -64,3 +64,11 @@ EXPERIMENTAL {
 	__rte_mempool_trace_ops_free;
 	__rte_mempool_trace_set_ops_byname;
 };
+
+INTERNAL {
+	global:
+
+	# added in 21.11
+	rte_mempool_event_callback_register;
+	rte_mempool_event_callback_unregister;
+};
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v9 2/4] mempool: add non-IO flag
  2021-10-18 22:43               ` [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-10-18 22:43                 ` Dmitry Kozlyuk
  2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 22:43 UTC (permalink / raw)
  To: dev
  Cc: David Marchand, Matan Azrad, Andrew Rybchenko, Maryam Tahhan,
	Reshma Pattan, Olivier Matz

A mempool is a generic allocator that is not necessarily used for
device IO operations, so its memory is not necessarily used for DMA.
Add MEMPOOL_F_NON_IO flag to mark such mempools automatically
a) if their objects are not contiguous;
b) if IOVA is not available for any object.
Other components can inspect this flag
in order to optimize their memory management.

Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
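Note for reviewers, not part of the commit message: because the flag is
inferred automatically, a consumer only needs to read mp->flags. A minimal
sketch, mirroring the check used later in this series; the helper name
my_drv_mempool_needs_dma_setup is hypothetical.

#include <stdbool.h>

#include <rte_mempool.h>

/* Skip DMA-related setup (e.g. MR creation) for pools never used for IO. */
static bool
my_drv_mempool_needs_dma_setup(const struct rte_mempool *mp)
{
	return (mp->flags & MEMPOOL_F_NON_IO) == 0;
}
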
 app/proc-info/main.c                   |   6 +-
 app/test/test_mempool.c                | 115 +++++++++++++++++++++++++
 doc/guides/rel_notes/release_21_11.rst |   3 +
 lib/mempool/rte_mempool.c              |  10 +++
 lib/mempool/rte_mempool.h              |   2 +
 5 files changed, 134 insertions(+), 2 deletions(-)

diff --git a/app/proc-info/main.c b/app/proc-info/main.c
index a8e928fa9f..8ec9cadd79 100644
--- a/app/proc-info/main.c
+++ b/app/proc-info/main.c
@@ -1295,7 +1295,8 @@ show_mempool(char *name)
 				"\t  -- No cache align (%c)\n"
 				"\t  -- SP put (%c), SC get (%c)\n"
 				"\t  -- Pool created (%c)\n"
-				"\t  -- No IOVA config (%c)\n",
+				"\t  -- No IOVA config (%c)\n"
+				"\t  -- Not used for IO (%c)\n",
 				ptr->name,
 				ptr->socket_id,
 				(flags & MEMPOOL_F_NO_SPREAD) ? 'y' : 'n',
@@ -1303,7 +1304,8 @@ show_mempool(char *name)
 				(flags & MEMPOOL_F_SP_PUT) ? 'y' : 'n',
 				(flags & MEMPOOL_F_SC_GET) ? 'y' : 'n',
 				(flags & MEMPOOL_F_POOL_CREATED) ? 'y' : 'n',
-				(flags & MEMPOOL_F_NO_IOVA_CONTIG) ? 'y' : 'n');
+				(flags & MEMPOOL_F_NO_IOVA_CONTIG) ? 'y' : 'n',
+				(flags & MEMPOOL_F_NON_IO) ? 'y' : 'n');
 			printf("  - Size %u Cache %u element %u\n"
 				"  - header %u trailer %u\n"
 				"  - private data size %u\n",
diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index 5339a4cbd8..f4947680bc 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -12,6 +12,7 @@
 #include <sys/queue.h>
 
 #include <rte_common.h>
+#include <rte_eal_paging.h>
 #include <rte_log.h>
 #include <rte_debug.h>
 #include <rte_errno.h>
@@ -729,6 +730,112 @@ test_mempool_events_safety(void)
 #pragma pop_macro("RTE_TEST_TRACE_FAILURE")
 }
 
+#pragma push_macro("RTE_TEST_TRACE_FAILURE")
+#undef RTE_TEST_TRACE_FAILURE
+#define RTE_TEST_TRACE_FAILURE(...) do { \
+		ret = TEST_FAILED; \
+		goto exit; \
+	} while (0)
+
+static int
+test_mempool_flag_non_io_set_when_no_iova_contig_set(void)
+{
+	struct rte_mempool *mp = NULL;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, MEMPOOL_F_NO_IOVA_CONTIG);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(), NULL);
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(-ret));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when NO_IOVA_CONTIG is set");
+	ret = TEST_SUCCESS;
+exit:
+	rte_mempool_free(mp);
+	return ret;
+}
+
+static int
+test_mempool_flag_non_io_unset_when_populated_with_valid_iova(void)
+{
+	void *virt = NULL;
+	rte_iova_t iova;
+	size_t total_size = MEMPOOL_ELT_SIZE * MEMPOOL_SIZE;
+	size_t block_size = total_size / 3;
+	struct rte_mempool *mp = NULL;
+	int ret;
+
+	/*
+	 * Since objects from the pool are never used in the test,
+	 * we don't care about contiguous IOVA; on the other hand,
+	 * requiring it could cause spurious test failures.
+	 */
+	virt = rte_malloc("test_mempool", total_size, rte_mem_page_size());
+	RTE_TEST_ASSERT_NOT_NULL(virt, "Cannot allocate memory");
+	iova = rte_mem_virt2iova(virt);
+	RTE_TEST_ASSERT_NOT_EQUAL(iova,  RTE_BAD_IOVA, "Cannot get IOVA");
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+
+	ret = rte_mempool_populate_iova(mp, RTE_PTR_ADD(virt, 1 * block_size),
+					RTE_BAD_IOVA, block_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(-ret));
+	RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
+			"NON_IO flag is not set when mempool is populated with only RTE_BAD_IOVA");
+
+	ret = rte_mempool_populate_iova(mp, virt, iova, block_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(-ret));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is not unset when mempool is populated with valid IOVA");
+
+	ret = rte_mempool_populate_iova(mp, RTE_PTR_ADD(virt, 2 * block_size),
+					RTE_BAD_IOVA, block_size, NULL, NULL);
+	RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
+			rte_strerror(-ret));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is set even when some objects have valid IOVA");
+	ret = TEST_SUCCESS;
+
+exit:
+	rte_mempool_free(mp);
+	rte_free(virt);
+	return ret;
+}
+
+static int
+test_mempool_flag_non_io_unset_by_default(void)
+{
+	struct rte_mempool *mp;
+	int ret;
+
+	mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
+				      MEMPOOL_ELT_SIZE, 0, 0,
+				      SOCKET_ID_ANY, 0);
+	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
+				 rte_strerror(rte_errno));
+	ret = rte_mempool_populate_default(mp);
+	RTE_TEST_ASSERT_EQUAL(ret, (int)mp->size, "Failed to populate mempool: %s",
+			      rte_strerror(-ret));
+	RTE_TEST_ASSERT(!(mp->flags & MEMPOOL_F_NON_IO),
+			"NON_IO flag is set by default");
+	ret = TEST_SUCCESS;
+exit:
+	rte_mempool_free(mp);
+	return ret;
+}
+
+#pragma pop_macro("RTE_TEST_TRACE_FAILURE")
+
 static int
 test_mempool(void)
 {
@@ -914,6 +1021,14 @@ test_mempool(void)
 	if (test_mempool_events_safety() < 0)
 		GOTO_ERR(ret, err);
 
+	/* test NON_IO flag inference */
+	if (test_mempool_flag_non_io_unset_by_default() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_set_when_no_iova_contig_set() < 0)
+		GOTO_ERR(ret, err);
+	if (test_mempool_flag_non_io_unset_when_populated_with_valid_iova() < 0)
+		GOTO_ERR(ret, err);
+
 	rte_mempool_list_dump(stdout);
 
 	ret = 0;
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index d5435a64aa..f6bb5adeff 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -237,6 +237,9 @@ API Changes
   the crypto/security operation. This field will be used to communicate
   events such as soft expiry with IPsec in lookaside mode.
 
+* mempool: Added ``MEMPOOL_F_NON_IO`` flag to give a hint to DPDK components
+  that objects from this pool will not be used for device IO (e.g. DMA).
+
 
 ABI Changes
 -----------
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 8810d08ab5..7d7d97d85d 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -372,6 +372,10 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	STAILQ_INSERT_TAIL(&mp->mem_list, memhdr, next);
 	mp->nb_mem_chunks++;
 
+	/* At least some objects in the pool can now be used for IO. */
+	if (iova != RTE_BAD_IOVA)
+		mp->flags &= ~MEMPOOL_F_NON_IO;
+
 	/* Report the mempool as ready only when fully populated. */
 	if (mp->populated_size >= mp->size)
 		mempool_event_callback_invoke(RTE_MEMPOOL_EVENT_READY, mp);
@@ -851,6 +855,12 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 		return NULL;
 	}
 
+	/*
+	 * No objects in the pool can be used for IO until it's populated
+	 * with at least some objects with valid IOVA.
+	 */
+	flags |= MEMPOOL_F_NON_IO;
+
 	/* "no cache align" imply "no spread" */
 	if (flags & MEMPOOL_F_NO_CACHE_ALIGN)
 		flags |= MEMPOOL_F_NO_SPREAD;
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 5799d4a705..b2e20c8855 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -257,6 +257,8 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_IOVA_CONTIG 0x0020 /**< Don't need IOVA contiguous objs. */
+/** Internal: no object from the pool can be used for device IO (DMA). */
+#define MEMPOOL_F_NON_IO         0x0040
 
 /**
  * @internal When debug is enabled, store some statistics.
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v9 3/4] common/mlx5: add mempool registration facilities
  2021-10-18 22:43               ` [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit " Dmitry Kozlyuk
  2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 1/4] mempool: add event callbacks Dmitry Kozlyuk
  2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 2/4] mempool: add non-IO flag Dmitry Kozlyuk
@ 2021-10-18 22:43                 ` Dmitry Kozlyuk
  2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
  2021-10-19 14:36                 ` [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit " Thomas Monjalon
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 22:43 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko, Ray Kinsella, Anatoly Burakov

Add internal API to register mempools, that is, to create memory
regions (MR) for their memory and store them in a separate database.
Implementation deals with multi-process, so that class drivers don't
need to. Each protection domain has its own database. Memory regions
can be shared within a database if they represent a single hugepage
covering one or more mempools entirely.

Add internal API to lookup an MR key for an address that belongs
to a known mempool. It is the responsibility of the class driver
to extract the mempool from an mbuf.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
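Note for reviewers, not part of the commit message: a rough sketch of the
intended call pattern for a class driver, assuming the mlx5 common headers
(mlx5_common_mr.h, mlx5_common_mp.h) are on the include path. The
my_class_* wrappers are hypothetical; only mlx5_mr_mempool_register()
and mlx5_mr_mempool2mr_bh() come from this patch.

#include <rte_errno.h>
#include <rte_mbuf.h>

#include <mlx5_common_mp.h>
#include <mlx5_common_mr.h>

/* Control path: register the mempool used by a queue; EEXIST is benign. */
static int
my_class_register_queue_mempool(struct mlx5_mr_share_cache *share_cache,
				void *pd, struct rte_mempool *mp,
				struct mlx5_mp_id *mp_id)
{
	if (mlx5_mr_mempool_register(share_cache, pd, mp, mp_id) < 0 &&
	    rte_errno != EEXIST)
		return -rte_errno;
	return 0;
}

/* Data path: resolve an mbuf buffer address to an LKey via the mempool DB. */
static uint32_t
my_class_mb2mr(struct mlx5_mr_share_cache *share_cache,
	       struct mlx5_mr_ctrl *mr_ctrl, struct rte_mbuf *mb)
{
	/* The class driver is responsible for recovering the mempool. */
	struct rte_mempool *mp = mb->pool;

	/* UINT32_MAX means no match, e.g. the mempool was never registered. */
	return mlx5_mr_mempool2mr_bh(share_cache, mr_ctrl, mp,
				     (uintptr_t)mb->buf_addr);
}
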
 drivers/common/mlx5/mlx5_common_mp.c |  50 +++
 drivers/common/mlx5/mlx5_common_mp.h |  14 +
 drivers/common/mlx5/mlx5_common_mr.c | 580 +++++++++++++++++++++++++++
 drivers/common/mlx5/mlx5_common_mr.h |  17 +
 drivers/common/mlx5/version.map      |   5 +
 5 files changed, 666 insertions(+)

diff --git a/drivers/common/mlx5/mlx5_common_mp.c b/drivers/common/mlx5/mlx5_common_mp.c
index 673a7c31de..6dfc5535e0 100644
--- a/drivers/common/mlx5/mlx5_common_mp.c
+++ b/drivers/common/mlx5/mlx5_common_mp.c
@@ -54,6 +54,56 @@ mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr)
 	return ret;
 }
 
+/**
+ * @param mp_id
+ *   ID of the MP process.
+ * @param share_cache
+ *   Shared MR cache.
+ * @param pd
+ *   Protection domain.
+ * @param mempool
+ *   Mempool to register or unregister.
+ * @param reg
+ *   True to register the mempool, False to unregister.
+ */
+int
+mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg)
+{
+	struct rte_mp_msg mp_req;
+	struct rte_mp_msg *mp_res;
+	struct rte_mp_reply mp_rep;
+	struct mlx5_mp_param *req = (struct mlx5_mp_param *)mp_req.param;
+	struct mlx5_mp_arg_mempool_reg *arg = &req->args.mempool_reg;
+	struct mlx5_mp_param *res;
+	struct timespec ts = {.tv_sec = MLX5_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
+	enum mlx5_mp_req_type type;
+	int ret;
+
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_SECONDARY);
+	type = reg ? MLX5_MP_REQ_MEMPOOL_REGISTER :
+		     MLX5_MP_REQ_MEMPOOL_UNREGISTER;
+	mp_init_msg(mp_id, &mp_req, type);
+	arg->share_cache = share_cache;
+	arg->pd = pd;
+	arg->mempool = mempool;
+	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
+	if (ret) {
+		DRV_LOG(ERR, "port %u request to primary process failed",
+			mp_id->port_id);
+		return -rte_errno;
+	}
+	MLX5_ASSERT(mp_rep.nb_received == 1);
+	mp_res = &mp_rep.msgs[0];
+	res = (struct mlx5_mp_param *)mp_res->param;
+	ret = res->result;
+	if (ret)
+		rte_errno = -ret;
+	mlx5_free(mp_rep.msgs);
+	return ret;
+}
+
 /**
  * Request Verbs queue state modification to the primary process.
  *
diff --git a/drivers/common/mlx5/mlx5_common_mp.h b/drivers/common/mlx5/mlx5_common_mp.h
index 6829141fc7..527bf3cad8 100644
--- a/drivers/common/mlx5/mlx5_common_mp.h
+++ b/drivers/common/mlx5/mlx5_common_mp.h
@@ -14,6 +14,8 @@
 enum mlx5_mp_req_type {
 	MLX5_MP_REQ_VERBS_CMD_FD = 1,
 	MLX5_MP_REQ_CREATE_MR,
+	MLX5_MP_REQ_MEMPOOL_REGISTER,
+	MLX5_MP_REQ_MEMPOOL_UNREGISTER,
 	MLX5_MP_REQ_START_RXTX,
 	MLX5_MP_REQ_STOP_RXTX,
 	MLX5_MP_REQ_QUEUE_STATE_MODIFY,
@@ -33,6 +35,12 @@ struct mlx5_mp_arg_queue_id {
 	uint16_t queue_id; /* DPDK queue ID. */
 };
 
+struct mlx5_mp_arg_mempool_reg {
+	struct mlx5_mr_share_cache *share_cache;
+	void *pd; /* NULL for MLX5_MP_REQ_MEMPOOL_UNREGISTER */
+	struct rte_mempool *mempool;
+};
+
 /* Pameters for IPC. */
 struct mlx5_mp_param {
 	enum mlx5_mp_req_type type;
@@ -41,6 +49,8 @@ struct mlx5_mp_param {
 	RTE_STD_C11
 	union {
 		uintptr_t addr; /* MLX5_MP_REQ_CREATE_MR */
+		struct mlx5_mp_arg_mempool_reg mempool_reg;
+		/* MLX5_MP_REQ_MEMPOOL_(UN)REGISTER */
 		struct mlx5_mp_arg_queue_state_modify state_modify;
 		/* MLX5_MP_REQ_QUEUE_STATE_MODIFY */
 		struct mlx5_mp_arg_queue_id queue_id;
@@ -91,6 +101,10 @@ void mlx5_mp_uninit_secondary(const char *name);
 __rte_internal
 int mlx5_mp_req_mr_create(struct mlx5_mp_id *mp_id, uintptr_t addr);
 __rte_internal
+int mlx5_mp_req_mempool_reg(struct mlx5_mp_id *mp_id,
+			struct mlx5_mr_share_cache *share_cache, void *pd,
+			struct rte_mempool *mempool, bool reg);
+__rte_internal
 int mlx5_mp_req_queue_state_modify(struct mlx5_mp_id *mp_id,
 				   struct mlx5_mp_arg_queue_state_modify *sm);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
index 98fe8698e2..2e039a4e70 100644
--- a/drivers/common/mlx5/mlx5_common_mr.c
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -2,7 +2,10 @@
  * Copyright 2016 6WIND S.A.
  * Copyright 2020 Mellanox Technologies, Ltd
  */
+#include <stddef.h>
+
 #include <rte_eal_memconfig.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_mempool.h>
 #include <rte_malloc.h>
@@ -21,6 +24,29 @@ struct mr_find_contig_memsegs_data {
 	const struct rte_memseg_list *msl;
 };
 
+/* Virtual memory range. */
+struct mlx5_range {
+	uintptr_t start;
+	uintptr_t end;
+};
+
+/** Memory region for a mempool. */
+struct mlx5_mempool_mr {
+	struct mlx5_pmd_mr pmd_mr;
+	uint32_t refcnt; /**< Number of mempools sharing this MR. */
+};
+
+/* Mempool registration. */
+struct mlx5_mempool_reg {
+	LIST_ENTRY(mlx5_mempool_reg) next;
+	/** Registered mempool, used to designate registrations. */
+	struct rte_mempool *mp;
+	/** Memory regions for the address ranges of the mempool. */
+	struct mlx5_mempool_mr *mrs;
+	/** Number of memory regions. */
+	unsigned int mrs_n;
+};
+
 /**
  * Expand B-tree table to a given size. Can't be called with holding
  * memory_hotplug_lock or share_cache.rwlock due to rte_realloc().
@@ -1191,3 +1217,557 @@ mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused)
 	rte_rwlock_read_unlock(&share_cache->rwlock);
 #endif
 }
+
+static int
+mlx5_range_compare_start(const void *lhs, const void *rhs)
+{
+	const struct mlx5_range *r1 = lhs, *r2 = rhs;
+
+	if (r1->start > r2->start)
+		return 1;
+	else if (r1->start < r2->start)
+		return -1;
+	return 0;
+}
+
+static void
+mlx5_range_from_mempool_chunk(struct rte_mempool *mp, void *opaque,
+			      struct rte_mempool_memhdr *memhdr,
+			      unsigned int idx)
+{
+	struct mlx5_range *ranges = opaque, *range = &ranges[idx];
+	uint64_t page_size = rte_mem_page_size();
+
+	RTE_SET_USED(mp);
+	range->start = RTE_ALIGN_FLOOR((uintptr_t)memhdr->addr, page_size);
+	range->end = RTE_ALIGN_CEIL(range->start + memhdr->len, page_size);
+}
+
+/**
+ * Get VA-contiguous ranges of the mempool memory.
+ * Each range start and end is aligned to the system page size.
+ *
+ * @param[in] mp
+ *   Analyzed mempool.
+ * @param[out] out
+ *   Receives the ranges, caller must release it with free().
+ * @param[out] out_n
+ *   Receives the number of @p out elements.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_get_mempool_ranges(struct rte_mempool *mp, struct mlx5_range **out,
+			unsigned int *out_n)
+{
+	struct mlx5_range *chunks;
+	unsigned int chunks_n = mp->nb_mem_chunks, contig_n, i;
+
+	/* Collect page-aligned memory ranges of the mempool. */
+	chunks = calloc(chunks_n, sizeof(chunks[0]));
+	if (chunks == NULL)
+		return -1;
+	rte_mempool_mem_iter(mp, mlx5_range_from_mempool_chunk, chunks);
+	/* Merge adjacent chunks and place them at the beginning. */
+	qsort(chunks, chunks_n, sizeof(chunks[0]), mlx5_range_compare_start);
+	contig_n = 1;
+	for (i = 1; i < chunks_n; i++)
+		if (chunks[i - 1].end != chunks[i].start) {
+			chunks[contig_n - 1].end = chunks[i - 1].end;
+			chunks[contig_n] = chunks[i];
+			contig_n++;
+		}
+	/* Extend the last contiguous chunk to the end of the mempool. */
+	chunks[contig_n - 1].end = chunks[i - 1].end;
+	*out = chunks;
+	*out_n = contig_n;
+	return 0;
+}
+
+/**
+ * Analyze mempool memory to select memory ranges to register.
+ *
+ * @param[in] mp
+ *   Mempool to analyze.
+ * @param[out] out
+ *   Receives memory ranges to register, aligned to the system page size.
+ *   The caller must release them with free().
+ * @param[out] out_n
+ *   Receives the number of @p out items.
+ * @param[out] share_hugepage
+ *   Receives True if the entire pool resides within a single hugepage.
+ *
+ * @return
+ *   0 on success, (-1) on failure.
+ */
+static int
+mlx5_mempool_reg_analyze(struct rte_mempool *mp, struct mlx5_range **out,
+			 unsigned int *out_n, bool *share_hugepage)
+{
+	struct mlx5_range *ranges = NULL;
+	unsigned int i, ranges_n = 0;
+	struct rte_memseg_list *msl;
+
+	if (mlx5_get_mempool_ranges(mp, &ranges, &ranges_n) < 0) {
+		DRV_LOG(ERR, "Cannot get address ranges for mempool %s",
+			mp->name);
+		return -1;
+	}
+	/* Check if the hugepage of the pool can be shared. */
+	*share_hugepage = false;
+	msl = rte_mem_virt2memseg_list((void *)ranges[0].start);
+	if (msl != NULL) {
+		uint64_t hugepage_sz = 0;
+
+		/* Check that all ranges are on pages of the same size. */
+		for (i = 0; i < ranges_n; i++) {
+			if (hugepage_sz != 0 && hugepage_sz != msl->page_sz)
+				break;
+			hugepage_sz = msl->page_sz;
+		}
+		if (i == ranges_n) {
+			/*
+			 * If the entire pool is within one hugepage,
+			 * combine all ranges into one of the hugepage size.
+			 */
+			uintptr_t reg_start = ranges[0].start;
+			uintptr_t reg_end = ranges[ranges_n - 1].end;
+			uintptr_t hugepage_start =
+				RTE_ALIGN_FLOOR(reg_start, hugepage_sz);
+			uintptr_t hugepage_end = hugepage_start + hugepage_sz;
+			if (reg_end < hugepage_end) {
+				ranges[0].start = hugepage_start;
+				ranges[0].end = hugepage_end;
+				ranges_n = 1;
+				*share_hugepage = true;
+			}
+		}
+	}
+	*out = ranges;
+	*out_n = ranges_n;
+	return 0;
+}
+
+/** Create a registration object for the mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_create(struct rte_mempool *mp, unsigned int mrs_n)
+{
+	struct mlx5_mempool_reg *mpr = NULL;
+
+	mpr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+			  sizeof(*mpr) + mrs_n * sizeof(mpr->mrs[0]),
+			  RTE_CACHE_LINE_SIZE, SOCKET_ID_ANY);
+	if (mpr == NULL) {
+		DRV_LOG(ERR, "Cannot allocate mempool %s registration object",
+			mp->name);
+		return NULL;
+	}
+	mpr->mp = mp;
+	mpr->mrs = (struct mlx5_mempool_mr *)(mpr + 1);
+	mpr->mrs_n = mrs_n;
+	return mpr;
+}
+
+/**
+ * Destroy a mempool registration object.
+ *
+ * @param standalone
+ *   Whether @p mpr owns its MRs exclusively, i.e. they are not shared.
+ */
+static void
+mlx5_mempool_reg_destroy(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mempool_reg *mpr, bool standalone)
+{
+	if (standalone) {
+		unsigned int i;
+
+		for (i = 0; i < mpr->mrs_n; i++)
+			share_cache->dereg_mr_cb(&mpr->mrs[i].pmd_mr);
+	}
+	mlx5_free(mpr);
+}
+
+/** Find registration object of a mempool. */
+static struct mlx5_mempool_reg *
+mlx5_mempool_reg_lookup(struct mlx5_mr_share_cache *share_cache,
+			struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp)
+			break;
+	return mpr;
+}
+
+/** Increment reference counters of MRs used in the registration. */
+static void
+mlx5_mempool_reg_attach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		__atomic_add_fetch(&mpr->mrs[i].refcnt, 1, __ATOMIC_RELAXED);
+}
+
+/**
+ * Decrement reference counters of MRs used in the registration.
+ *
+ * @return True if no more references to @p mpr MRs exist, False otherwise.
+ */
+static bool
+mlx5_mempool_reg_detach(struct mlx5_mempool_reg *mpr)
+{
+	unsigned int i;
+	bool ret = false;
+
+	for (i = 0; i < mpr->mrs_n; i++)
+		ret |= __atomic_sub_fetch(&mpr->mrs[i].refcnt, 1,
+					  __ATOMIC_RELAXED) == 0;
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache,
+				 void *pd, struct rte_mempool *mp)
+{
+	struct mlx5_range *ranges = NULL;
+	struct mlx5_mempool_reg *mpr, *new_mpr;
+	unsigned int i, ranges_n;
+	bool share_hugepage;
+	int ret = -1;
+
+	/* Early check to avoid unnecessary creation of MRs. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+	if (mlx5_mempool_reg_analyze(mp, &ranges, &ranges_n,
+				     &share_hugepage) < 0) {
+		DRV_LOG(ERR, "Cannot get mempool %s memory ranges", mp->name);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	new_mpr = mlx5_mempool_reg_create(mp, ranges_n);
+	if (new_mpr == NULL) {
+		DRV_LOG(ERR,
+			"Cannot create a registration object for mempool %s in PD %p",
+			mp->name, pd);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+	/*
+	 * If the entire mempool fits in a single hugepage, the MR for this
+	 * hugepage can be shared across mempools that also fit in it.
+	 */
+	if (share_hugepage) {
+		rte_rwlock_write_lock(&share_cache->rwlock);
+		LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next) {
+			if (mpr->mrs[0].pmd_mr.addr == (void *)ranges[0].start)
+				break;
+		}
+		if (mpr != NULL) {
+			new_mpr->mrs = mpr->mrs;
+			mlx5_mempool_reg_attach(new_mpr);
+			LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+					 new_mpr, next);
+		}
+		rte_rwlock_write_unlock(&share_cache->rwlock);
+		if (mpr != NULL) {
+			DRV_LOG(DEBUG, "Shared MR %#x in PD %p for mempool %s with mempool %s",
+				mpr->mrs[0].pmd_mr.lkey, pd, mp->name,
+				mpr->mp->name);
+			ret = 0;
+			goto exit;
+		}
+	}
+	for (i = 0; i < ranges_n; i++) {
+		struct mlx5_mempool_mr *mr = &new_mpr->mrs[i];
+		const struct mlx5_range *range = &ranges[i];
+		size_t len = range->end - range->start;
+
+		if (share_cache->reg_mr_cb(pd, (void *)range->start, len,
+		    &mr->pmd_mr) < 0) {
+			DRV_LOG(ERR,
+				"Failed to create an MR in PD %p for address range "
+				"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+				pd, range->start, range->end, len, mp->name);
+			break;
+		}
+		DRV_LOG(DEBUG,
+			"Created a new MR %#x in PD %p for address range "
+			"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
+			mr->pmd_mr.lkey, pd, range->start, range->end, len,
+			mp->name);
+	}
+	if (i != ranges_n) {
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EINVAL;
+		goto exit;
+	}
+	/* Concurrent registration is not supposed to happen. */
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr == NULL) {
+		mlx5_mempool_reg_attach(new_mpr);
+		LIST_INSERT_HEAD(&share_cache->mempool_reg_list,
+				 new_mpr, next);
+		ret = 0;
+	}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr != NULL) {
+		DRV_LOG(DEBUG, "Mempool %s is already registered for PD %p",
+			mp->name, pd);
+		mlx5_mempool_reg_destroy(share_cache, new_mpr, true);
+		rte_errno = EEXIST;
+		goto exit;
+	}
+exit:
+	free(ranges);
+	return ret;
+}
+
+static int
+mlx5_mr_mempool_register_secondary(struct mlx5_mr_share_cache *share_cache,
+				   void *pd, struct rte_mempool *mp,
+				   struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, pd, mp, true);
+}
+
+/**
+ * Register the memory of a mempool in the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param pd
+ *   Protection domain object.
+ * @param mp
+ *   Mempool to register.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_register_primary(share_cache, pd, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_register_secondary(share_cache, pd, mp,
+							  mp_id);
+	default:
+		return -1;
+	}
+}
+
+static int
+mlx5_mr_mempool_unregister_primary(struct mlx5_mr_share_cache *share_cache,
+				   struct rte_mempool *mp)
+{
+	struct mlx5_mempool_reg *mpr;
+	bool standalone = false;
+
+	rte_rwlock_write_lock(&share_cache->rwlock);
+	LIST_FOREACH(mpr, &share_cache->mempool_reg_list, next)
+		if (mpr->mp == mp) {
+			LIST_REMOVE(mpr, next);
+			standalone = mlx5_mempool_reg_detach(mpr);
+			if (standalone)
+				/*
+				 * The unlock operation below provides a memory
+				 * barrier due to its store-release semantics.
+				 */
+				++share_cache->dev_gen;
+			break;
+		}
+	rte_rwlock_write_unlock(&share_cache->rwlock);
+	if (mpr == NULL) {
+		rte_errno = ENOENT;
+		return -1;
+	}
+	mlx5_mempool_reg_destroy(share_cache, mpr, standalone);
+	return 0;
+}
+
+static int
+mlx5_mr_mempool_unregister_secondary(struct mlx5_mr_share_cache *share_cache,
+				     struct rte_mempool *mp,
+				     struct mlx5_mp_id *mp_id)
+{
+	if (mp_id == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return mlx5_mp_req_mempool_reg(mp_id, share_cache, NULL, mp, false);
+}
+
+/**
+ * Unregister the memory of a mempool from the protection domain.
+ *
+ * @param share_cache
+ *   Shared MR cache of the protection domain.
+ * @param mp
+ *   Mempool to unregister.
+ * @param mp_id
+ *   Multi-process identifier, may be NULL for the primary process.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
+{
+	if (mp->flags & MEMPOOL_F_NON_IO)
+		return 0;
+	switch (rte_eal_process_type()) {
+	case RTE_PROC_PRIMARY:
+		return mlx5_mr_mempool_unregister_primary(share_cache, mp);
+	case RTE_PROC_SECONDARY:
+		return mlx5_mr_mempool_unregister_secondary(share_cache, mp,
+							    mp_id);
+	default:
+		return -1;
+	}
+}
+
+/**
+ * Lookup an MR key by an address in a registered mempool.
+ *
+ * @param mpr
+ *   Mempool registration object.
+ * @param addr
+ *   Address within the mempool.
+ * @param entry
+ *   Bottom-half cache entry to fill.
+ *
+ * @return
+ *   MR key or UINT32_MAX on failure, which can only happen
+ *   if the address is not from within the mempool.
+ */
+static uint32_t
+mlx5_mempool_reg_addr2mr(struct mlx5_mempool_reg *mpr, uintptr_t addr,
+			 struct mr_cache_entry *entry)
+{
+	uint32_t lkey = UINT32_MAX;
+	unsigned int i;
+
+	for (i = 0; i < mpr->mrs_n; i++) {
+		const struct mlx5_pmd_mr *mr = &mpr->mrs[i].pmd_mr;
+		uintptr_t mr_addr = (uintptr_t)mr->addr;
+
+		if (mr_addr <= addr) {
+			lkey = rte_cpu_to_be_32(mr->lkey);
+			entry->start = mr_addr;
+			entry->end = mr_addr + mr->len;
+			entry->lkey = lkey;
+			break;
+		}
+	}
+	return lkey;
+}
+
+/**
+ * Update bottom-half cache from the list of mempool registrations.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param entry
+ *   Pointer to an entry in the bottom-half cache to update
+ *   with the MR lkey looked up.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+static uint32_t
+mlx5_lookup_mempool_regs(struct mlx5_mr_share_cache *share_cache,
+			 struct mlx5_mr_ctrl *mr_ctrl,
+			 struct mr_cache_entry *entry,
+			 struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
+	struct mlx5_mempool_reg *mpr;
+	uint32_t lkey = UINT32_MAX;
+
+	/* If local cache table is full, try to double it. */
+	if (unlikely(bt->len == bt->size))
+		mr_btree_expand(bt, bt->size << 1);
+	/* Look up in mempool registrations. */
+	rte_rwlock_read_lock(&share_cache->rwlock);
+	mpr = mlx5_mempool_reg_lookup(share_cache, mp);
+	if (mpr != NULL)
+		lkey = mlx5_mempool_reg_addr2mr(mpr, addr, entry);
+	rte_rwlock_read_unlock(&share_cache->rwlock);
+	/*
+	 * Update local cache. Even if it fails, return the found entry
+	 * to update top-half cache. Next time, this entry will be found
+	 * in the global cache.
+	 */
+	if (lkey != UINT32_MAX)
+		mr_btree_insert(bt, entry);
+	return lkey;
+}
+
+/**
+ * Bottom-half lookup for the address from the mempool.
+ *
+ * @param share_cache
+ *   Pointer to a global shared MR cache.
+ * @param mr_ctrl
+ *   Per-queue MR control handle.
+ * @param mp
+ *   Mempool containing the address.
+ * @param addr
+ *   Address to lookup.
+ * @return
+ *   MR lkey on success, UINT32_MAX on failure.
+ */
+uint32_t
+mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+		      struct mlx5_mr_ctrl *mr_ctrl,
+		      struct rte_mempool *mp, uintptr_t addr)
+{
+	struct mr_cache_entry *repl = &mr_ctrl->cache[mr_ctrl->head];
+	uint32_t lkey;
+	uint16_t bh_idx = 0;
+
+	/* Binary-search MR translation table. */
+	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
+	/* Update top-half cache. */
+	if (likely(lkey != UINT32_MAX)) {
+		*repl = (*mr_ctrl->cache_bh.table)[bh_idx];
+	} else {
+		lkey = mlx5_lookup_mempool_regs(share_cache, mr_ctrl, repl,
+						mp, addr);
+		/* Can only fail if the address is not from the mempool. */
+		if (unlikely(lkey == UINT32_MAX))
+			return UINT32_MAX;
+	}
+	/* Update the most recently used entry. */
+	mr_ctrl->mru = mr_ctrl->head;
+	/* Point to the next victim, the oldest. */
+	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
+	return lkey;
+}
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
index 6e465a05e9..685ac98e08 100644
--- a/drivers/common/mlx5/mlx5_common_mr.h
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -13,6 +13,7 @@
 
 #include <rte_rwlock.h>
 #include <rte_bitmap.h>
+#include <rte_mbuf.h>
 #include <rte_memory.h>
 
 #include "mlx5_glue.h"
@@ -75,6 +76,7 @@ struct mlx5_mr_ctrl {
 } __rte_packed;
 
 LIST_HEAD(mlx5_mr_list, mlx5_mr);
+LIST_HEAD(mlx5_mempool_reg_list, mlx5_mempool_reg);
 
 /* Global per-device MR cache. */
 struct mlx5_mr_share_cache {
@@ -83,6 +85,7 @@ struct mlx5_mr_share_cache {
 	struct mlx5_mr_btree cache; /* Global MR cache table. */
 	struct mlx5_mr_list mr_list; /* Registered MR list. */
 	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
+	struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */
 	mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
 	mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 } __rte_packed;
@@ -136,6 +139,10 @@ uint32_t mlx5_mr_addr2mr_bh(void *pd, struct mlx5_mp_id *mp_id,
 			    struct mlx5_mr_ctrl *mr_ctrl,
 			    uintptr_t addr, unsigned int mr_ext_memseg_en);
 __rte_internal
+uint32_t mlx5_mr_mempool2mr_bh(struct mlx5_mr_share_cache *share_cache,
+			       struct mlx5_mr_ctrl *mr_ctrl,
+			       struct rte_mempool *mp, uintptr_t addr);
+__rte_internal
 void mlx5_mr_release_cache(struct mlx5_mr_share_cache *mr_cache);
 __rte_internal
 void mlx5_mr_dump_cache(struct mlx5_mr_share_cache *share_cache __rte_unused);
@@ -179,4 +186,14 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr);
 __rte_internal
 void
 mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb);
+
+__rte_internal
+int
+mlx5_mr_mempool_register(struct mlx5_mr_share_cache *share_cache, void *pd,
+			 struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+__rte_internal
+int
+mlx5_mr_mempool_unregister(struct mlx5_mr_share_cache *share_cache,
+			   struct rte_mempool *mp, struct mlx5_mp_id *mp_id);
+
 #endif /* RTE_PMD_MLX5_COMMON_MR_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index d3c5040aac..85100d5afb 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -152,4 +152,9 @@ INTERNAL {
 	mlx5_realloc;
 
 	mlx5_translate_port_name; # WINDOWS_NO_EXPORT
+
+	mlx5_mr_mempool_register;
+	mlx5_mr_mempool_unregister;
+	mlx5_mp_req_mempool_reg;
+	mlx5_mr_mempool2mr_bh;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [dpdk-dev] [PATCH v9 4/4] net/mlx5: support mempool registration
  2021-10-18 22:43               ` [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                                   ` (2 preceding siblings ...)
  2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
@ 2021-10-18 22:43                 ` Dmitry Kozlyuk
  2021-10-19 14:36                 ` [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit " Thomas Monjalon
  4 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-18 22:43 UTC (permalink / raw)
  To: dev; +Cc: Matan Azrad, Viacheslav Ovsiienko

When the first port in a given protection domain (PD) starts,
install a mempool event callback for this PD and register all existing
memory regions (MR) for it. When the last port in a PD closes,
remove the callback and unregister all mempools for this PD.
This behavior can be switched off with a new devarg: mr_mempool_reg_en.

On TX slow path, i.e. when an MR key for the address of the buffer
to send is not in the local cache, first try to retrieve it from
the database of registered mempools. Direct and indirect mbufs are
supported, as well as externally attached buffers from the MLX5 MPRQ
feature. Lookup in the database of non-mempool memory is the last resort.

RX mempools are registered regardless of the devarg value.
On the RX data path, only the local cache and the mempool database are used.
If implicit mempool registration is disabled, these mempools
are unregistered at port stop, releasing the MRs.
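
For example (the PCI address below is only a placeholder), implicit
registration could be disabled per device with the new devarg:

    dpdk-testpmd -a 0000:03:00.0,mr_mempool_reg_en=0 -- -i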

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  13 +++
 doc/guides/rel_notes/release_21_11.rst |   6 +
 drivers/net/mlx5/linux/mlx5_mp_os.c    |  44 +++++++
 drivers/net/mlx5/linux/mlx5_os.c       |   4 +-
 drivers/net/mlx5/mlx5.c                | 152 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h                |  10 ++
 drivers/net/mlx5/mlx5_mr.c             | 120 +++++--------------
 drivers/net/mlx5/mlx5_mr.h             |   2 -
 drivers/net/mlx5/mlx5_rx.h             |  21 ++--
 drivers/net/mlx5/mlx5_rxq.c            |  13 +++
 drivers/net/mlx5/mlx5_trigger.c        |  77 +++++++++++--
 drivers/net/mlx5/windows/mlx5_os.c     |   1 +
 12 files changed, 347 insertions(+), 116 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index bae73f42d8..106e32e1c4 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1001,6 +1001,19 @@ Driver options
 
   Enabled by default.
 
+- ``mr_mempool_reg_en`` parameter [int]
+
+  A nonzero value enables implicit registration of DMA memory of all mempools
+  except those having ``MEMPOOL_F_NON_IO``. This flag is set automatically
+  for mempools populated with non-contiguous objects or those without IOVA.
+  The effect is that when a packet from a mempool is transmitted,
+  its memory is already registered for DMA in the PMD and no registration
+  will happen on the data path. The tradeoff is extra work on the creation
+  of each mempool and increased HW resource use if some mempools
+  are not used with MLX5 devices.
+
+  Enabled by default.
+
 - ``representor`` parameter [list]
 
   This parameter can be used to instantiate DPDK Ethernet devices from
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index f6bb5adeff..4d3374f7e7 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -167,6 +167,12 @@ New Features
   * Added tests to verify tunnel header verification in IPsec inbound.
   * Added tests to verify inner checksum.
 
+* **Updated Mellanox mlx5 driver.**
+
+  Updated the Mellanox mlx5 driver with new features and improvements, including:
+
+  * Added implicit mempool registration to avoid data path hiccups (opt-out).
+
 
 Removed Items
 -------------
diff --git a/drivers/net/mlx5/linux/mlx5_mp_os.c b/drivers/net/mlx5/linux/mlx5_mp_os.c
index 3a4aa766f8..d2ac375a47 100644
--- a/drivers/net/mlx5/linux/mlx5_mp_os.c
+++ b/drivers/net/mlx5/linux/mlx5_mp_os.c
@@ -20,6 +20,45 @@
 #include "mlx5_tx.h"
 #include "mlx5_utils.h"
 
+/**
+ * Handle a port-agnostic message.
+ *
+ * @return
+ *   0 on success, 1 when message is not port-agnostic, (-1) on error.
+ */
+static int
+mlx5_mp_os_handle_port_agnostic(const struct rte_mp_msg *mp_msg,
+				const void *peer)
+{
+	struct rte_mp_msg mp_res;
+	struct mlx5_mp_param *res = (struct mlx5_mp_param *)mp_res.param;
+	const struct mlx5_mp_param *param =
+		(const struct mlx5_mp_param *)mp_msg->param;
+	const struct mlx5_mp_arg_mempool_reg *mpr;
+	struct mlx5_mp_id mp_id;
+
+	switch (param->type) {
+	case MLX5_MP_REQ_MEMPOOL_REGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_register(mpr->share_cache,
+						       mpr->pd, mpr->mempool,
+						       NULL);
+		return rte_mp_reply(&mp_res, peer);
+	case MLX5_MP_REQ_MEMPOOL_UNREGISTER:
+		mlx5_mp_id_init(&mp_id, param->port_id);
+		mp_init_msg(&mp_id, &mp_res, param->type);
+		mpr = &param->args.mempool_reg;
+		res->result = mlx5_mr_mempool_unregister(mpr->share_cache,
+							 mpr->mempool, NULL);
+		return rte_mp_reply(&mp_res, peer);
+	default:
+		return 1;
+	}
+	return -1;
+}
+
 int
 mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
@@ -34,6 +73,11 @@ mlx5_mp_os_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	int ret;
 
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
+	/* Port-agnostic messages. */
+	ret = mlx5_mp_os_handle_port_agnostic(mp_msg, peer);
+	if (ret <= 0)
+		return ret;
+	/* Port-specific messages. */
 	if (!rte_eth_dev_is_valid_port(param->port_id)) {
 		rte_errno = ENODEV;
 		DRV_LOG(ERR, "port %u invalid port ID", param->port_id);
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 3746057673..e036ed1435 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1034,8 +1034,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		err = mlx5_proc_priv_init(eth_dev);
 		if (err)
 			return NULL;
-		mp_id.port_id = eth_dev->data->port_id;
-		strlcpy(mp_id.name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+		mlx5_mp_id_init(&mp_id, eth_dev->data->port_id);
 		/* Receive command fd from primary process */
 		err = mlx5_mp_req_verbs_cmd_fd(&mp_id);
 		if (err < 0)
@@ -2133,6 +2132,7 @@ mlx5_os_config_default(struct mlx5_dev_config *config)
 	config->txqs_inline = MLX5_ARG_UNSET;
 	config->vf_nl_en = 1;
 	config->mr_ext_memseg_en = 1;
+	config->mr_mempool_reg_en = 1;
 	config->mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	config->mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	config->dv_esw_en = 1;
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 45ccfe2784..1e1b8b736b 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -181,6 +181,9 @@
 /* Device parameter to configure allow or prevent duplicate rules pattern. */
 #define MLX5_ALLOW_DUPLICATE_PATTERN "allow_duplicate_pattern"
 
+/* Device parameter to configure implicit registration of mempool memory. */
+#define MLX5_MR_MEMPOOL_REG_EN "mr_mempool_reg_en"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1088,6 +1091,141 @@ mlx5_alloc_rxtx_uars(struct mlx5_dev_ctx_shared *sh,
 	return err;
 }
 
+/**
+ * Unregister the mempool from the protection domain.
+ *
+ * @param sh
+ *   Pointer to the device shared context.
+ * @param mp
+ *   Mempool being unregistered.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister(struct mlx5_dev_ctx_shared *sh,
+				       struct rte_mempool *mp)
+{
+	struct mlx5_mp_id mp_id;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	if (mlx5_mr_mempool_unregister(&sh->share_cache, mp, &mp_id) < 0)
+		DRV_LOG(WARNING, "Failed to unregister mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to register mempools
+ * for the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_register_cb(struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+	int ret;
+
+	mlx5_mp_id_init(&mp_id, 0);
+	ret = mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp, &mp_id);
+	if (ret < 0 && rte_errno != EEXIST)
+		DRV_LOG(ERR, "Failed to register existing mempool %s for PD %p: %s",
+			mp->name, sh->pd, rte_strerror(rte_errno));
+}
+
+/**
+ * rte_mempool_walk() callback to unregister mempools
+ * from the protection domain.
+ *
+ * @param mp
+ *   The mempool being walked.
+ * @param arg
+ *   Pointer to the device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_unregister_cb(struct rte_mempool *mp, void *arg)
+{
+	mlx5_dev_ctx_shared_mempool_unregister
+				((struct mlx5_dev_ctx_shared *)arg, mp);
+}
+
+/**
+ * Mempool life cycle callback for Ethernet devices.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   Associated mempool.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_mempool_event_cb(enum rte_mempool_event event,
+				     struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_mp_id mp_id;
+
+	switch (event) {
+	case RTE_MEMPOOL_EVENT_READY:
+		mlx5_mp_id_init(&mp_id, 0);
+		if (mlx5_mr_mempool_register(&sh->share_cache, sh->pd, mp,
+					     &mp_id) < 0)
+			DRV_LOG(ERR, "Failed to register new mempool %s for PD %p: %s",
+				mp->name, sh->pd, rte_strerror(rte_errno));
+		break;
+	case RTE_MEMPOOL_EVENT_DESTROY:
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+		break;
+	}
+}
+
+/**
+ * Callback used when implicit mempool registration is disabled
+ * in order to track Rx mempool destruction.
+ *
+ * @param event
+ *   Mempool life cycle event.
+ * @param mp
+ *   An Rx mempool registered explicitly when the port is started.
+ * @param arg
+ *   Pointer to a device shared context.
+ */
+static void
+mlx5_dev_ctx_shared_rx_mempool_event_cb(enum rte_mempool_event event,
+					struct rte_mempool *mp, void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+
+	if (event == RTE_MEMPOOL_EVENT_DESTROY)
+		mlx5_dev_ctx_shared_mempool_unregister(sh, mp);
+}
+
+int
+mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = priv->sh;
+	int ret;
+
+	/* Check if we only need to track Rx mempool destruction. */
+	if (!priv->config.mr_mempool_reg_en) {
+		ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+		return ret == 0 || rte_errno == EEXIST ? 0 : ret;
+	}
+	/* Callback for this shared context may be already registered. */
+	ret = rte_mempool_event_callback_register
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret != 0 && rte_errno != EEXIST)
+		return ret;
+	/* Register mempools only once for this shared context. */
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_register_cb, sh);
+	return 0;
+}
+
 /**
  * Allocate shared device context. If there is multiport device the
  * master and representors will share this context, if there is single
@@ -1287,6 +1425,8 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 void
 mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 {
+	int ret;
+
 	pthread_mutex_lock(&mlx5_dev_ctx_list_mutex);
 #ifdef RTE_LIBRTE_MLX5_DEBUG
 	/* Check the object presence in the list. */
@@ -1307,6 +1447,15 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	if (--sh->refcnt)
 		goto exit;
+	/* Stop watching for mempool events and unregister all mempools. */
+	ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_mempool_event_cb, sh);
+	if (ret < 0 && rte_errno == ENOENT)
+		ret = rte_mempool_event_callback_unregister
+				(mlx5_dev_ctx_shared_rx_mempool_event_cb, sh);
+	if (ret == 0)
+		rte_mempool_walk(mlx5_dev_ctx_shared_mempool_unregister_cb,
+				 sh);
 	/* Remove from memory callback device list. */
 	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
 	LIST_REMOVE(sh, mem_event_cb);
@@ -1997,6 +2146,8 @@ mlx5_args_check(const char *key, const char *val, void *opaque)
 		config->decap_en = !!tmp;
 	} else if (strcmp(MLX5_ALLOW_DUPLICATE_PATTERN, key) == 0) {
 		config->allow_duplicate_pattern = !!tmp;
+	} else if (strcmp(MLX5_MR_MEMPOOL_REG_EN, key) == 0) {
+		config->mr_mempool_reg_en = !!tmp;
 	} else {
 		DRV_LOG(WARNING, "%s: unknown parameter", key);
 		rte_errno = EINVAL;
@@ -2058,6 +2209,7 @@ mlx5_args(struct mlx5_dev_config *config, struct rte_devargs *devargs)
 		MLX5_SYS_MEM_EN,
 		MLX5_DECAP_EN,
 		MLX5_ALLOW_DUPLICATE_PATTERN,
+		MLX5_MR_MEMPOOL_REG_EN,
 		NULL,
 	};
 	struct rte_kvargs *kvlist;
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 3581414b78..fe533fcc81 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -155,6 +155,13 @@ struct mlx5_flow_dump_ack {
 /** Key string for IPC. */
 #define MLX5_MP_NAME "net_mlx5_mp"
 
+/** Initialize a multi-process ID. */
+static inline void
+mlx5_mp_id_init(struct mlx5_mp_id *mp_id, uint16_t port_id)
+{
+	mp_id->port_id = port_id;
+	strlcpy(mp_id->name, MLX5_MP_NAME, RTE_MP_MAX_NAME_LEN);
+}
 
 LIST_HEAD(mlx5_dev_list, mlx5_dev_ctx_shared);
 
@@ -270,6 +277,8 @@ struct mlx5_dev_config {
 	unsigned int dv_miss_info:1; /* restore packet after partial hw miss */
 	unsigned int allow_duplicate_pattern:1;
 	/* Allow/Prevent the duplicate rules pattern. */
+	unsigned int mr_mempool_reg_en:1;
+	/* Allow/prevent implicit mempool memory registration. */
 	struct {
 		unsigned int enabled:1; /* Whether MPRQ is enabled. */
 		unsigned int stride_num_n; /* Number of strides. */
@@ -1498,6 +1507,7 @@ struct mlx5_dev_ctx_shared *
 mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 			   const struct mlx5_dev_config *config);
 void mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh);
+int mlx5_dev_ctx_shared_mempool_subscribe(struct rte_eth_dev *dev);
 void mlx5_free_table_hash_list(struct mlx5_priv *priv);
 int mlx5_alloc_table_hash_list(struct mlx5_priv *priv);
 void mlx5_set_min_inline(struct mlx5_dev_spawn_data *spawn,
diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index 44afda731f..55d27b50b9 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -65,30 +65,6 @@ mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 	}
 }
 
-/**
- * Bottom-half of LKey search on Rx.
- *
- * @param rxq
- *   Pointer to Rx queue structure.
- * @param addr
- *   Search key.
- *
- * @return
- *   Searched LKey on success, UINT32_MAX on no match.
- */
-uint32_t
-mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
-{
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
-	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
-	struct mlx5_priv *priv = rxq_ctrl->priv;
-
-	return mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, mr_ctrl, addr,
-				  priv->config.mr_ext_memseg_en);
-}
-
 /**
  * Bottom-half of LKey search on Tx.
  *
@@ -128,9 +104,36 @@ mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
 uint32_t
 mlx5_tx_mb2mr_bh(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
 {
+	struct mlx5_txq_ctrl *txq_ctrl =
+		container_of(txq, struct mlx5_txq_ctrl, txq);
+	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
+	struct mlx5_priv *priv = txq_ctrl->priv;
 	uintptr_t addr = (uintptr_t)mb->buf_addr;
 	uint32_t lkey;
 
+	if (priv->config.mr_mempool_reg_en) {
+		struct rte_mempool *mp = NULL;
+		struct mlx5_mprq_buf *buf;
+
+		if (!RTE_MBUF_HAS_EXTBUF(mb)) {
+			mp = mlx5_mb2mp(mb);
+		} else if (mb->shinfo->free_cb == mlx5_mprq_buf_free_cb) {
+			/* Recover MPRQ mempool. */
+			buf = mb->shinfo->fcb_opaque;
+			mp = buf->mp;
+		}
+		if (mp != NULL) {
+			lkey = mlx5_mr_mempool2mr_bh(&priv->sh->share_cache,
+						     mr_ctrl, mp, addr);
+			/*
+			 * Lookup can only fail on invalid input, e.g. "addr"
+			 * is not from "mp" or "mp" has MEMPOOL_F_NON_IO set.
+			 */
+			if (lkey != UINT32_MAX)
+				return lkey;
+		}
+		/* Fallback for generic mechanism in corner cases. */
+	}
 	lkey = mlx5_tx_addr2mr_bh(txq, addr);
 	if (lkey == UINT32_MAX && rte_errno == ENXIO) {
 		/* Mempool may have externally allocated memory. */
@@ -392,72 +395,3 @@ mlx5_tx_update_ext_mp(struct mlx5_txq_data *txq, uintptr_t addr,
 	mlx5_mr_update_ext_mp(ETH_DEV(priv), mr_ctrl, mp);
 	return mlx5_tx_addr2mr_bh(txq, addr);
 }
-
-/* Called during rte_mempool_mem_iter() by mlx5_mr_update_mp(). */
-static void
-mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void *opaque,
-		     struct rte_mempool_memhdr *memhdr,
-		     unsigned mem_idx __rte_unused)
-{
-	struct mr_update_mp_data *data = opaque;
-	struct rte_eth_dev *dev = data->dev;
-	struct mlx5_priv *priv = dev->data->dev_private;
-
-	uint32_t lkey;
-
-	/* Stop iteration if failed in the previous walk. */
-	if (data->ret < 0)
-		return;
-	/* Register address of the chunk and update local caches. */
-	lkey = mlx5_mr_addr2mr_bh(priv->sh->pd, &priv->mp_id,
-				  &priv->sh->share_cache, data->mr_ctrl,
-				  (uintptr_t)memhdr->addr,
-				  priv->config.mr_ext_memseg_en);
-	if (lkey == UINT32_MAX)
-		data->ret = -1;
-}
-
-/**
- * Register entire memory chunks in a Mempool.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @param mr_ctrl
- *   Pointer to per-queue MR control structure.
- * @param mp
- *   Pointer to registering Mempool.
- *
- * @return
- *   0 on success, -1 on failure.
- */
-int
-mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		  struct rte_mempool *mp)
-{
-	struct mr_update_mp_data data = {
-		.dev = dev,
-		.mr_ctrl = mr_ctrl,
-		.ret = 0,
-	};
-	uint32_t flags = rte_pktmbuf_priv_flags(mp);
-
-	if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
-		/*
-		 * The pinned external buffer should be registered for DMA
-		 * operations by application. The mem_list of the pool contains
-		 * the list of chunks with mbuf structures w/o built-in data
-		 * buffers and DMA actually does not happen there, no need
-		 * to create MR for these chunks.
-		 */
-		return 0;
-	}
-	DRV_LOG(DEBUG, "Port %u Rx queue registering mp %s "
-		       "having %u chunks.", dev->data->port_id,
-		       mp->name, mp->nb_mem_chunks);
-	rte_mempool_mem_iter(mp, mlx5_mr_update_mp_cb, &data);
-	if (data.ret < 0 && rte_errno == ENXIO) {
-		/* Mempool may have externally allocated memory. */
-		return mlx5_mr_update_ext_mp(dev, mr_ctrl, mp);
-	}
-	return data.ret;
-}
diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
index 4a7fab6df2..c984e777b5 100644
--- a/drivers/net/mlx5/mlx5_mr.h
+++ b/drivers/net/mlx5/mlx5_mr.h
@@ -22,7 +22,5 @@
 
 void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
 			  size_t len, void *arg);
-int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
-		      struct rte_mempool *mp);
 
 #endif /* RTE_PMD_MLX5_MR_H_ */
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 2b7ad3e48b..1b00076fe7 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -275,13 +275,11 @@ uint16_t mlx5_rx_burst_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 				uint16_t pkts_n);
 
-/* mlx5_mr.c */
-
-uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
+static int mlx5_rxq_mprq_enabled(struct mlx5_rxq_data *rxq);
 
 /**
- * Query LKey from a packet buffer for Rx. No need to flush local caches for Rx
- * as mempool is pre-configured and static.
+ * Query LKey from a packet buffer for Rx. No need to flush local caches
+ * as the Rx mempool database entries are valid for the lifetime of the queue.
  *
  * @param rxq
  *   Pointer to Rx queue structure.
@@ -290,11 +288,14 @@ uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
  *
  * @return
  *   Searched LKey on success, UINT32_MAX on no match.
+ *   This function always succeeds on valid input.
  */
 static __rte_always_inline uint32_t
 mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 {
 	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
+	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct rte_mempool *mp;
 	uint32_t lkey;
 
 	/* Linear search on MR cache array. */
@@ -302,8 +303,14 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 				   MLX5_MR_CACHE_N, addr);
 	if (likely(lkey != UINT32_MAX))
 		return lkey;
-	/* Take slower bottom-half (Binary Search) on miss. */
-	return mlx5_rx_addr2mr_bh(rxq, addr);
+	/*
+	 * Slower search in the mempool database on miss.
+	 * During queue creation rxq->sh is not yet set, so we use rxq_ctrl.
+	 */
+	rxq_ctrl = container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	mp = mlx5_rxq_mprq_enabled(rxq) ? rxq->mprq_mp : rxq->mp;
+	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->priv->sh->share_cache,
+				     mr_ctrl, mp, addr);
 }
 
 #define mlx5_rx_mb2mr(rxq, mb) mlx5_rx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index b68443bed5..247f36e5d7 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1162,6 +1162,7 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 	unsigned int strd_sz_n = 0;
 	unsigned int i;
 	unsigned int n_ibv = 0;
+	int ret;
 
 	if (!mlx5_mprq_enabled(dev))
 		return 0;
@@ -1241,6 +1242,16 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
+	ret = mlx5_mr_mempool_register(&priv->sh->share_cache, priv->sh->pd,
+				       mp, &priv->mp_id);
+	if (ret < 0 && rte_errno != EEXIST) {
+		ret = rte_errno;
+		DRV_LOG(ERR, "port %u failed to register a mempool for Multi-Packet RQ",
+			dev->data->port_id);
+		rte_mempool_free(mp);
+		rte_errno = ret;
+		return -rte_errno;
+	}
 	priv->mprq_mp = mp;
 exit:
 	/* Set mempool for each Rx queue. */
@@ -1443,6 +1454,8 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		/* rte_errno is already set. */
 		goto error;
 	}
+	/* Rx queues don't use this pointer, but we want a valid structure. */
+	tmpl->rxq.mr_ctrl.dev_gen_ptr = &priv->sh->share_cache.dev_gen;
 	tmpl->socket = socket;
 	if (dev->data->dev_conf.intr_conf.rxq)
 		tmpl->irq = 1;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 54173bfacb..3cbf5816a1 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -105,6 +105,59 @@ mlx5_txq_start(struct rte_eth_dev *dev)
 	return -rte_errno;
 }
 
+/**
+ * Translate the chunk address to an MR key in order to put it into the cache.
+ */
+static void
+mlx5_rxq_mempool_register_cb(struct rte_mempool *mp, void *opaque,
+			     struct rte_mempool_memhdr *memhdr,
+			     unsigned int idx)
+{
+	struct mlx5_rxq_data *rxq = opaque;
+
+	RTE_SET_USED(mp);
+	RTE_SET_USED(idx);
+	mlx5_rx_addr2mr(rxq, (uintptr_t)memhdr->addr);
+}
+
+/**
+ * Register Rx queue mempools and fill the Rx queue cache.
+ * This function tolerates repeated mempool registration.
+ *
+ * @param[in] rxq_ctrl
+ *   Rx queue control data.
+ *
+ * @return
+ *   0 on success, (-1) on failure and rte_errno is set.
+ */
+static int
+mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
+{
+	struct mlx5_priv *priv = rxq_ctrl->priv;
+	struct rte_mempool *mp;
+	uint32_t s;
+	int ret = 0;
+
+	mlx5_mr_flush_local_cache(&rxq_ctrl->rxq.mr_ctrl);
+	/* MPRQ mempool is registered on creation, just fill the cache. */
+	if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
+		rte_mempool_mem_iter(rxq_ctrl->rxq.mprq_mp,
+				     mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+		return 0;
+	}
+	for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++) {
+		mp = rxq_ctrl->rxq.rxseg[s].mp;
+		ret = mlx5_mr_mempool_register(&priv->sh->share_cache,
+					       priv->sh->pd, mp, &priv->mp_id);
+		if (ret < 0 && rte_errno != EEXIST)
+			return ret;
+		rte_mempool_mem_iter(mp, mlx5_rxq_mempool_register_cb,
+				     &rxq_ctrl->rxq);
+	}
+	return 0;
+}
+
 /**
  * Stop traffic on Rx queues.
  *
@@ -152,18 +205,13 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (!rxq_ctrl)
 			continue;
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-			/* Pre-register Rx mempools. */
-			if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq)) {
-				mlx5_mr_update_mp(dev, &rxq_ctrl->rxq.mr_ctrl,
-						  rxq_ctrl->rxq.mprq_mp);
-			} else {
-				uint32_t s;
-
-				for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++)
-					mlx5_mr_update_mp
-						(dev, &rxq_ctrl->rxq.mr_ctrl,
-						rxq_ctrl->rxq.rxseg[s].mp);
-			}
+			/*
+			 * Pre-register the mempools. Regardless of whether
+			 * the implicit registration is enabled or not,
+			 * Rx mempool destruction is tracked to free MRs.
+			 */
+			if (mlx5_rxq_mempool_register(rxq_ctrl) < 0)
+				goto error;
 			ret = rxq_alloc_elts(rxq_ctrl);
 			if (ret)
 				goto error;
@@ -1124,6 +1172,11 @@ mlx5_dev_start(struct rte_eth_dev *dev)
 			dev->data->port_id, strerror(rte_errno));
 		goto error;
 	}
+	if (mlx5_dev_ctx_shared_mempool_subscribe(dev) != 0) {
+		DRV_LOG(ERR, "port %u failed to subscribe for mempool life cycle: %s",
+			dev->data->port_id, rte_strerror(rte_errno));
+		goto error;
+	}
 	rte_wmb();
 	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
 	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
diff --git a/drivers/net/mlx5/windows/mlx5_os.c b/drivers/net/mlx5/windows/mlx5_os.c
index 26fa927039..149253d174 100644
--- a/drivers/net/mlx5/windows/mlx5_os.c
+++ b/drivers/net/mlx5/windows/mlx5_os.c
@@ -1116,6 +1116,7 @@ mlx5_os_net_probe(struct rte_device *dev)
 	dev_config.txqs_inline = MLX5_ARG_UNSET;
 	dev_config.vf_nl_en = 0;
 	dev_config.mr_ext_memseg_en = 1;
+	dev_config.mr_mempool_reg_en = 1;
 	dev_config.mprq.max_memcpy_len = MLX5_MPRQ_MEMCPY_DEFAULT_LEN;
 	dev_config.mprq.min_rxqs_num = MLX5_MPRQ_MIN_RXQS;
 	dev_config.dv_esw_en = 0;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag
  2021-10-15 13:43             ` Olivier Matz
@ 2021-10-19 13:08               ` Dmitry Kozlyuk
  0 siblings, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-19 13:08 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, Andrew Rybchenko, Matan Azrad

> -----Original Message-----
> From: Olivier Matz <olivier.matz@6wind.com>
> Sent: 15 October 2021 16:43
> To: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Cc: dev@dpdk.org; Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Matan
> Azrad <matan@nvidia.com>
> Subject: Re: [PATCH v4 2/4] mempool: add non-IO flag
> 
> On Fri, Oct 15, 2021 at 01:27:59PM +0000, Dmitry Kozlyuk wrote:
> > > [...]
> > > > +static int
> > > > +test_mempool_flag_non_io_set_when_no_iova_contig_set(void)
> > > > +{
> > > > +     struct rte_mempool *mp;
> > > > +     int ret;
> > > > +
> > > > +     mp = rte_mempool_create_empty("empty", MEMPOOL_SIZE,
> > > > +                                   MEMPOOL_ELT_SIZE, 0, 0,
> > > > +                                   SOCKET_ID_ANY,
> > > MEMPOOL_F_NO_IOVA_CONTIG);
> > > > +     RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
> > > > +                              rte_strerror(rte_errno));
> > > > +     rte_mempool_set_ops_byname(mp, rte_mbuf_best_mempool_ops(),
> NULL);
> > > > +     ret = rte_mempool_populate_default(mp);
> > > > +     RTE_TEST_ASSERT(ret > 0, "Failed to populate mempool: %s",
> > > > +                     rte_strerror(rte_errno));
> > > > +     RTE_TEST_ASSERT(mp->flags & MEMPOOL_F_NON_IO,
> > > > +                     "NON_IO flag is not set when NO_IOVA_CONTIG is
> > > set");
> > > > +     rte_mempool_free(mp);
> > > > +     return 0;
> > > > +}
> > >
> > > One comment that also applies to the previous patch. Using
> > > RTE_TEST_ASSERT_*() is convenient to test a condition, display an
> error
> > > message and return on error in one operation. But here it can cause a
> > > leak on test failure.
> > >
> > > I don't know what is the best approach to solve the issue. Having
> > > equivalent test macros that do "goto fail" instead of "return -1"
> would
> > > help here. I mean something like:
> > >   RTE_TEST_ASSERT_GOTO_*(cond, label, fmt, ...)
> > >
> > > What do you think?
> >
> > This can work with existing macros:
> >
> >       #define TEST_TRACE_FAILURE(...) goto fail
> >
> > Because of "trace" in the name it looks a bit like a hijacking.
> > Probably the macro should be named TEST_HANDLE_FAILURE
> > to suggest broader usage than just tracing,
> > but for now this looks like the neatest way.
> 
> That would work for me.

Did so in v9.

> What about introducing another macro for this usage, that would
> be "return -1" by default and that can be overridden?

I like this suggestion by itself.
While implementing the solution with RTE_TEST_TRACE_FAILURE()
I didn't like the hassle with #ifdef/#pragma push/pop_macro.
At least some of that could be hidden; I need to play with the macros
before suggesting something clean.
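
For reference, below is a minimal sketch of the override pattern referenced
above ("Did so in v9"); the actual v9 code may differ, and the pool
parameters and test body are purely illustrative, only the macro
redefinition matters:

#include <rte_errno.h>
#include <rte_mempool.h>
#include <rte_test.h>

/* Redirect assertion failures to a cleanup label instead of returning,
 * so objects allocated before a failed check are still freed.
 */
#pragma push_macro("RTE_TEST_TRACE_FAILURE")
#undef RTE_TEST_TRACE_FAILURE
#define RTE_TEST_TRACE_FAILURE(...) do { goto fail; } while (0)

static int
test_with_cleanup(void)
{
	struct rte_mempool *mp;

	mp = rte_mempool_create_empty("sketch", 64, 128, 0, 0,
				      SOCKET_ID_ANY, 0);
	RTE_TEST_ASSERT_NOT_NULL(mp, "Cannot create mempool: %s",
				 rte_strerror(rte_errno));
	/* ... further RTE_TEST_ASSERT_*() checks may jump to "fail" ... */
	rte_mempool_free(mp);
	return 0;
fail:
	rte_mempool_free(mp); /* rte_mempool_free(NULL) is a no-op */
	return -1;
}

#pragma pop_macro("RTE_TEST_TRACE_FAILURE")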

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks
  2021-10-15  8:52         ` Andrew Rybchenko
  2021-10-15  9:13           ` Dmitry Kozlyuk
@ 2021-10-19 13:08           ` Dmitry Kozlyuk
  1 sibling, 0 replies; 82+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-19 13:08 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: Matan Azrad, Olivier Matz, Ray Kinsella, Anatoly Burakov

> > +/**
> > + * Mempool event type.
> > + * @internal
> 
> Shouldn't internal go first?
> 
> > + */
> > +enum rte_mempool_event {

It really should, but I had to keep it this way
because otherwise Doxygen fails on multiple systems:

[3110/3279] Generating doxygen with a custom command
FAILED: doc/api/html 
/root/dpdk/doc/api/generate_doxygen.sh doc/api/doxy-api.conf doc/api/html 	/root/dpdk/doc/api/doxy-html-custom.sh
/root/dpdk/lib/mempool/rte_mempool.h:1778: error: Member rte_mempool_event (enumeration) of file rte_mempool.h is not documented. (warning treated as error, aborting now)
/root/dpdk/doc/api/generate_doxygen.sh: line 12: 51733 Segmentation fault (core dumped) doxygen "${DOXYCONF}" > $OUT_FILE

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit mempool registration
  2021-10-18 22:43               ` [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit " Dmitry Kozlyuk
                                   ` (3 preceding siblings ...)
  2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
@ 2021-10-19 14:36                 ` Thomas Monjalon
  4 siblings, 0 replies; 82+ messages in thread
From: Thomas Monjalon @ 2021-10-19 14:36 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, olivier.matz, andrew.rybchenko

> Dmitry Kozlyuk (4):
>   mempool: add event callbacks
>   mempool: add non-IO flag
>   common/mlx5: add mempool registration facilities
>   net/mlx5: support mempool registration

Applied, thanks




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v5 1/4] mempool: add event callbacks
  2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 1/4] mempool: add event callbacks Dmitry Kozlyuk
@ 2021-10-20  9:29           ` Kinsella, Ray
  0 siblings, 0 replies; 82+ messages in thread
From: Kinsella, Ray @ 2021-10-20  9:29 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: David Marchand, Matan Azrad, Andrew Rybchenko, Olivier Matz



On 15/10/2021 17:02, Dmitry Kozlyuk wrote:
> Data path performance can benefit if the PMD knows which memory it will
> need to handle in advance, before the first mbuf is sent to the PMD.
> It is impractical, however, to consider all allocated memory for this
> purpose. Most often mbuf memory comes from mempools that can come and
> go. PMD can enumerate existing mempools on device start, but it also
> needs to track creation and destruction of mempools after the forwarding
> starts but before an mbuf from the new mempool is sent to the device.
> 
> Add an API to register callback for mempool life cycle events:
> * rte_mempool_event_callback_register()
> * rte_mempool_event_callback_unregister()
> Currently tracked events are:
> * RTE_MEMPOOL_EVENT_READY (after populating a mempool)
> * RTE_MEMPOOL_EVENT_DESTROY (before freeing a mempool)
> Provide a unit test for the new API.
> The new API is internal, because it is primarily demanded by PMDs that
> may need to deal with any mempools and do not control their creation,
> while an application, on the other hand, knows which mempools it creates
> and doesn't care about internal mempools PMDs might create.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>
> Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> ---

Acked-by: Ray Kinsella <mdr@ashroe.eu>
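
For illustration, a driver-side consumer of the internal API quoted above
could look roughly like the sketch below; "struct my_dev" and the two DMA
helpers are hypothetical placeholders, only the mempool event API calls
come from the patch:

#include <rte_errno.h>
#include <rte_mempool.h>

struct my_dev;                                                    /* hypothetical device context */
void my_dev_dma_map(struct my_dev *d, struct rte_mempool *mp);    /* hypothetical helper */
void my_dev_dma_unmap(struct my_dev *d, struct rte_mempool *mp);  /* hypothetical helper */

/* Map or unmap mempool memory as pools become ready or are destroyed. */
static void
my_dev_mempool_event_cb(enum rte_mempool_event event,
			struct rte_mempool *mp, void *user_data)
{
	struct my_dev *dev = user_data;

	if (event == RTE_MEMPOOL_EVENT_READY)
		my_dev_dma_map(dev, mp);
	else if (event == RTE_MEMPOOL_EVENT_DESTROY)
		my_dev_dma_unmap(dev, mp);
}

/* Called once on device start; EEXIST means the callback already exists. */
static int
my_dev_subscribe(struct my_dev *dev)
{
	int ret = rte_mempool_event_callback_register(my_dev_mempool_event_cb,
						      dev);

	return (ret == 0 || rte_errno == EEXIST) ? 0 : ret;
}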

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v5 3/4] common/mlx5: add mempool registration facilities
  2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
@ 2021-10-20  9:30           ` Kinsella, Ray
  0 siblings, 0 replies; 82+ messages in thread
From: Kinsella, Ray @ 2021-10-20  9:30 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev; +Cc: Matan Azrad, Viacheslav Ovsiienko, Anatoly Burakov



On 15/10/2021 17:02, Dmitry Kozlyuk wrote:
> Add internal API to register mempools, that is, to create memory
> regions (MR) for their memory and store them in a separate database.
> Implementation deals with multi-process, so that class drivers don't
> need to. Each protection domain has its own database. Memory regions
> can be shared within a database if they represent a single hugepage
> covering one or more mempools entirely.
> 
> Add internal API to lookup an MR key for an address that belongs
> to a known mempool. It is a responsibility of a class driver
> to extract the mempool from an mbuf.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>
> ---

Acked-by: Ray Kinsella <mdr@ashroe.eu>
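
As a rough illustration of how a class driver could consume the API
described above (a sketch only: include paths are indicative, the share
cache, PD, per-queue MR control and mbuf are assumed to come from the
driver's own context, and only the simple direct-mbuf case is shown; see
the net/mlx5 patch for MPRQ and external buffers):

#include <rte_errno.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#include "mlx5_common_mr.h"

/* Make a mempool usable for DMA in a protection domain.
 * EEXIST is not an error: the mempool is already registered.
 */
static int
example_register_pool(struct mlx5_mr_share_cache *share_cache, void *pd,
		      struct rte_mempool *mp, struct mlx5_mp_id *mp_id)
{
	if (mlx5_mr_mempool_register(share_cache, pd, mp, mp_id) < 0 &&
	    rte_errno != EEXIST)
		return -1;
	return 0;
}

/* Data-path lookup: resolve an lkey for a direct mbuf from a registered
 * mempool. UINT32_MAX means the address is not within the mempool
 * or the mempool has MEMPOOL_F_NON_IO set.
 */
static uint32_t
example_mb2lkey(struct mlx5_mr_share_cache *share_cache,
		struct mlx5_mr_ctrl *mr_ctrl, struct rte_mbuf *mb)
{
	return mlx5_mr_mempool2mr_bh(share_cache, mr_ctrl, mb->pool,
				     (uintptr_t)mb->buf_addr);
}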

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dpdk-dev] [PATCH v8 2/4] mempool: add non-IO flag
  2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 2/4] mempool: add non-IO flag Dmitry Kozlyuk
@ 2021-10-29  3:30                 ` Jiang, YuX
  0 siblings, 0 replies; 82+ messages in thread
From: Jiang, YuX @ 2021-10-29  3:30 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: David Marchand, Matan Azrad, Andrew Rybchenko, Tahhan, Maryam,
	Pattan, Reshma, Olivier Matz

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Dmitry Kozlyuk
> Sent: Monday, October 18, 2021 10:41 PM
> To: dev@dpdk.org
> Cc: David Marchand <david.marchand@redhat.com>; Matan Azrad
> <matan@oss.nvidia.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; Tahhan, Maryam
> <maryam.tahhan@intel.com>; Pattan, Reshma <reshma.pattan@intel.com>;
> Olivier Matz <olivier.matz@6wind.com>
> Subject: [dpdk-dev] [PATCH v8 2/4] mempool: add non-IO flag
> 
> Mempool is a generic allocator that is not necessarily used for device IO
> operations and its memory for DMA.
> Add MEMPOOL_F_NON_IO flag to mark such mempools automatically
> a) if their objects are not contiguous;
> b) if IOVA is not available for any object.
> Other components can inspect this flag
> in order to optimize their memory management.
> 
> Discussion: https://mails.dpdk.org/archives/dev/2021-August/216654.html
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Matan Azrad <matan@nvidia.com>
> Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> ---
>  app/proc-info/main.c                   |   6 +-
>  app/test/test_mempool.c                | 114 +++++++++++++++++++++++++
>  doc/guides/rel_notes/release_21_11.rst |   3 +
>  lib/mempool/rte_mempool.c              |  10 +++
>  lib/mempool/rte_mempool.h              |   2 +
>  5 files changed, 133 insertions(+), 2 deletions(-)
> 
Hi Dmitry,

We hit an issue with this patch: mempool_autotest fails on FreeBSD 13.0 (bug ID: https://bugs.dpdk.org/show_bug.cgi?id=863). Could you please have a look?
Steps to reproduce:
2. Launch the test app:
# ./x86_64-native-bsdapp-gcc/app/test/dpdk-test  -n 4 -c f
3. Execute the DPDK test command:
# mempool_autotest
common_pool_count=1598
no statistics available
EAL: Test assert test_mempool_flag_non_io_unset_when_populated_with_valid_iova line 781 failed: Cannot get IOVA
test failed at test_mempool():1030
Test Failed

^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2021-10-29  3:30 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-18  9:07 [dpdk-dev] [PATCH 0/4] net/mlx5: implicit mempool registration Dmitry Kozlyuk
2021-08-18  9:07 ` [dpdk-dev] [PATCH 1/4] mempool: add event callbacks Dmitry Kozlyuk
2021-10-12  3:12   ` Jerin Jacob
2021-08-18  9:07 ` [dpdk-dev] [PATCH 2/4] mempool: add non-IO flag Dmitry Kozlyuk
2021-08-18  9:07 ` [dpdk-dev] [PATCH 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
2021-08-18  9:07 ` [dpdk-dev] [PATCH 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
2021-09-29 14:52 ` [dpdk-dev] [PATCH 0/4] net/mlx5: implicit " dkozlyuk
2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 1/4] mempool: add event callbacks dkozlyuk
2021-10-05 16:34     ` Thomas Monjalon
2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 2/4] mempool: add non-IO flag dkozlyuk
2021-10-05 16:39     ` Thomas Monjalon
2021-10-12  6:06       ` Andrew Rybchenko
2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 3/4] common/mlx5: add mempool registration facilities dkozlyuk
2021-09-29 14:52   ` [dpdk-dev] [PATCH v2 4/4] net/mlx5: support mempool registration dkozlyuk
2021-10-12  0:04   ` [dpdk-dev] [PATCH v3 0/4] net/mlx5: implicit " Dmitry Kozlyuk
2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 1/4] mempool: add event callbacks Dmitry Kozlyuk
2021-10-12  6:33       ` Andrew Rybchenko
2021-10-12  9:37         ` Dmitry Kozlyuk
2021-10-12  9:46           ` Andrew Rybchenko
2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 2/4] mempool: add non-IO flag Dmitry Kozlyuk
2021-10-12  3:37       ` Jerin Jacob
2021-10-12  6:42       ` Andrew Rybchenko
2021-10-12 12:40         ` Dmitry Kozlyuk
2021-10-12 12:53           ` Andrew Rybchenko
2021-10-12 13:11             ` Dmitry Kozlyuk
2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
2021-10-12  0:04     ` [dpdk-dev] [PATCH v3 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
2021-10-13 11:01     ` [dpdk-dev] [PATCH v4 0/4] net/mlx5: implicit " Dmitry Kozlyuk
2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 1/4] mempool: add event callbacks Dmitry Kozlyuk
2021-10-15  8:52         ` Andrew Rybchenko
2021-10-15  9:13           ` Dmitry Kozlyuk
2021-10-19 13:08           ` Dmitry Kozlyuk
2021-10-15 12:12         ` Olivier Matz
2021-10-15 13:07           ` Dmitry Kozlyuk
2021-10-15 13:40             ` Olivier Matz
2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 2/4] mempool: add non-IO flag Dmitry Kozlyuk
2021-10-15  9:01         ` Andrew Rybchenko
2021-10-15  9:18           ` Dmitry Kozlyuk
2021-10-15  9:33             ` Andrew Rybchenko
2021-10-15  9:38               ` Dmitry Kozlyuk
2021-10-15  9:43               ` Olivier Matz
2021-10-15  9:58                 ` Dmitry Kozlyuk
2021-10-15 12:11                   ` Olivier Matz
2021-10-15  9:25         ` David Marchand
2021-10-15 10:42           ` Dmitry Kozlyuk
2021-10-15 11:41             ` David Marchand
2021-10-15 12:13               ` Olivier Matz
2021-10-15 13:19         ` Olivier Matz
2021-10-15 13:27           ` Dmitry Kozlyuk
2021-10-15 13:43             ` Olivier Matz
2021-10-19 13:08               ` Dmitry Kozlyuk
2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
2021-10-13 11:01       ` [dpdk-dev] [PATCH v4 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
2021-10-15 16:02       ` [dpdk-dev] [PATCH v5 0/4] net/mlx5: implicit " Dmitry Kozlyuk
2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 1/4] mempool: add event callbacks Dmitry Kozlyuk
2021-10-20  9:29           ` Kinsella, Ray
2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 2/4] mempool: add non-IO flag Dmitry Kozlyuk
2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
2021-10-20  9:30           ` Kinsella, Ray
2021-10-15 16:02         ` [dpdk-dev] [PATCH v5 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
2021-10-16 20:00         ` [dpdk-dev] [PATCH v6 0/4] net/mlx5: implicit " Dmitry Kozlyuk
2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 1/4] mempool: add event callbacks Dmitry Kozlyuk
2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 2/4] mempool: add non-IO flag Dmitry Kozlyuk
2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
2021-10-16 20:00           ` [dpdk-dev] [PATCH v6 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
2021-10-18 10:01           ` [dpdk-dev] [PATCH v7 0/4] net/mlx5: implicit " Dmitry Kozlyuk
2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 1/4] mempool: add event callbacks Dmitry Kozlyuk
2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 2/4] mempool: add non-IO flag Dmitry Kozlyuk
2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
2021-10-18 10:01             ` [dpdk-dev] [PATCH v7 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
2021-10-18 14:40             ` [dpdk-dev] [PATCH v8 0/4] net/mlx5: implicit " Dmitry Kozlyuk
2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 1/4] mempool: add event callbacks Dmitry Kozlyuk
2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 2/4] mempool: add non-IO flag Dmitry Kozlyuk
2021-10-29  3:30                 ` Jiang, YuX
2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
2021-10-18 14:40               ` [dpdk-dev] [PATCH v8 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
2021-10-18 22:43               ` [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit " Dmitry Kozlyuk
2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 1/4] mempool: add event callbacks Dmitry Kozlyuk
2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 2/4] mempool: add non-IO flag Dmitry Kozlyuk
2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 3/4] common/mlx5: add mempool registration facilities Dmitry Kozlyuk
2021-10-18 22:43                 ` [dpdk-dev] [PATCH v9 4/4] net/mlx5: support mempool registration Dmitry Kozlyuk
2021-10-19 14:36                 ` [dpdk-dev] [PATCH v9 0/4] net/mlx5: implicit " Thomas Monjalon
