From: Shahaf Shuler <shahafs@mellanox.com>
To: Yongseok Koh <yskoh@mellanox.com>, Adrien Mazarguil
 <adrien.mazarguil@6wind.com>, Nélio Laranjeiro <nelio.laranjeiro@6wind.com>
CC: "dev@dpdk.org" <dev@dpdk.org>, Yongseok Koh <yskoh@mellanox.com>
Date: Sun, 6 May 2018 12:53:18 +0000
Message-ID: <DB7PR05MB44265B9995E1BE68EE73D9B5C3840@DB7PR05MB4426.eurprd05.prod.outlook.com>
References: <20180502231654.7596-1-yskoh@mellanox.com>
 <20180502231654.7596-4-yskoh@mellanox.com>
In-Reply-To: <20180502231654.7596-4-yskoh@mellanox.com>
Subject: Re: [dpdk-dev] [PATCH 3/5] net/mlx5: add new Memory Region support

Hi Koh,

Huge work. It takes (and will take) me some time to process.
In the meantime, please find some small comments.

As this design relies heavily on synchronization (cache flushes) between the
control thread and the data path threads, along with possible deadlocks from
the memory hotplug events, the documentation is critical. Otherwise future
work will introduce heavy bugs.
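To make sure we mean the same thing by that synchronization, my understanding
of the intended handshake is roughly the following (a simplified sketch based
on this patch, not the exact code; the data path check is my assumption of how
dev_gen/cur_gen are meant to be consumed):

    /* Control path, e.g. on a memory free event, under priv->mr.rwlock: */
    mr_rebuild_dev_cache(dev);
    ++priv->mr.dev_gen;  /* mark every per-queue cache as stale */
    rte_smp_wmb();       /* must be visible before the freed memory is reused */

    /* Data path, per queue, before trusting the local caches: */
    if (unlikely(mr_ctrl->cur_gen != *mr_ctrl->dev_gen_ptr))
            mlx5_mr_flush_local_cache(mr_ctrl); /* resets cur_gen from dev_gen_ptr */

This flow, and the places where it can deadlock, is what I would like to see
documented.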


Thursday, May 3, 2018 2:17 AM, Yongseok Koh:
> Subject: [dpdk-dev] [PATCH 3/5] net/mlx5: add new Memory Region support
>=20
> This is the new design of Memory Region (MR) for mlx PMD, in order to:
> - Accommodate the new memory hotplug model.
> - Support non-contiguous Mempool.

This commit log is missing a lot of details about the design that you did.
You must make it clear for every Mellanox PMD developer.

Just to make sure I understand all the details, we have (a sketch of how I
read the lookup order follows the list):
1. Cache (L0) per rxq/txq, of size MLX5_MR_CACHE_N; searched starting from
   the MRU entry, with a fallback to linear search.
2. B-tree (L1) per rxq/txq, of dynamic size; searched using binary search.
   This is what you refer to as the bottom half, right?
3. Global MR cache (L2) per device, of dynamic size (?).
4. List of all MRs (L3) per device.
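My reading of the per-packet lookup order, as a rough sketch only (using names
from this patch; mlx5_mr_addr2mr_bh() is static in mlx5_mr.c, and the actual
top-half helpers live in mlx5_rxtx.h, which is only partially quoted here):

    /* Hypothetical wrapper, for illustration only. */
    static uint32_t
    lkey_lookup_sketch(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
                       uintptr_t addr)
    {
            uint32_t lkey;

            /* L0: per-queue linear array of MLX5_MR_CACHE_N entries, MRU first. */
            lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
                                        MLX5_MR_CACHE_N, addr);
            if (likely(lkey != UINT32_MAX))
                    return lkey;
            /*
             * Bottom half: L1 per-queue B-tree (binary search), then L2 global
             * per-device cache under priv->mr.rwlock, then L3 full MR list
             * (linear walk) with MR creation on a complete miss.
             */
            return mlx5_mr_addr2mr_bh(dev, mr_ctrl, addr);
    }

If this is correct, please state it explicitly in the commit log and/or in a
comment in mlx5_mr.h.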

>=20
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
>  drivers/net/mlx5/mlx5.c          |   45 ++
>  drivers/net/mlx5/mlx5.h          |   22 +
>  drivers/net/mlx5/mlx5_defs.h     |    6 +
>  drivers/net/mlx5/mlx5_ethdev.c   |   16 +
>  drivers/net/mlx5/mlx5_mr.c       | 1194 ++++++++++++++++++++++++++++++++++++++
>  drivers/net/mlx5/mlx5_mr.h       |  121 ++++
>  drivers/net/mlx5/mlx5_rxq.c      |    8 +-
>  drivers/net/mlx5/mlx5_rxtx.c     |    3 +
>  drivers/net/mlx5/mlx5_rxtx.h     |   73 ++-
>  drivers/net/mlx5/mlx5_rxtx_vec.h |    6 +-
>  drivers/net/mlx5/mlx5_trigger.c  |   11 +
>  drivers/net/mlx5/mlx5_txq.c      |   11 +
>  12 files changed, 1508 insertions(+), 8 deletions(-)
>  create mode 100644 drivers/net/mlx5/mlx5_mr.h
>=20
> diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
> index 01d554758..2883f20af 100644
> --- a/drivers/net/mlx5/mlx5.c
> +++ b/drivers/net/mlx5/mlx5.c
> @@ -41,6 +41,7 @@
>  #include "mlx5_autoconf.h"
>  #include "mlx5_defs.h"
>  #include "mlx5_glue.h"
> +#include "mlx5_mr.h"
>=20
>  /* Device parameter to enable RX completion queue compression. */
>  #define MLX5_RXQ_CQE_COMP_EN "rxq_cqe_comp_en"
> @@ -84,10 +85,49 @@
>  #define MLX5DV_CONTEXT_FLAGS_CQE_128B_COMP (1 << 4)
>  #endif
>=20
> +static const char *MZ_MLX5_PMD_SHARED_DATA = "mlx5_pmd_shared_data";
> +
> +/* Shared memory between primary and secondary processes. */
> +struct mlx5_shared_data *mlx5_shared_data;
> +
> +/* Spinlock for mlx5_shared_data allocation. */
> +static rte_spinlock_t mlx5_shared_data_lock = RTE_SPINLOCK_INITIALIZER;
> +
>  /** Driver-specific log messages type. */
>  int mlx5_logtype;
>=20
>  /**
> + * Prepare shared data between primary and secondary process.
> + */
> +static void
> +mlx5_prepare_shared_data(void)
> +{
> +	const struct rte_memzone *mz;
> +
> +	rte_spinlock_lock(&mlx5_shared_data_lock);
> +	if (mlx5_shared_data == NULL) {
> +		if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> +			/* Allocate shared memory. */
> +			mz = rte_memzone_reserve(MZ_MLX5_PMD_SHARED_DATA,
> +						 sizeof(*mlx5_shared_data),
> +						 SOCKET_ID_ANY, 0);
> +		} else {
> +			/* Lookup allocated shared memory. */
> +			mz = rte_memzone_lookup(MZ_MLX5_PMD_SHARED_DATA);
> +		}
> +		if (mz == NULL)
> +			rte_panic("Cannot allocate mlx5 shared data\n");
> +		mlx5_shared_data = mz->addr;
> +		/* Initialize shared data. */
> +		if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> +			LIST_INIT(&mlx5_shared_data->mem_event_cb_list);
> +			rte_rwlock_init(&mlx5_shared_data->mem_event_rwlock);
> +		}
> +	}
> +	rte_spinlock_unlock(&mlx5_shared_data_lock);
> +}
> +

Can you elaborate why mlx5_shared_data can't be part of priv?
priv is already allocated in shared memory, and the rte_eth_dev layer already
enforces its creation for the secondary process as part of the
rte_eth_dev_data allocation.

> +/**
>   * Retrieve integer value from environment variable.
>   *
>   * @param[in] name
> @@ -201,6 +241,7 @@ mlx5_dev_close(struct rte_eth_dev *dev)
>  		priv->txqs = NULL;
>  	}
>  	mlx5_flow_delete_drop_queue(dev);
> +	mlx5_mr_release(dev);
>  	if (priv->pd != NULL) {
>  		assert(priv->ctx != NULL);
>  		claim_zero(mlx5_glue->dealloc_pd(priv->pd));
> @@ -633,6 +674,8 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv
> __rte_unused,
>  	struct ibv_counter_set_description cs_desc;
>  #endif
>=20
> +	/* Prepare shared data between primary and secondary process. */
> +	mlx5_prepare_shared_data();
>  	assert(pci_drv == &mlx5_driver);
>  	/* Get mlx5_dev[] index. */
>  	idx = mlx5_dev_idx(&pci_dev->addr);
> @@ -1293,6 +1336,8 @@ rte_mlx5_pmd_init(void)
>  	}
>  	mlx5_glue->fork_init();
>  	rte_pci_register(&mlx5_driver);
> +	rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
> +					mlx5_mr_mem_event_cb);

mlx5_mr_mem_event_cb() requires the PMD private structure. Does registering
the callback at init time make sense? It looks like a better place is the PCI
probe, after the eth_dev allocation.
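Something along these lines (hypothetical placement, only to illustrate the
suggestion; the once-per-process guard is my addition):

    /* In mlx5_pci_probe(), after the eth_dev and priv have been allocated: */
    static int mem_event_cb_registered;

    if (!mem_event_cb_registered) {
            rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
                                            mlx5_mr_mem_event_cb);
            mem_event_cb_registered = 1;
    }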

>  }
>=20
>  RTE_PMD_EXPORT_NAME(net_mlx5, __COUNTER__);
> diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
> index 47d266c90..d3fc74dc1 100644
> --- a/drivers/net/mlx5/mlx5.h
> +++ b/drivers/net/mlx5/mlx5.h
> @@ -26,11 +26,13 @@
>  #include <rte_pci.h>
>  #include <rte_ether.h>
>  #include <rte_ethdev_driver.h>
> +#include <rte_rwlock.h>
>  #include <rte_interrupts.h>
>  #include <rte_errno.h>
>  #include <rte_flow.h>
>=20
>  #include "mlx5_utils.h"
> +#include "mlx5_mr.h"
>  #include "mlx5_rxtx.h"
>  #include "mlx5_autoconf.h"
>  #include "mlx5_defs.h"
> @@ -50,6 +52,16 @@ enum {
>  	PCI_DEVICE_ID_MELLANOX_CONNECTX5EXVF = 0x101a,
>  };
>=20
> +LIST_HEAD(mlx5_dev_list, priv);
> +
> +/* Shared memory between primary and secondary processes. */
> +struct mlx5_shared_data {
> +	struct mlx5_dev_list mem_event_cb_list;
> +	rte_rwlock_t mem_event_rwlock;
> +};
> +
> +extern struct mlx5_shared_data *mlx5_shared_data;
> +
>  struct mlx5_xstats_ctrl {
>  	/* Number of device stats. */
>  	uint16_t stats_n;
> @@ -119,7 +131,10 @@ struct mlx5_verbs_alloc_ctx {
>  	const void *obj; /* Pointer to the DPDK object. */
>  };
>=20
> +LIST_HEAD(mlx5_mr_list, mlx5_mr);
> +
>  struct priv {
> +	LIST_ENTRY(priv) mem_event_cb; /* Called by memory event callback. */
>  	struct rte_eth_dev_data *dev_data;  /* Pointer to device data. */
>  	struct ibv_context *ctx; /* Verbs context. */
>  	struct ibv_device_attr_ex device_attr; /* Device properties. */
> @@ -146,6 +161,13 @@ struct priv {
>  	struct mlx5_hrxq_drop *flow_drop_queue; /* Flow drop queue. */
>  	struct mlx5_flows flows; /* RTE Flow rules. */
>  	struct mlx5_flows ctrl_flows; /* Control flow rules. */
> +	struct {
> +		uint32_t dev_gen; /* Generation number to flush local caches. */
> +		rte_rwlock_t rwlock; /* MR Lock. */
> +		struct mlx5_mr_btree cache; /* Global MR cache table. */
> +		struct mlx5_mr_list mr_list; /* Registered MR list. */
> +		struct mlx5_mr_list mr_free_list; /* Freed MR list. */
> +	} mr;
>  	LIST_HEAD(rxq, mlx5_rxq_ctrl) rxqsctrl; /* DPDK Rx queues. */
>  	LIST_HEAD(rxqibv, mlx5_rxq_ibv) rxqsibv; /* Verbs Rx queues. */
>  	LIST_HEAD(hrxq, mlx5_hrxq) hrxqs; /* Verbs Hash Rx queues. */
> diff --git a/drivers/net/mlx5/mlx5_defs.h b/drivers/net/mlx5/mlx5_defs.h
> index f9093777d..72e80af26 100644
> --- a/drivers/net/mlx5/mlx5_defs.h
> +++ b/drivers/net/mlx5/mlx5_defs.h
> @@ -37,6 +37,12 @@
>   */
>  #define MLX5_TX_COMP_THRESH_INLINE_DIV (1 << 3)
>=20
> +/* Size of per-queue MR cache array for linear search. */
> +#define MLX5_MR_CACHE_N 8
> +
> +/* Size of MR cache table for binary search. */
> +#define MLX5_MR_BTREE_CACHE_N 256
> +
>  /*
>   * If defined, only use software counters. The PMD will never ask the
> hardware
>   * for these, and many of them won't be available.
> diff --git a/drivers/net/mlx5/mlx5_ethdev.c
> b/drivers/net/mlx5/mlx5_ethdev.c
> index 746b94f73..6bb43cf4e 100644
> --- a/drivers/net/mlx5/mlx5_ethdev.c
> +++ b/drivers/net/mlx5/mlx5_ethdev.c
> @@ -34,6 +34,7 @@
>  #include <rte_interrupts.h>
>  #include <rte_malloc.h>
>  #include <rte_string_fns.h>
> +#include <rte_rwlock.h>
>=20
>  #include "mlx5.h"
>  #include "mlx5_glue.h"
> @@ -413,6 +414,21 @@ mlx5_dev_configure(struct rte_eth_dev *dev)
>  		if (++j == rxqs_n)
>  			j = 0;
>  	}
> +	/*
> +	 * Once the device is added to the list of memory event callback, its
> +	 * global MR cache table cannot be expanded on the fly because of
> +	 * deadlock. If it overflows, lookup should be done by searching MR
> list
> +	 * linearly, which is slow.
> +	 */
> +	if (mlx5_mr_btree_init(&priv->mr.cache, MLX5_MR_BTREE_CACHE_N * 2,

Why multiply by 2? Because it holds all the rxq/txq MRs?

> +			       dev->device->numa_node)) {
> +		/* rte_errno is already set. */
> +		return -rte_errno;
> +	}
> +	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
> +	LIST_INSERT_HEAD(&mlx5_shared_data->mem_event_cb_list,
> +			 priv, mem_event_cb);
> +	rte_rwlock_write_unlock(&mlx5_shared_data->mem_event_rwlock);
>  	return 0;
>  }

Why is the registration done only on configure and not on probe, right after
the priv initialization?

>=20
> diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
> index 736c40ae4..e964912bb 100644
> --- a/drivers/net/mlx5/mlx5_mr.c
> +++ b/drivers/net/mlx5/mlx5_mr.c
> @@ -13,8 +13,1202 @@
>=20
>  #include <rte_mempool.h>
>  #include <rte_malloc.h>
> +#include <rte_rwlock.h>
>=20
>  #include "mlx5.h"
> +#include "mlx5_mr.h"
>  #include "mlx5_rxtx.h"
>  #include "mlx5_glue.h"
>=20
> +struct mr_find_contig_memsegs_data {
> +	uintptr_t addr;
> +	uintptr_t start;
> +	uintptr_t end;
> +	const struct rte_memseg_list *msl;
> +};
> +
> +struct mr_update_mp_data {
> +	struct rte_eth_dev *dev;
> +	struct mlx5_mr_ctrl *mr_ctrl;
> +	int ret;
> +};
> +
> +/**
> + * Expand B-tree table to a given size. Can't be called with holding
> + * memory_hotplug_lock or priv->mr.rwlock due to rte_realloc().
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param n
> + *   Number of entries for expansion.
> + *
> + * @return
> + *   0 on success, -1 on failure.
> + */
> +static int
> +mr_btree_expand(struct mlx5_mr_btree *bt, int n)
> +{
> +	void *mem;
> +	int ret =3D 0;
> +
> +	if (n <= bt->size)
> +		return ret;
> +	/*
> +	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
> +	 * used inside if there's no room to expand. Because this is a quite
> +	 * rare case and a part of very slow path, it is very acceptable.
> +	 * Initially cache_bh[] will be given practically enough space and once
> +	 * it is expanded, expansion wouldn't be needed again ever.
> +	 */
> +	mem = rte_realloc(bt->table, n * sizeof(struct mlx5_mr_cache), 0);
> +	if (mem == NULL) {
> +		/* Not an error, B-tree search will be skipped. */
> +		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
> +			(void *)bt);


DRV_LOG should have the port id of the device, for all of the DRV_LOG
instances in the patch.
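For example here, something like (hypothetical, assuming the port id can be
passed down to mr_btree_expand()):

    DRV_LOG(WARNING, "port %u failed to expand MR B-tree (%p) table",
            dev->data->port_id, (void *)bt);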

Per my understanding, it falls back to the old bt in case the expansion
failed, right? B-tree searches will still happen.

> +		ret = -1;
> +	} else {
> +		DRV_LOG(DEBUG, "expanded MR B-tree table (size=%u)", n);
> +		bt->table = mem;
> +		bt->size = n;
> +	}
> +	return ret;
> +}
> +
> +/**
> + * Look up LKey from given B-tree lookup table, store the last index and
> return
> + * searched LKey.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param[out] idx
> + *   Pointer to index. Even on searh failure, returns index where it stops

Typo: searh -> search.

> + *   searching so that index can be used when inserting a new entry.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static uint32_t
> +mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
> +{
> +	struct mlx5_mr_cache *lkp_tbl;
> +	uint16_t n;
> +	uint16_t base =3D 0;
> +
> +	assert(bt !=3D NULL);
> +	lkp_tbl =3D *bt->table;
> +	n =3D bt->len;
> +	/* First entry must be NULL for comparison. */
> +	assert(bt->len > 0 || (lkp_tbl[0].start =3D=3D 0 &&
> +			       lkp_tbl[0].lkey =3D=3D UINT32_MAX));
> +	/* Binary search. */
> +	do {
> +		register uint16_t delta =3D n >> 1;
> +
> +		if (addr < lkp_tbl[base + delta].start) {
> +			n =3D delta;
> +		} else {
> +			base +=3D delta;
> +			n -=3D delta;
> +		}
> +	} while (n > 1);
> +	assert(addr >=3D lkp_tbl[base].start);
> +	*idx =3D base;
> +	if (addr < lkp_tbl[base].end)
> +		return lkp_tbl[base].lkey;
> +	/* Not found. */
> +	return UINT32_MAX;
> +}
> +
> +/**
> + * Insert an entry to B-tree lookup table.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param entry
> + *   Pointer to new entry to insert.
> + *
> + * @return
> + *   0 on success, -1 on failure.
> + */
> +static int
> +mr_btree_insert(struct mlx5_mr_btree *bt, struct mlx5_mr_cache *entry)
> +{
> +	struct mlx5_mr_cache *lkp_tbl;
> +	uint16_t idx = 0;
> +	size_t shift;
> +
> +	assert(bt != NULL);
> +	assert(bt->len <= bt->size);
> +	assert(bt->len > 0);
> +	lkp_tbl = *bt->table;
> +	/* Find out the slot for insertion. */
> +	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
> +		DRV_LOG(DEBUG,
> +			"abort insertion to B-tree(%p):"
> +			" already exist at idx=%u [0x%lx, 0x%lx) lkey=0x%x",
> +			(void *)bt, idx, entry->start, entry->end, entry->lkey);
> +		/* Already exist, return. */
> +		return 0;
> +	}
> +	/* If table is full, return error. */
> +	if (unlikely(bt->len == bt->size)) {
> +		bt->overflow = 1;
> +		return -1;
> +	}
> +	/* Insert entry. */
> +	++idx;
> +	shift = (bt->len - idx) * sizeof(struct mlx5_mr_cache);
> +	if (shift)
> +		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
> +	lkp_tbl[idx] = *entry;
> +	bt->len++;
> +	DRV_LOG(DEBUG,
> +		"inserted B-tree(%p)[%u], [0x%lx, 0x%lx) lkey=0x%x",
> +		(void *)bt, idx, entry->start, entry->end, entry->lkey);
> +	return 0;
> +}

Can you elaborate on how you make sure the btree is always kept sorted by the
start address?

> +
> +/**
> + * Initialize B-tree and allocate memory for lookup table.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + * @param n
> + *   Number of entries to allocate.
> + * @param socket
> + *   NUMA socket on which memory must be allocated.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set=
.
> + */
> +int
> +mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket)
> +{
> +	if (bt =3D=3D NULL) {
> +		rte_errno =3D EINVAL;
> +		return -rte_errno;
> +	}
> +	memset(bt, 0, sizeof(*bt));
> +	bt->table =3D rte_calloc_socket("B-tree table",
> +				      n, sizeof(struct mlx5_mr_cache),
> +				      0, socket);
> +	if (bt->table =3D=3D NULL) {
> +		rte_errno =3D ENOMEM;
> +		DRV_LOG(ERR,
> +			"failed to allocate memory for btree cache on socket
> %d",
> +			socket);
> +		return -rte_errno;
> +	}
> +	bt->size =3D n;
> +	/* First entry must be NULL for binary search. */
> +	(*bt->table)[bt->len++] =3D (struct mlx5_mr_cache) {
> +		.lkey =3D UINT32_MAX,
> +	};
> +	DRV_LOG(DEBUG, "initialized B-tree %p with table %p",
> +		(void *)bt, (void *)bt->table);
> +	return 0;
> +}
> +
> +/**
> + * Free B-tree resources.
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + */
> +void
> +mlx5_mr_btree_free(struct mlx5_mr_btree *bt)
> +{
> +	if (bt =3D=3D NULL)
> +		return;
> +	DRV_LOG(DEBUG, "freeing B-tree %p with table %p",
> +		(void *)bt, (void *)bt->table);
> +	rte_free(bt->table);
> +	memset(bt, 0, sizeof(*bt));
> +}
> +
> +/**
> + * Dump all the entries in a B-tree
> + *
> + * @param bt
> + *   Pointer to B-tree structure.
> + */
> +static void
> +mlx5_mr_btree_dump(struct mlx5_mr_btree *bt)
> +{
> +	int idx;
> +	struct mlx5_mr_cache *lkp_tbl;
> +
> +	if (bt =3D=3D NULL)
> +		return;
> +	lkp_tbl =3D *bt->table;
> +	for (idx =3D 0; idx < bt->len; ++idx) {
> +		struct mlx5_mr_cache *entry =3D &lkp_tbl[idx];
> +
> +		DRV_LOG(DEBUG,
> +			"B-tree(%p)[%u], [0x%lx, 0x%lx) lkey=3D0x%x",
> +			(void *)bt, idx, entry->start, entry->end, entry-
> >lkey);
> +	}
> +}
> +
> +/**
> + * Find virtually contiguous memory chunk in a given MR.
> + *
> + * @param dev
> + *   Pointer to MR structure.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry. If not found, this will not be
> + *   updated.
> + * @param start_idx
> + *   Start index of the memseg bitmap.
> + *
> + * @return
> + *   Next index to go on lookup.
> + */
> +static int
> +mr_find_next_chunk(struct mlx5_mr *mr, struct mlx5_mr_cache *entry,
> +		   int base_idx)
> +{
> +	uintptr_t start =3D 0;
> +	uintptr_t end =3D 0;
> +	uint32_t idx =3D 0;
> +
> +	for (idx =3D base_idx; idx < mr->ms_bmp_n; ++idx) {
> +		if (rte_bitmap_get(mr->ms_bmp, idx)) {
> +			const struct rte_memseg_list *msl;
> +			const struct rte_memseg *ms;
> +
> +			msl =3D mr->msl;
> +			ms =3D rte_fbarray_get(&msl->memseg_arr,
> +					     mr->ms_base_idx + idx);
> +			assert(msl->page_sz =3D=3D ms->hugepage_sz);
> +			if (!start)
> +				start =3D ms->addr_64;
> +			end =3D ms->addr_64 + ms->hugepage_sz;
> +		} else if (start) {
> +			/* Passed the end of a fragment. */
> +			break;
> +		}
> +	}
> +	if (start) {
> +		/* Found one chunk. */
> +		entry->start =3D start;
> +		entry->end =3D end;
> +		entry->lkey =3D rte_cpu_to_be_32(mr->ibv_mr->lkey);
> +	}
> +	return idx;
> +}
> +
> +/**
> + * Insert a MR to the global B-tree cache. It may fail due to low-on-mem=
ory.
> + * Then, this entry will have to be searched by mr_lookup_dev_list() in
> + * mlx5_mr_create() on miss.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + * @param mr
> + *   Pointer to MR to insert.
> + *
> + * @return
> + *   0 on success, -1 on failure.
> + */
> +static int
> +mr_insert_dev_cache(struct rte_eth_dev *dev, struct mlx5_mr *mr)
> +{
> +	struct priv *priv =3D dev->data->dev_private;
> +	unsigned int n;
> +
> +	DRV_LOG(DEBUG, "port %u inserting MR(%p) to global cache",
> +		dev->data->port_id, (void *)mr);
> +	for (n =3D 0; n < mr->ms_bmp_n; ) {
> +		struct mlx5_mr_cache entry =3D { 0, };
> +
> +		/* Find a contiguous chunk and advance the index. */
> +		n =3D mr_find_next_chunk(mr, &entry, n);
> +		if (!entry.end)
> +			break;
> +		if (mr_btree_insert(&priv->mr.cache, &entry) < 0) {
> +			/*
> +			 * Overflowed, but the global table cannot be
> expanded
> +			 * because of deadlock.
> +			 */
> +			return -1;
> +		}
> +	}
> +	return 0;
> +}
> +
> +/**
> + * Look up address in the original global MR list.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry. If no match, this will not be
> updated.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Found MR on match, NULL otherwise.
> + */
> +static struct mlx5_mr *
> +mr_lookup_dev_list(struct rte_eth_dev *dev, struct mlx5_mr_cache
> *entry,
> +		   uintptr_t addr)
> +{
> +	struct priv *priv =3D dev->data->dev_private;
> +	struct mlx5_mr *mr;
> +
> +	/* Iterate all the existing MRs. */
> +	LIST_FOREACH(mr, &priv->mr.mr_list, mr) {
> +		unsigned int n;
> +
> +		if (mr->ms_n =3D=3D 0)
> +			continue;
> +		for (n =3D 0; n < mr->ms_bmp_n; ) {
> +			struct mlx5_mr_cache ret =3D { 0, };
> +
> +			n =3D mr_find_next_chunk(mr, &ret, n);
> +			if (addr >=3D ret.start && addr < ret.end) {
> +				/* Found. */
> +				*entry =3D ret;
> +				return mr;
> +			}
> +		}
> +	}
> +	return NULL;
> +}
> +
> +/**
> + * Look up address on device.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry. If no match, this will not be
> updated.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on failure and rte_errno is se=
t.
> + */
> +static uint32_t
> +mr_lookup_dev(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
> +	      uintptr_t addr)
> +{
> +	struct priv *priv =3D dev->data->dev_private;
> +	uint16_t idx;
> +	uint32_t lkey =3D UINT32_MAX;
> +	struct mlx5_mr *mr;
> +
> +	/*
> +	 * If the global cache has overflowed since it failed to expand the
> +	 * B-tree table, it can't have all the exisitng MRs. Then, the address
> +	 * has to be searched by traversing the original MR list instead, which
> +	 * is very slow path. Otherwise, the global cache is all inclusive.
> +	 */
> +	if (!unlikely(priv->mr.cache.overflow)) {
> +		lkey =3D mr_btree_lookup(&priv->mr.cache, &idx, addr);
> +		if (lkey !=3D UINT32_MAX)
> +			*entry =3D (*priv->mr.cache.table)[idx];
> +	} else {
> +		/* Falling back to the slowest path. */
> +		mr =3D mr_lookup_dev_list(dev, entry, addr);
> +		if (mr !=3D NULL)
> +			lkey =3D entry->lkey;
> +	}
> +	assert(lkey =3D=3D UINT32_MAX || (addr >=3D entry->start &&
> +				      addr < entry->end));
> +	return lkey;
> +}
> +
> +/**
> + * Free MR resources. MR lock must not be held to avoid a deadlock.
> rte_free()
> + * can raise memory free event and the callback function will spin on th=
e
> lock.
> + *
> + * @param mr
> + *   Pointer to MR to free.
> + */
> +static void
> +mr_free(struct mlx5_mr *mr)
> +{
> +	if (mr =3D=3D NULL)
> +		return;
> +	DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr);
> +	if (mr->ibv_mr !=3D NULL)
> +		claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr));
> +	if (mr->ms_bmp !=3D NULL)
> +		rte_bitmap_free(mr->ms_bmp);
> +	rte_free(mr);
> +}
> +
> +/**
> + * Free Memory Region (MR).
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + * @param mr
> + *   Pointer to MR to free.
> + */
> +void
> +mlx5_mr_free(struct rte_eth_dev *dev, struct mlx5_mr *mr)
> +{

Who calls this function? I didn't see any caller.

> +	struct priv *priv =3D dev->data->dev_private;
> +
> +	/* Detach from the list and free resources later. */
> +	rte_rwlock_write_lock(&priv->mr.rwlock);
> +	LIST_REMOVE(mr, mr);
> +	rte_rwlock_write_unlock(&priv->mr.rwlock);
> +	/*
> +	 * rte_free() inside can't be called with holding the lock. This could
> +	 * cause deadlock when calling free callback.
> +	 */
> +	mr_free(mr);
> +	DRV_LOG(DEBUG, "port %u MR(%p) freed", dev->data->port_id,
> (void *)mr);
> +}
> +
> +/**
> + * Releass resources of detached MR having no online entry.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + */
> +static void
> +mlx5_mr_garbage_collect(struct rte_eth_dev *dev)
> +{
> +	struct priv *priv =3D dev->data->dev_private;
> +	struct mlx5_mr *mr_next;
> +	struct mlx5_mr_list free_list =3D LIST_HEAD_INITIALIZER(free_list);
> +
> +	/* Must be called from the primary process. */
> +	assert(rte_eal_process_type() == RTE_PROC_PRIMARY);

Perhaps it is better to have this check as a runtime check rather than under
assert?
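e.g. a sketch of what I mean:

    if (rte_eal_process_type() != RTE_PROC_PRIMARY)
            return;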

> +	/*
> +	 * MR can't be freed with holding the lock because rte_free() could
> call
> +	 * memory free callback function. This will be a deadlock situation.
> +	 */
> +	rte_rwlock_write_lock(&priv->mr.rwlock);
> +	/* Detach the whole free list and release it after unlocking. */
> +	free_list =3D priv->mr.mr_free_list;
> +	LIST_INIT(&priv->mr.mr_free_list);
> +	rte_rwlock_write_unlock(&priv->mr.rwlock);
> +	/* Release resources. */
> +	mr_next =3D LIST_FIRST(&free_list);
> +	while (mr_next !=3D NULL) {
> +		struct mlx5_mr *mr =3D mr_next;
> +
> +		mr_next =3D LIST_NEXT(mr, mr);
> +		mr_free(mr);
> +	}
> +}
> +
> +/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */
> +static int
> +mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
> +			  const struct rte_memseg *ms, size_t len, void *arg)
> +{
> +	struct mr_find_contig_memsegs_data *data =3D arg;
> +
> +	if (data->addr < ms->addr_64 || data->addr >=3D ms->addr_64 + len)
> +		return 0;
> +	/* Found, save it and stop walking. */
> +	data->start =3D ms->addr_64;
> +	data->end =3D ms->addr_64 + len;
> +	data->msl =3D msl;
> +	return 1;
> +}
> +
> +/**
> + * Create a new global Memroy Region (MR) for a missing virtual address.
> + * Register entire virtually contiguous memory chunk around the address.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry, found in the global cache or n=
ewly
> + *   created. If failed to create one, this will not be updated.
> + * @param addr
> + *   Target virtual address to register.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on failure and rte_errno is se=
t.
> + */
> +static uint32_t
> +mlx5_mr_create(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
> +	       uintptr_t addr)
> +{
> +	struct priv *priv =3D dev->data->dev_private;
> +	struct rte_mem_config *mcfg =3D rte_eal_get_configuration()-
> >mem_config;
> +	const struct rte_memseg_list *msl;
> +	const struct rte_memseg *ms;
> +	struct mlx5_mr *mr =3D NULL;
> +	size_t len;
> +	uint32_t ms_n;
> +	uint32_t bmp_size;
> +	void *bmp_mem;
> +	int ms_idx_shift =3D -1;
> +	unsigned int n;
> +	struct mr_find_contig_memsegs_data data =3D {
> +		.addr =3D addr,
> +	};
> +	struct mr_find_contig_memsegs_data data_re;
> +
> +	DRV_LOG(DEBUG, "port %u creating a MR using address (%p)",
> +		dev->data->port_id, (void *)addr);
> +	if (rte_eal_process_type() !=3D RTE_PROC_PRIMARY) {
> +		DRV_LOG(WARNING,
> +			"port %u using address (%p) of unregistered
> mempool"
> +			" in secondary process, please create mempool"
> +			" before rte_eth_dev_start()",
> +			dev->data->port_id, (void *)addr);
> +		rte_errno =3D EPERM;
> +		goto err_nolock;
> +	}
> +	/*
> +	 * Release detached MRs if any. This can't be called with holding
> either
> +	 * memory_hotplug_lock or priv->mr.rwlock. MRs on the free list
> have
> +	 * been detached by the memory free event but it couldn't be
> released
> +	 * inside the callback due to deadlock. As a result, releasing resource=
s
> +	 * is quite opportunistic.
> +	 */
> +	mlx5_mr_garbage_collect(dev);
> +	/*
> +	 * Find out a contiguous virtual address chunk in use, to which the
> +	 * given address belongs, in order to register maximum range. In the
> +	 * best case where mempools are not dynamically recreated and
> +	 * '--socket-mem' is speicified as an EAL option, it is very likely to
> +	 * have only one MR(LKey) per a socket and per a hugepage-size
> even
> +	 * though the system memory is highly fragmented.
> +	 */
> +	if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb,
> &data)) {
> +		DRV_LOG(WARNING,
> +			"port %u unable to find virtually contigous"
> +			" chunk for address (%p)."
> +			" rte_memseg_contig_walk() failed.",
> +			dev->data->port_id, (void *)addr);
> +		rte_errno =3D ENXIO;
> +		goto err_nolock;
> +	}
> +alloc_resources:
> +	/* Addresses must be page-aligned. */
> +	assert(rte_is_aligned((void *)data.start, data.msl->page_sz));
> +	assert(rte_is_aligned((void *)data.end, data.msl->page_sz));

Better to have this check outside of assert.

> +	msl =3D data.msl;
> +	ms =3D rte_mem_virt2memseg((void *)data.start, msl);
> +	len =3D data.end - data.start;
> +	assert(msl->page_sz =3D=3D ms->hugepage_sz);
> +	/* Number of memsegs in the range. */
> +	ms_n =3D len / msl->page_sz;
> +	DRV_LOG(DEBUG,
> +		"port %u extending %p to [0x%lx, 0x%lx), page_sz=3D0x%lx,
> ms_n=3D%u",
> +		dev->data->port_id, (void *)addr,
> +		data.start, data.end, msl->page_sz, ms_n);
> +	/* Size of memory for bitmap. */
> +	bmp_size =3D rte_bitmap_get_memory_footprint(ms_n);
> +	mr =3D rte_zmalloc_socket(NULL,
> +				RTE_ALIGN_CEIL(sizeof(*mr),
> +					       RTE_CACHE_LINE_SIZE) +
> +				bmp_size,
> +				RTE_CACHE_LINE_SIZE, msl->socket_id);
> +	if (mr =3D=3D NULL) {
> +		DRV_LOG(WARNING,
> +			"port %u unable to allocate memory for a new MR
> of"
> +			" address (%p).",
> +			dev->data->port_id, (void *)addr);
> +		rte_errno =3D ENOMEM;
> +		goto err_nolock;
> +	}
> +	mr->msl =3D msl;
> +	/*
> +	 * Save the index of the first memseg and initialize memseg bitmap.
> To
> +	 * see if a memseg of ms_idx in the memseg-list is still valid, check:
> +	 *	rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx)
> +	 */
> +	mr->ms_base_idx =3D rte_fbarray_find_idx(&msl->memseg_arr, ms);
> +	bmp_mem =3D RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE);
> +	mr->ms_bmp =3D rte_bitmap_init(ms_n, bmp_mem, bmp_size);
> +	if (mr->ms_bmp =3D=3D NULL) {
> +		DRV_LOG(WARNING,
> +			"port %u unable to initialize bitamp for a new MR of"
> +			" address (%p).",
> +			dev->data->port_id, (void *)addr);
> +		rte_errno =3D EINVAL;
> +		goto err_nolock;
> +	}
> +	/*
> +	 * Should recheck whether the extended contiguous chunk is still
> valid.
> +	 * Because memory_hotplug_lock can't be held if there's any
> memory
> +	 * related calls in a critical path, resource allocation above can't be
> +	 * locked. If the memory has been changed at this point, try again
> with
> +	 * just single page. If not, go on with the big chunk atomically from
> +	 * here.
> +	 */
> +	rte_rwlock_read_lock(&mcfg->memory_hotplug_lock);
> +	data_re =3D data;
> +	if (len > msl->page_sz &&
> +	    !rte_memseg_contig_walk(mr_find_contig_memsegs_cb,
> &data_re)) {
> +		DRV_LOG(WARNING,
> +			"port %u unable to find virtually contigous"
> +			" chunk for address (%p)."
> +			" rte_memseg_contig_walk() failed.",
> +			dev->data->port_id, (void *)addr);
> +		rte_errno =3D ENXIO;
> +		goto err_memlock;
> +	}
> +	if (data.start !=3D data_re.start || data.end !=3D data_re.end) {
> +		/*
> +		 * The extended contiguous chunk has been changed. Try
> again
> +		 * with single memseg instead.
> +		 */
> +		data.start =3D RTE_ALIGN_FLOOR(addr, msl->page_sz);
> +		data.end =3D data.start + msl->page_sz;
> +		rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock);
> +		mr_free(mr);
> +		goto alloc_resources;
> +	}
> +	assert(data.msl =3D=3D data_re.msl);
> +	rte_rwlock_write_lock(&priv->mr.rwlock);
> +	/*
> +	 * Check the address is really missing. If other thread already created
> +	 * one or it is not found due to overflow, abort and return.
> +	 */
> +	if (mr_lookup_dev(dev, entry, addr) !=3D UINT32_MAX) {
> +		/*
> +		 * Insert to the global cache table. It may fail due to
> +		 * low-on-memory. Then, this entry will have to be searched
> +		 * here again.
> +		 */
> +		mr_btree_insert(&priv->mr.cache, entry);
> +		DRV_LOG(DEBUG,
> +			"port %u found MR for %p on final lookup, abort",
> +			dev->data->port_id, (void *)addr);
> +		rte_rwlock_write_unlock(&priv->mr.rwlock);
> +		rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock);
> +		/*
> +		 * Must be unlocked before calling rte_free() because
> +		 * mlx5_mr_mem_event_free_cb() can be called inside.
> +		 */
> +		mr_free(mr);
> +		return entry->lkey;
> +	}
> +	/*
> +	 * Trim start and end addresses for verbs MR. Set bits for registering
> +	 * memsegs but exclude already registered ones. Bitmap can be
> +	 * fragmented.
> +	 */
> +	for (n =3D 0; n < ms_n; ++n) {
> +		uintptr_t start;
> +		struct mlx5_mr_cache ret =3D { 0, };
> +
> +		start =3D data_re.start + n * msl->page_sz;
> +		/* Exclude memsegs already registered by other MRs. */
> +		if (mr_lookup_dev(dev, &ret, start) =3D=3D UINT32_MAX) {
> +			/*
> +			 * Start from the first unregistered memseg in the
> +			 * extended range.
> +			 */
> +			if (ms_idx_shift =3D=3D -1) {
> +				mr->ms_base_idx +=3D n;
> +				data.start =3D start;
> +				ms_idx_shift =3D n;
> +			}
> +			data.end =3D start + msl->page_sz;
> +			rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift);
> +			++mr->ms_n;
> +		}
> +	}
> +	len =3D data.end - data.start;
> +	mr->ms_bmp_n =3D len / msl->page_sz;
> +	assert(ms_idx_shift + mr->ms_bmp_n <=3D ms_n);
> +	/*
> +	 * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can
> be
> +	 * called with holding the memory lock because it doesn't use
> +	 * mlx5_alloc_buf_extern() which eventually calls
> rte_malloc_socket()
> +	 * through mlx5_alloc_verbs_buf().
> +	 */
> +	mr->ibv_mr =3D mlx5_glue->reg_mr(priv->pd, (void *)data.start, len,
> +				       IBV_ACCESS_LOCAL_WRITE);
> +	if (mr->ibv_mr =3D=3D NULL) {
> +		DRV_LOG(WARNING,
> +			"port %u fail to create a verbs MR for address (%p)",
> +			dev->data->port_id, (void *)addr);
> +		rte_errno =3D EINVAL;
> +		goto err_mrlock;
> +	}
> +	assert((uintptr_t)mr->ibv_mr->addr =3D=3D data.start);
> +	assert(mr->ibv_mr->length =3D=3D len);
> +	LIST_INSERT_HEAD(&priv->mr.mr_list, mr, mr);
> +	DRV_LOG(DEBUG,
> +		"port %u MR CREATED (%p) for %p:\n"
> +		"  [0x%lx, 0x%lx), lkey=3D0x%x base_idx=3D%u ms_n=3D%u,
> ms_bmp_n=3D%u",
> +		dev->data->port_id, (void *)mr, (void *)addr,
> +		data.start, data.end, rte_cpu_to_be_32(mr->ibv_mr->lkey),
> +		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
> +	/* Insert to the global cache table. */
> +	mr_insert_dev_cache(dev, mr);
> +	/* Fill in output data. */
> +	mr_lookup_dev(dev, entry, addr);
> +	/* Lookup can't fail. */
> +	assert(entry->lkey !=3D UINT32_MAX);
> +	rte_rwlock_write_unlock(&priv->mr.rwlock);
> +	rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock);
> +	return entry->lkey;
> +err_mrlock:
> +	rte_rwlock_write_unlock(&priv->mr.rwlock);
> +err_memlock:
> +	rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock);
> +err_nolock:
> +	/*
> +	 * In case of error, as this can be called in a datapath, a warning
> +	 * message per an error is preferable instead. Must be unlocked
> before
> +	 * calling rte_free() because mlx5_mr_mem_event_free_cb() can be
> called
> +	 * inside.
> +	 */
> +	mr_free(mr);
> +	return UINT32_MAX;
> +}
> +
> +/**
> + * Rebuild the global B-tree cache of device from the original MR list.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + */
> +static void
> +mr_rebuild_dev_cache(struct rte_eth_dev *dev)
> +{
> +	struct priv *priv =3D dev->data->dev_private;
> +	struct mlx5_mr *mr;
> +
> +	DRV_LOG(DEBUG, "port %u rebuild dev cache[]", dev->data-
> >port_id);
> +	/* Flush cache to rebuild. */
> +	priv->mr.cache.len =3D 1;
> +	priv->mr.cache.overflow =3D 0;
> +	/* Iterate all the existing MRs. */
> +	LIST_FOREACH(mr, &priv->mr.mr_list, mr)
> +		if (mr_insert_dev_cache(dev, mr) < 0)
> +			return;
> +}
> +
> +/**
> + * Callback for memory free event. Iterate freed memsegs and check
> whether it
> + * belongs to an existing MR. If found, clear the bit from bitmap of MR.=
 As a
> + * result, the MR would be fragmented. If it becomes empty, the MR will =
be
> freed
> + * later by mlx5_mr_garbage_collect(). Even if this callback is called f=
rom a
> + * secondary process, the garbage collector will be called in primary pr=
ocess
> + * as the secondary process can't call mlx5_mr_create().
> + *
> + * The global cache must be rebuilt if there's any change and this event=
 has
> to
> + * be propagated to dataplane threads to flush the local caches.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + * @param addr
> + *   Address of freed memory.
> + * @param len
> + *   Size of freed memory.
> + */
> +static void
> +mlx5_mr_mem_event_free_cb(struct rte_eth_dev *dev, const void *addr,
> size_t len)
> +{
> +	struct priv *priv =3D dev->data->dev_private;
> +	const struct rte_memseg_list *msl;
> +	struct mlx5_mr *mr;
> +	int ms_n;
> +	int i;
> +	int rebuild =3D 0;
> +
> +	DRV_LOG(DEBUG, "port %u free callback: addr=3D%p, len=3D%lu",
> +		dev->data->port_id, addr, len);
> +	msl =3D rte_mem_virt2memseg_list(addr);
> +	/* addr and len must be page-aligned. */
> +	assert((uintptr_t)addr =3D=3D RTE_ALIGN((uintptr_t)addr, msl-
> >page_sz));
> +	assert(len =3D=3D RTE_ALIGN(len, msl->page_sz));
> +	ms_n =3D len / msl->page_sz;
> +	rte_rwlock_write_lock(&priv->mr.rwlock);
> +	/* Clear bits of freed memsegs from MR. */
> +	for (i =3D 0; i < ms_n; ++i) {
> +		const struct rte_memseg *ms;
> +		struct mlx5_mr_cache entry;
> +		uintptr_t start;
> +		int ms_idx;
> +		uint32_t pos;
> +
> +		/* Find MR having this memseg. */
> +		start =3D (uintptr_t)addr + i * msl->page_sz;
> +		mr =3D mr_lookup_dev_list(dev, &entry, start);
> +		if (mr =3D=3D NULL)
> +			continue;
> +		ms =3D rte_mem_virt2memseg((void *)start, msl);
> +		assert(ms !=3D NULL);
> +		assert(msl->page_sz =3D=3D ms->hugepage_sz);
> +		ms_idx =3D rte_fbarray_find_idx(&msl->memseg_arr, ms);
> +		pos =3D ms_idx - mr->ms_base_idx;
> +		assert(rte_bitmap_get(mr->ms_bmp, pos));
> +		assert(pos < mr->ms_bmp_n);
> +		DRV_LOG(DEBUG, "port %u MR(%p): clear bitmap[%u] for
> addr %p",
> +			dev->data->port_id, (void *)mr, pos, (void *)start);
> +		rte_bitmap_clear(mr->ms_bmp, pos);
> +		if (--mr->ms_n =3D=3D 0) {
> +			LIST_REMOVE(mr, mr);
> +			LIST_INSERT_HEAD(&priv->mr.mr_free_list, mr, mr);
> +			DRV_LOG(DEBUG, "port %u remove MR(%p) from
> list",
> +				dev->data->port_id, (void *)mr);
> +		}
> +		/*
> +		 * MR is fragmented or will be freed. the global cache must
> be
> +		 * rebuilt.
> +		 */
> +		rebuild =3D 1;
> +	}
> +	if (rebuild) {
> +		mr_rebuild_dev_cache(dev);
> +		/*
> +		 * Flush local caches by propagating invalidation across cores.
> +		 * rte_smp_wmb() is enough to synchronize this event. If
> one of
> +		 * freed memsegs is seen by other core, that means the
> memseg
> +		 * has been allocated by allocator, which will come after this
> +		 * free call. Therefore, this store instruction (incrementing
> +		 * generation below) will be guaranteed to be seen by other
> core
> +		 * before the core sees the newly allocated memory.
> +		 */
> +		++priv->mr.dev_gen;
> +		DRV_LOG(DEBUG, "broadcasting local cache flush, gen=3D%d",
> +			priv->mr.dev_gen);
> +		rte_smp_wmb();
> +	}
> +	rte_rwlock_write_unlock(&priv->mr.rwlock);
> +	if (rebuild && rte_log_get_level(mlx5_logtype) =3D=3D RTE_LOG_DEBUG)
> +		mlx5_mr_dump_dev(dev);
> +}
> +
> +/**
> + * Callback for memory event. This can be called from both primary and
> secondary
> + * process.
> + *
> + * @param event_type
> + *   Memory event type.
> + * @param addr
> + *   Address of memory.
> + * @param len
> + *   Size of memory.
> + */
> +void
> +mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void
> *addr,
> +		     size_t len)
> +{
> +	struct priv *priv;
> +	struct mlx5_dev_list *dev_list =3D &mlx5_shared_data-
> >mem_event_cb_list;
> +
> +	switch (event_type) {
> +	case RTE_MEM_EVENT_FREE:
> +		rte_rwlock_write_lock(&mlx5_shared_data-
> >mem_event_rwlock);
> +		/* Iterate all the existing mlx5 devices. */
> +		LIST_FOREACH(priv, dev_list, mem_event_cb)
> +			mlx5_mr_mem_event_free_cb(eth_dev(priv), addr,
> len);
> +		rte_rwlock_write_unlock(&mlx5_shared_data-
> >mem_event_rwlock);
> +		break;
> +	case RTE_MEM_EVENT_ALLOC:
> +	default:
> +		break;
> +	}
> +}
> +
> +/**
> + * Look up address in the global MR cache table. If not found, create a =
new
> MR.
> + * Insert the found/created entry to local bottom-half cache table.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + * @param mr_ctrl
> + *   Pointer to per-queue MR control structure.
> + * @param[out] entry
> + *   Pointer to returning MR cache entry, found in the global cache or n=
ewly
> + *   created. If failed to create one, this is not written.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static uint32_t
> +mlx5_mr_lookup_dev(struct rte_eth_dev *dev, struct mlx5_mr_ctrl
> *mr_ctrl,
> +		   struct mlx5_mr_cache *entry, uintptr_t addr)
> +{
> +	struct priv *priv =3D dev->data->dev_private;
> +	struct mlx5_mr_btree *bt =3D &mr_ctrl->cache_bh;
> +	uint16_t idx;
> +	uint32_t lkey;
> +
> +	/* If local cache table is full, try to double it. */
> +	if (unlikely(bt->len =3D=3D bt->size))
> +		mr_btree_expand(bt, bt->size << 1);
> +	/* Look up in the global cache. */
> +	rte_rwlock_read_lock(&priv->mr.rwlock);
> +	lkey =3D mr_btree_lookup(&priv->mr.cache, &idx, addr);
> +	if (lkey !=3D UINT32_MAX) {
> +		/* Found. */
> +		*entry =3D (*priv->mr.cache.table)[idx];
> +		rte_rwlock_read_unlock(&priv->mr.rwlock);
> +		/*
> +		 * Update local cache. Even if it fails, return the found entry
> +		 * to update top-half cache. Next time, this entry will be
> found
> +		 * in the global cache.
> +		 */
> +		mr_btree_insert(bt, entry);
> +		return lkey;
> +	}
> +	rte_rwlock_read_unlock(&priv->mr.rwlock);
> +	/* First time to see the address? Create a new MR. */
> +	lkey =3D mlx5_mr_create(dev, entry, addr);

Shouldn't we also check whether the address is in the global MR list, for the
case where the global cache has overflowed?

> +	/*
> +	 * Update the local cache if successfully created a new global MR.
> Even
> +	 * if failed to create one, there's no action to take in this datapath
> +	 * code. As returning LKey is invalid, this will eventually make HW
> +	 * fail.
> +	 */
> +	if (lkey !=3D UINT32_MAX)
> +		mr_btree_insert(bt, entry);
> +	return lkey;
> +}
> +
> +/**
> + * Bottom-half of LKey search on datapath. Firstly search in cache_bh[] =
and
> if
> + * misses, search in the global MR cache table and update the new entry =
to
> + * per-queue local caches.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + * @param mr_ctrl
> + *   Pointer to per-queue MR control structure.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static uint32_t
> +mlx5_mr_addr2mr_bh(struct rte_eth_dev *dev, struct mlx5_mr_ctrl
> *mr_ctrl,
> +		   uintptr_t addr)
> +{
> +	uint32_t lkey;
> +	uint16_t bh_idx =3D 0;
> +	/* Victim in top-half cache to replace with new entry. */
> +	struct mlx5_mr_cache *repl =3D &mr_ctrl->cache[mr_ctrl->head];
> +
> +	/* Binary-search MR translation table. */
> +	lkey =3D mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr);
> +	/* Update top-half cache. */
> +	if (likely(lkey !=3D UINT32_MAX)) {
> +		*repl =3D (*mr_ctrl->cache_bh.table)[bh_idx];
> +	} else {
> +		/*
> +		 * If missed in local lookup table, search in the global cache
> +		 * and local cache_bh[] will be updated inside if possible.
> +		 * Top-half cache entry will also be updated.
> +		 */
> +		lkey =3D mlx5_mr_lookup_dev(dev, mr_ctrl, repl, addr);
> +		if (unlikely(lkey =3D=3D UINT32_MAX))
> +			return UINT32_MAX;
> +	}
> +	/* Update the most recently used entry. */
> +	mr_ctrl->mru =3D mr_ctrl->head;
> +	/* Point to the next victim, the oldest. */
> +	mr_ctrl->head =3D (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
> +	return lkey;
> +}
> +
> +/**
> + * Bottom-half of LKey search on Rx.
> + *
> + * @param rxq
> + *   Pointer to Rx queue structure.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +uint32_t
> +mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
> +{
> +	struct mlx5_rxq_ctrl *rxq_ctrl =
> +		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
> +	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
> +	struct priv *priv = rxq_ctrl->priv;
> +
> +	DRV_LOG(DEBUG,
> +		"Rx queue %u: miss on top-half, mru=%u, head=%u, addr=%p",
> +		rxq_ctrl->idx, mr_ctrl->mru, mr_ctrl->head, (void *)addr);
> +	return mlx5_mr_addr2mr_bh(eth_dev(priv), mr_ctrl, addr);
> +}

Shouldn't this code path be in mlx5_rxq.c?

> +
> +/**
> + * Bottom-half of LKey search on Tx.
> + *
> + * @param txq
> + *   Pointer to Tx queue structure.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +uint32_t
> +mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr)
> +{
> +	struct mlx5_txq_ctrl *txq_ctrl =
> +		container_of(txq, struct mlx5_txq_ctrl, txq);
> +	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
> +	struct priv *priv = txq_ctrl->priv;
> +
> +	DRV_LOG(DEBUG,
> +		"Tx queue %u: miss on top-half, mru=%u, head=%u, addr=%p",
> +		txq_ctrl->idx, mr_ctrl->mru, mr_ctrl->head, (void *)addr);
> +	return mlx5_mr_addr2mr_bh(eth_dev(priv), mr_ctrl, addr);
> +}
> +

Same for txq.

> +/**
> + * Flush all of the local cache entries.
> + *
> + * @param mr_ctrl
> + *   Pointer to per-queue MR control structure.
> + */
> +void
> +mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl)
> +{
> +	/* Reset the most-recently-used index. */
> +	mr_ctrl->mru =3D 0;
> +	/* Reset the linear search array. */
> +	mr_ctrl->head =3D 0;
> +	memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache));
> +	/* Reset the B-tree table. */
> +	mr_ctrl->cache_bh.len =3D 1;
> +	mr_ctrl->cache_bh.overflow =3D 0;
> +	/* Update the generation number. */
> +	mr_ctrl->cur_gen =3D *mr_ctrl->dev_gen_ptr;
> +	DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=3D%d",
> +		(void *)mr_ctrl, mr_ctrl->cur_gen);
> +}
> +
> +/* Called during rte_mempool_mem_iter() by mlx5_mr_update_mp(). */
> +static void
> +mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void
> *opaque,
> +		     struct rte_mempool_memhdr *memhdr,
> +		     unsigned mem_idx __rte_unused)
> +{
> +	struct mr_update_mp_data *data =3D opaque;
> +	uint32_t lkey;
> +
> +	/* Stop iteration if failed in the previous walk. */
> +	if (data->ret < 0)
> +		return;
> +	/* Register address of the chunk and update local caches. */
> +	lkey =3D mlx5_mr_addr2mr_bh(data->dev, data->mr_ctrl,
> +				  (uintptr_t)memhdr->addr);
> +	if (lkey =3D=3D UINT32_MAX)
> +		data->ret =3D -1;
> +}
> +
> +/**
> + * Register entire memory chunks in a Mempool.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + * @param mr_ctrl
> + *   Pointer to per-queue MR control structure.
> + * @param mp
> + *   Pointer to registering Mempool.
> + *
> + * @return
> + *   0 on success, -1 on failure.
> + */
> +int
> +mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl
> *mr_ctrl,
> +		  struct rte_mempool *mp)
> +{
> +	struct mr_update_mp_data data =3D {
> +		.dev =3D dev,
> +		.mr_ctrl =3D mr_ctrl,
> +		.ret =3D 0,
> +	};
> +
> +	rte_mempool_mem_iter(mp, mlx5_mr_update_mp_cb, &data);
> +	return data.ret;
> +}
> +
> +/**
> + * Dump all the created MRs and the global cache entries.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + */
> +void
> +mlx5_mr_dump_dev(struct rte_eth_dev *dev)
> +{
> +	struct priv *priv =3D dev->data->dev_private;
> +	struct mlx5_mr *mr;
> +	int mr_n =3D 0;
> +	int chunk_n =3D 0;
> +
> +	rte_rwlock_read_lock(&priv->mr.rwlock);
> +	/* Iterate all the existing MRs. */
> +	LIST_FOREACH(mr, &priv->mr.mr_list, mr) {
> +		unsigned int n;
> +
> +		DRV_LOG(DEBUG,
> +			"port %u MR[%u], LKey =3D 0x%x, ms_n =3D %u,
> ms_bmp_n =3D %u",
> +			dev->data->port_id, mr_n++,
> +			rte_cpu_to_be_32(mr->ibv_mr->lkey),
> +			mr->ms_n, mr->ms_bmp_n);
> +		if (mr->ms_n =3D=3D 0)
> +			continue;
> +		for (n =3D 0; n < mr->ms_bmp_n; ) {
> +			struct mlx5_mr_cache ret =3D { 0, };
> +
> +			n =3D mr_find_next_chunk(mr, &ret, n);
> +			if (!ret.end)
> +				break;
> +			DRV_LOG(DEBUG, "  chunk[%u], [0x%lx, 0x%lx)",
> +				chunk_n++, ret.start, ret.end);
> +		}
> +	}
> +	DRV_LOG(DEBUG, "port %u dumping global cache", dev->data-
> >port_id);
> +	mlx5_mr_btree_dump(&priv->mr.cache);
> +	rte_rwlock_read_unlock(&priv->mr.rwlock);
> +}
> +
> +/**
> + * Release all the created MRs and resources. Remove device from memory
> callback
> + * list.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + */
> +void
> +mlx5_mr_release(struct rte_eth_dev *dev)
> +{
> +	struct priv *priv =3D dev->data->dev_private;
> +	struct mlx5_mr *mr_next =3D LIST_FIRST(&priv->mr.mr_list);
> +
> +	/* Remove from memory callback device list. */
> +	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
> +	LIST_REMOVE(priv, mem_event_cb);
> +	rte_rwlock_write_unlock(&mlx5_shared_data-
> >mem_event_rwlock);
> +	if (rte_log_get_level(mlx5_logtype) =3D=3D RTE_LOG_DEBUG)
> +		mlx5_mr_dump_dev(dev);
> +	rte_rwlock_write_lock(&priv->mr.rwlock);
> +	/* Detach from MR list and move to free list. */
> +	while (mr_next !=3D NULL) {
> +		struct mlx5_mr *mr =3D mr_next;
> +
> +		mr_next =3D LIST_NEXT(mr, mr);
> +		LIST_REMOVE(mr, mr);
> +		LIST_INSERT_HEAD(&priv->mr.mr_free_list, mr, mr);
> +	}
> +	LIST_INIT(&priv->mr.mr_list);
> +	/* Free global cache. */
> +	mlx5_mr_btree_free(&priv->mr.cache);
> +	rte_rwlock_write_unlock(&priv->mr.rwlock);
> +	/* Free all remaining MRs. */
> +	mlx5_mr_garbage_collect(dev);
> +}
> diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
> new file mode 100644
> index 000000000..a0a0ef755
> --- /dev/null
> +++ b/drivers/net/mlx5/mlx5_mr.h
> @@ -0,0 +1,121 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2018 6WIND S.A.
> + * Copyright 2018 Mellanox Technologies, Ltd
> + */
> +
> +#ifndef RTE_PMD_MLX5_MR_H_
> +#define RTE_PMD_MLX5_MR_H_
> +
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <sys/queue.h>
> +
> +/* Verbs header. */
> +/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
> +#ifdef PEDANTIC
> +#pragma GCC diagnostic ignored "-Wpedantic"
> +#endif
> +#include <infiniband/verbs.h>
> +#include <infiniband/mlx5dv.h>
> +#ifdef PEDANTIC
> +#pragma GCC diagnostic error "-Wpedantic"
> +#endif
> +
> +#include <rte_eal_memconfig.h>
> +#include <rte_ethdev.h>
> +#include <rte_rwlock.h>
> +#include <rte_bitmap.h>
> +
> +/* Memory Region object. */
> +struct mlx5_mr {
> +	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
> +	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
> +	const struct rte_memseg_list *msl;
> +	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
> +	int ms_n; /* Number of memsegs in use. */
> +	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
> +	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonged to
> MR. */
> +};
> +
> +/* Cache entry for Memory Region. */
> +struct mlx5_mr_cache {
> +	uintptr_t start; /* Start address of MR. */
> +	uintptr_t end; /* End address of MR. */
> +	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
> +} __rte_packed;
> +
> +/* MR Cache table for Binary search. */
> +struct mlx5_mr_btree {
> +	uint16_t len; /* Number of entries. */
> +	uint16_t size; /* Total number of entries. */
> +	int overflow; /* Mark failure of table expansion. */
> +	struct mlx5_mr_cache (*table)[];
> +} __rte_packed;
> +
> +/* Per-queue MR control descriptor. */
> +struct mlx5_mr_ctrl {
> +	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
> +	uint32_t cur_gen; /* Generation number saved to flush caches. */
> +	uint16_t mru; /* Index of last hit entry in top-half cache. */
> +	uint16_t head; /* Index of the oldest entry in top-half cache. */
> +	struct mlx5_mr_cache cache[MLX5_MR_CACHE_N]; /* Cache for
> top-half. */
> +	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
> +} __rte_packed;
> +
> +/* First entry must be NULL for comparison. */
> +#define MR_N(n) ((n) - 1)
> +
> +/* Whether there's only one entry in MR lookup table. */
> +#define IS_SINGLE_MR(n) (MR_N(n) == 1)

MLX5_IS_SINGLE_MR

> +
> +extern struct mlx5_dev_list  mlx5_mem_event_cb_list;
> +extern rte_rwlock_t mlx5_mem_event_rwlock;
> +
> +void mlx5_mr_free(struct rte_eth_dev *dev, struct mlx5_mr *mr);
> +int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket);
> +void mlx5_mr_btree_free(struct mlx5_mr_btree *bt);
> +void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
> +			  size_t len);
> +int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
> +		      struct rte_mempool *mp);
> +void mlx5_mr_dump_dev(struct rte_eth_dev *dev);
> +void mlx5_mr_release(struct rte_eth_dev *dev);
> +
> +/**
> + * Look up LKey from given lookup table by linear search. Firstly look up the
> + * last-hit entry. If miss, the entire array is searched. If found, update the
> + * last-hit index and return LKey.
> + *
> + * @param lkp_tbl
> + *   Pointer to lookup table.
> + * @param[in,out] cached_idx
> + *   Pointer to last-hit index.
> + * @param n
> + *   Size of lookup table.
> + * @param addr
> + *   Search key.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static __rte_always_inline uint32_t
> +mlx5_mr_lookup_cache(struct mlx5_mr_cache *lkp_tbl, uint16_t *cached_idx,
> +		     uint16_t n, uintptr_t addr)
> +{
> +	uint16_t idx;
> +
> +	if (likely(addr >= lkp_tbl[*cached_idx].start &&
> +		   addr < lkp_tbl[*cached_idx].end))
> +		return lkp_tbl[*cached_idx].lkey;
> +	for (idx = 0; idx < n && lkp_tbl[idx].start != 0; ++idx) {
> +		if (addr >= lkp_tbl[idx].start &&
> +		    addr < lkp_tbl[idx].end) {
> +			/* Found. */
> +			*cached_idx = idx;
> +			return lkp_tbl[idx].lkey;
> +		}
> +	}
> +	return UINT32_MAX;
> +}
> +
> +#endif /* RTE_PMD_MLX5_MR_H_ */
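
Not a defect, just recording how I read the top-half lookup: the scan stops
at the first entry whose start address is zero, and the cached MRU index
short-circuits back-to-back hits on the same MR. A minimal standalone
harness I used to convince myself (simplified types and hypothetical
addresses, not driver code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for struct mlx5_mr_cache. */
struct cache_entry {
	uintptr_t start; /* Start address of MR. */
	uintptr_t end;   /* End address of MR. */
	uint32_t lkey;   /* LKey to return on hit. */
};

/* Same algorithm as mlx5_mr_lookup_cache(), with the driver types removed. */
static uint32_t
lookup(const struct cache_entry *tbl, uint16_t *mru, uint16_t n, uintptr_t addr)
{
	uint16_t idx;

	/* Fast path: the most-recently-used entry. */
	if (addr >= tbl[*mru].start && addr < tbl[*mru].end)
		return tbl[*mru].lkey;
	/* Slow path: linear scan, stops at the first empty (start == 0) slot. */
	for (idx = 0; idx < n && tbl[idx].start != 0; ++idx) {
		if (addr >= tbl[idx].start && addr < tbl[idx].end) {
			*mru = idx;
			return tbl[idx].lkey;
		}
	}
	return UINT32_MAX;
}

int
main(void)
{
	/* Two populated entries; the zeroed tail acts as the terminator. */
	struct cache_entry tbl[8] = {
		{ 0x1000, 0x2000, 11 },
		{ 0x8000, 0x9000, 22 },
	};
	uint16_t mru = 0;

	printf("%" PRIu32 "\n", lookup(tbl, &mru, 8, 0x8800)); /* 22, mru moves to 1 */
	printf("%" PRIu32 "\n", lookup(tbl, &mru, 8, 0x8900)); /* 22, fast path */
	printf("%" PRIu32 "\n", lookup(tbl, &mru, 8, 0x5000)); /* UINT32_MAX, no match */
	return 0;
}
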
> diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
> index d4fe1fed7..22e2f9673 100644
> --- a/drivers/net/mlx5/mlx5_rxq.c
> +++ b/drivers/net/mlx5/mlx5_rxq.c
> @@ -789,7 +789,7 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, uint16_t idx)
>  			.addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(buf,
>  								  uintptr_t)),
>  			.byte_count = rte_cpu_to_be_32(DATA_LEN(buf)),
> -			.lkey = UINT32_MAX,
> +			.lkey = mlx5_rx_mb2mr(rxq_data, buf),
>  		};
>  	}
>  	rxq_data->rq_db = rwq.dbrec;
> @@ -967,6 +967,11 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
>  		rte_errno = ENOMEM;
>  		return NULL;
>  	}
> +	if (mlx5_mr_btree_init(&tmpl->rxq.mr_ctrl.cache_bh,
> +			       MLX5_MR_BTREE_CACHE_N, socket)) {
> +		/* rte_errno is already set. */
> +		goto error;
> +	}
>  	tmpl->socket = socket;
>  	if (dev->data->dev_conf.intr_conf.rxq)
>  		tmpl->irq = 1;
> @@ -1120,6 +1125,7 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
>  		DRV_LOG(DEBUG, "port %u Rx queue %u: refcnt %d",
>  			dev->data->port_id, rxq_ctrl->idx,
>  			rte_atomic32_read(&rxq_ctrl->refcnt));
> +		mlx5_mr_btree_free(&rxq_ctrl->rxq.mr_ctrl.cache_bh);
>  		LIST_REMOVE(rxq_ctrl, next);
>  		rte_free(rxq_ctrl);
>  		(*priv->rxqs)[idx] = NULL;
> diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> index 56c243495..8a863c157 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.c
> +++ b/drivers/net/mlx5/mlx5_rxtx.c
> @@ -1965,6 +1965,9 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
>  		 * changes.
>  		 */
>  		wqe->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(rep, uintptr_t));
> +		/* If there's only one MR, no need to replace LKey in WQE. */
> +		if (unlikely(!IS_SINGLE_MR(rxq->mr_ctrl.cache_bh.len)))
> +			wqe->lkey = mlx5_rx_mb2mr(rxq, rep);
>  		if (len > DATA_LEN(seg)) {
>  			len -= DATA_LEN(seg);
>  			++NB_SEGS(pkt);
> diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> index e8cad51aa..74581cf9b 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.h
> +++ b/drivers/net/mlx5/mlx5_rxtx.h
> @@ -29,6 +29,7 @@
>
>  #include "mlx5_utils.h"
>  #include "mlx5.h"
> +#include "mlx5_mr.h"
>  #include "mlx5_autoconf.h"
>  #include "mlx5_defs.h"
>  #include "mlx5_prm.h"
> @@ -81,6 +82,7 @@ struct mlx5_rxq_data {
>  	uint16_t rq_ci;
>  	uint16_t rq_pi;
>  	uint16_t cq_ci;
> +	struct mlx5_mr_ctrl mr_ctrl; /* MR control descriptor. */
>  	volatile struct mlx5_wqe_data_seg(*wqes)[];
>  	volatile struct mlx5_cqe(*cqes)[];
>  	struct rxq_zip zip; /* Compressed context. */
> @@ -109,8 +111,8 @@ struct mlx5_rxq_ibv {
>  struct mlx5_rxq_ctrl {
>  	LIST_ENTRY(mlx5_rxq_ctrl) next; /* Pointer to the next element. */
>  	rte_atomic32_t refcnt; /* Reference counter. */
> -	struct priv *priv; /* Back pointer to private data. */
>  	struct mlx5_rxq_ibv *ibv; /* Verbs elements. */
> +	struct priv *priv; /* Back pointer to private data. */
>  	struct mlx5_rxq_data rxq; /* Data path structure. */
>  	unsigned int socket; /* CPU socket ID for allocations. */
>  	uint32_t tunnel_types[16]; /* Tunnel type counter. */
> @@ -165,6 +167,7 @@ struct mlx5_txq_data {
>  	uint16_t inline_max_packet_sz; /* Max packet size for inlining. */
>  	uint32_t qp_num_8s; /* QP number shifted by 8. */
>  	uint64_t offloads; /* Offloads for Tx Queue. */
> +	struct mlx5_mr_ctrl mr_ctrl; /* MR control descriptor. */
>  	volatile struct mlx5_cqe (*cqes)[]; /* Completion queue. */
>  	volatile void *wqes; /* Work queue (use volatile to write into). */
>  	volatile uint32_t *qp_db; /* Work queue doorbell. */
> @@ -187,11 +190,11 @@ struct mlx5_txq_ibv {
>  struct mlx5_txq_ctrl {
>  	LIST_ENTRY(mlx5_txq_ctrl) next; /* Pointer to the next element. */
>  	rte_atomic32_t refcnt; /* Reference counter. */
> -	struct priv *priv; /* Back pointer to private data. */
>  	unsigned int socket; /* CPU socket ID for allocations. */
>  	unsigned int max_inline_data; /* Max inline data. */
>  	unsigned int max_tso_header; /* Max TSO header size. */
>  	struct mlx5_txq_ibv *ibv; /* Verbs queue object. */
> +	struct priv *priv; /* Back pointer to private data. */
>  	struct mlx5_txq_data txq; /* Data path structure. */
>  	off_t uar_mmap_offset; /* UAR mmap offset for non-primary process. */
>  	volatile void *bf_reg_orig; /* Blueflame register from verbs. */
> @@ -308,6 +311,12 @@ uint16_t mlx5_tx_burst_vec(void *dpdk_txq, struct rte_mbuf **pkts,
>  uint16_t mlx5_rx_burst_vec(void *dpdk_txq, struct rte_mbuf **pkts,
>  			   uint16_t pkts_n);
>
> +/* mlx5_mr.c */
> +
> +void mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl);
> +uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
> +uint32_t mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr);
> +
>  #ifndef NDEBUG
>  /**
>   * Verify or set magic value in CQE.
> @@ -493,14 +502,66 @@ mlx5_tx_complete(struct mlx5_txq_data *txq)
>  	*txq->cq_db = rte_cpu_to_be_32(cq_ci);
>  }
>
> +/**
> + * Query LKey from a packet buffer for Rx. No need to flush local caches for Rx
> + * as mempool is pre-configured and static.
> + *
> + * @param rxq
> + *   Pointer to Rx queue structure.
> + * @param addr
> + *   Address to search.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
>  static __rte_always_inline uint32_t
> -mlx5_tx_mb2mr(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
> +mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
>  {
> -	(void)txq;
> -	(void)mb;
> -	return UINT32_MAX;
> +	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
> +	uint32_t lkey;
> +
> +	/* Linear search on MR cache array. */
> +	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
> +				    MLX5_MR_CACHE_N, addr);
> +	if (likely(lkey != UINT32_MAX))
> +		return lkey;
> +	/* Take slower bottom-half (Binary Search) on miss. */
> +	return mlx5_rx_addr2mr_bh(rxq, addr);
>  }
>
> +#define mlx5_rx_mb2mr(rxq, mb) mlx5_rx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
> +
> +/**
> + * Query LKey from a packet buffer for Tx. If not found, add the mempool.
> + *
> + * @param txq
> + *   Pointer to Tx queue structure.
> + * @param addr
> + *   Address to search.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static __rte_always_inline uint32_t
> +mlx5_tx_addr2mr(struct mlx5_txq_data *txq, uintptr_t addr)
> +{
> +	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
> +	uint32_t lkey;
> +
> +	/* Check generation bit to see if there's any change on existing MRs. */
> +	if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
> +		mlx5_mr_flush_local_cache(mr_ctrl);
> +	/* Linear search on MR cache array. */
> +	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
> +				    MLX5_MR_CACHE_N, addr);
> +	if (likely(lkey != UINT32_MAX))
> +		return lkey;
> +		return lkey;
> +	/* Take slower bottom-half (binary search) on miss. */
> +	return mlx5_tx_addr2mr_bh(txq, addr);
> +}
> +
> +#define mlx5_tx_mb2mr(rxq, mb) mlx5_tx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
> +
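
Nit: the mlx5_tx_mb2mr() macro names its parameter rxq even though it
receives a Tx queue. Apart from that, just to confirm my reading of the
dev_gen/cur_gen handshake between the memory event callback and the Tx data
path, here is a simplified standalone model (struct names, the flush body
and main() are mine, not the driver's):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHE_N 8

/* Device-wide generation number, bumped by the memory event callback. */
struct dev_state {
	uint32_t dev_gen;
};

/* Per-queue cache control, mirroring mlx5_mr_ctrl in spirit. */
struct queue_cache {
	uint32_t *dev_gen_ptr;      /* Points at dev_state.dev_gen. */
	uint32_t cur_gen;           /* Generation the local cache was built for. */
	uintptr_t entries[CACHE_N]; /* Stand-in for the per-queue MR cache. */
	uint16_t len;
};

/* Control path: memory layout changed, invalidate every queue lazily. */
static void
memory_event(struct dev_state *dev)
{
	/* The real code must also order this store against MR destruction. */
	dev->dev_gen++;
}

/* Data path: flush the local cache only when the generation moved on. */
static void
maybe_flush(struct queue_cache *q)
{
	if (*q->dev_gen_ptr != q->cur_gen) {
		memset(q->entries, 0, sizeof(q->entries));
		q->len = 0;
		q->cur_gen = *q->dev_gen_ptr;
	}
}

int
main(void)
{
	struct dev_state dev = { .dev_gen = 0 };
	struct queue_cache q = { .dev_gen_ptr = &dev.dev_gen };

	q.entries[0] = 0x1000;
	q.len = 1;
	maybe_flush(&q);    /* No event yet: cache kept. */
	printf("len=%u\n", (unsigned)q.len);
	memory_event(&dev); /* E.g. a memseg was freed. */
	maybe_flush(&q);    /* Mismatch detected: cache dropped. */
	printf("len=%u\n", (unsigned)q.len);
	return 0;
}
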
>  /**
>   * Ring TX queue doorbell and flush the update if requested.
>   *
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
> index 56c5a1b0c..76678a820 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
> @@ -99,9 +99,13 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq, uint16_t n)
>  		rxq->stats.rx_nombuf += n;
>  		return;
>  	}
> -	for (i = 0; i < n; ++i)
> +	for (i = 0; i < n; ++i) {
>  		wq[i].addr = rte_cpu_to_be_64((uintptr_t)elts[i]->buf_addr +
>  					      RTE_PKTMBUF_HEADROOM);
> +		/* If there's only one MR, no need to replace LKey in WQE. */
> +		if (unlikely(!IS_SINGLE_MR(rxq->mr_ctrl.cache_bh.len)))
> +			wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
> +	}
>  	rxq->rq_ci += n;
>  	/* Prevent overflowing into consumed mbufs. */
>  	elts_idx = rxq->rq_ci & q_mask;
> diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
> index 3db6c3f35..36b7c9e2f 100644
> --- a/drivers/net/mlx5/mlx5_trigger.c
> +++ b/drivers/net/mlx5/mlx5_trigger.c
> @@ -104,9 +104,18 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
>
>  	for (i = 0; i != priv->rxqs_n; ++i) {
>  		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_get(dev, i);
> +		struct rte_mempool *mp;
>
>  		if (!rxq_ctrl)
>  			continue;
> +		/* Pre-register Rx mempool. */
> +		mp = rxq_ctrl->rxq.mp;
> +		DRV_LOG(DEBUG,
> +			"port %u Rx queue %u registering"
> +			" mp %s having %u chunks",
> +			dev->data->port_id, rxq_ctrl->idx,
> +			mp->name, mp->nb_mem_chunks);
> +		mlx5_mr_update_mp(dev, &rxq_ctrl->rxq.mr_ctrl, mp);
>  		ret = rxq_alloc_elts(rxq_ctrl);
>  		if (ret)
>  			goto error;
> @@ -154,6 +163,8 @@ mlx5_dev_start(struct rte_eth_dev *dev)
>  			dev->data->port_id, strerror(rte_errno));
>  		goto error;
>  	}
> +	if (rte_log_get_level(mlx5_logtype) == RTE_LOG_DEBUG)
> +		mlx5_mr_dump_dev(dev);
>  	ret = mlx5_rx_intr_vec_enable(dev);
>  	if (ret) {
>  		DRV_LOG(ERR, "port %u Rx interrupt vector creation failed",
> diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
> index a71f3d0f0..9ce6f2098 100644
> --- a/drivers/net/mlx5/mlx5_txq.c
> +++ b/drivers/net/mlx5/mlx5_txq.c
> @@ -804,6 +804,13 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
>  		rte_errno = ENOMEM;
>  		return NULL;
>  	}
> +	if (mlx5_mr_btree_init(&tmpl->txq.mr_ctrl.cache_bh,
> +			       MLX5_MR_BTREE_CACHE_N, socket)) {
> +		/* rte_errno is already set. */
> +		goto error;
> +	}
> +	/* Save pointer of global generation number to check memory event. */
> +	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->mr.dev_gen;
>  	assert(desc > MLX5_TX_COMP_THRESH);
>  	tmpl->txq.offloads = conf->offloads;
>  	tmpl->priv = priv;
> @@ -823,6 +830,9 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
>  		idx, rte_atomic32_read(&tmpl->refcnt));
>  	LIST_INSERT_HEAD(&priv->txqsctrl, tmpl, next);
>  	return tmpl;
> +error:
> +	rte_free(tmpl);
> +	return NULL;
>  }
>
>  /**
> @@ -882,6 +892,7 @@ mlx5_txq_release(struct rte_eth_dev *dev, uint16_t idx)
>  			dev->data->port_id, txq->idx,
>  			rte_atomic32_read(&txq->refcnt));
>  		txq_free_elts(txq);
> +		mlx5_mr_btree_free(&txq->txq.mr_ctrl.cache_bh);
>  		LIST_REMOVE(txq, next);
>  		rte_free(txq);
>  		(*priv->txqs)[idx] = NULL;
> --
> 2.11.0