From mboxrd@z Thu Jan 1 00:00:00 1970
From: Shahaf Shuler <shahafs@mellanox.com>
To: Yongseok Koh, Adrien Mazarguil, Nélio Laranjeiro
Cc: dev@dpdk.org, Yongseok Koh
Date: Sun, 6 May 2018 12:53:18 +0000
Subject: Re: [dpdk-dev] [PATCH 3/5] net/mlx5: add new Memory Region support
In-Reply-To: <20180502231654.7596-4-yskoh@mellanox.com>
References: <20180502231654.7596-1-yskoh@mellanox.com>
 <20180502231654.7596-4-yskoh@mellanox.com>
Hi Koh,

Huge work. It takes (and will take) me some time to process.
In the meanwhile, find some small comments.

As this design heavily relies on synchronization (cache flush) between the
control thread and the data-path threads, along with possible deadlocks from
the memory hotplug events, the documentation is critical. Otherwise future
work will introduce heavy bugs.

Thursday, May 3, 2018 2:17 AM, Yongseok Koh:
> Subject: [dpdk-dev] [PATCH 3/5] net/mlx5: add new Memory Region support
>
> This is the new design of Memory Region (MR) for mlx PMD, in order to:
> - Accommodate the new memory hotplug model.
> - Support non-contiguous Mempool.

This commit log is missing a lot of details about the design that you did.
You must make it clear for every Mellanox PMD developer.

Just to make sure I understand all the details, we have the following
(a small pseudo-code sketch of how I read this lookup order follows the
first quoted hunks below):
1. Cache (L0) per rxq/txq of size MLX5_MR_CACHE_N, searched starting from
   the MRU entry with a fallback to linear search.
2. B-tree (L1) per rxq/txq of dynamic size, searched using binary search.
   This is what you refer to as the bottom half, right?
3. Global MR cache (L2) per device of dynamic size (?)
4.
list of all MRs (L3) per device.=20 >=20 > Signed-off-by: Yongseok Koh > --- > drivers/net/mlx5/mlx5.c | 45 ++ > drivers/net/mlx5/mlx5.h | 22 + > drivers/net/mlx5/mlx5_defs.h | 6 + > drivers/net/mlx5/mlx5_ethdev.c | 16 + > drivers/net/mlx5/mlx5_mr.c | 1194 > ++++++++++++++++++++++++++++++++++++++ > drivers/net/mlx5/mlx5_mr.h | 121 ++++ > drivers/net/mlx5/mlx5_rxq.c | 8 +- > drivers/net/mlx5/mlx5_rxtx.c | 3 + > drivers/net/mlx5/mlx5_rxtx.h | 73 ++- > drivers/net/mlx5/mlx5_rxtx_vec.h | 6 +- > drivers/net/mlx5/mlx5_trigger.c | 11 + > drivers/net/mlx5/mlx5_txq.c | 11 + > 12 files changed, 1508 insertions(+), 8 deletions(-) > create mode 100644 drivers/net/mlx5/mlx5_mr.h >=20 > diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c > index 01d554758..2883f20af 100644 > --- a/drivers/net/mlx5/mlx5.c > +++ b/drivers/net/mlx5/mlx5.c > @@ -41,6 +41,7 @@ > #include "mlx5_autoconf.h" > #include "mlx5_defs.h" > #include "mlx5_glue.h" > +#include "mlx5_mr.h" >=20 > /* Device parameter to enable RX completion queue compression. */ > #define MLX5_RXQ_CQE_COMP_EN "rxq_cqe_comp_en" > @@ -84,10 +85,49 @@ > #define MLX5DV_CONTEXT_FLAGS_CQE_128B_COMP (1 << 4) > #endif >=20 > +static const char *MZ_MLX5_PMD_SHARED_DATA =3D > "mlx5_pmd_shared_data"; > + > +/* Shared memory between primary and secondary processes. */ > +struct mlx5_shared_data *mlx5_shared_data; > + > +/* Spinlock for mlx5_shared_data allocation. */ > +static rte_spinlock_t mlx5_shared_data_lock =3D RTE_SPINLOCK_INITIALIZER= ; > + > /** Driver-specific log messages type. */ > int mlx5_logtype; >=20 > /** > + * Prepare shared data between primary and secondary process. > + */ > +static void > +mlx5_prepare_shared_data(void) > +{ > + const struct rte_memzone *mz; > + > + rte_spinlock_lock(&mlx5_shared_data_lock); > + if (mlx5_shared_data =3D=3D NULL) { > + if (rte_eal_process_type() =3D=3D RTE_PROC_PRIMARY) { > + /* Allocate shared memory. */ > + mz =3D > rte_memzone_reserve(MZ_MLX5_PMD_SHARED_DATA, > + sizeof(*mlx5_shared_data), > + SOCKET_ID_ANY, 0); > + } else { > + /* Lookup allocated shared memory. */ > + mz =3D > rte_memzone_lookup(MZ_MLX5_PMD_SHARED_DATA); > + } > + if (mz =3D=3D NULL) > + rte_panic("Cannot allocate mlx5 shared data\n"); > + mlx5_shared_data =3D mz->addr; > + /* Initialize shared data. */ > + if (rte_eal_process_type() =3D=3D RTE_PROC_PRIMARY) { > + LIST_INIT(&mlx5_shared_data- > >mem_event_cb_list); > + rte_rwlock_init(&mlx5_shared_data- > >mem_event_rwlock); > + } > + } > + rte_spinlock_unlock(&mlx5_shared_data_lock); > +} > + Can you elaborate why mlx5_shared_data can't be part of priv?=20 Priv is already allocated on the shared memory and rte_eth_dev layer enforc= e the secondary process creation as part of the rte_eth_dev_data allocation= .=20 > +/** > * Retrieve integer value from environment variable. > * > * @param[in] name > @@ -201,6 +241,7 @@ mlx5_dev_close(struct rte_eth_dev *dev) > priv->txqs =3D NULL; > } > mlx5_flow_delete_drop_queue(dev); > + mlx5_mr_release(dev); > if (priv->pd !=3D NULL) { > assert(priv->ctx !=3D NULL); > claim_zero(mlx5_glue->dealloc_pd(priv->pd)); > @@ -633,6 +674,8 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv > __rte_unused, > struct ibv_counter_set_description cs_desc; > #endif >=20 > + /* Prepare shared data between primary and secondary process. */ > + mlx5_prepare_shared_data(); > assert(pci_drv =3D=3D &mlx5_driver); > /* Get mlx5_dev[] index. 
*/ > idx =3D mlx5_dev_idx(&pci_dev->addr); > @@ -1293,6 +1336,8 @@ rte_mlx5_pmd_init(void) > } > mlx5_glue->fork_init(); > rte_pci_register(&mlx5_driver); > + rte_mem_event_callback_register("MLX5_MEM_EVENT_CB", > + mlx5_mr_mem_event_cb); mlx5_mr_mem_event_cb requires PMD private structure. Is registering for the= cb on the init makes sense? It looks like a better place is the PCI probe,= after the eth_dev allocation.=20 > } >=20 > RTE_PMD_EXPORT_NAME(net_mlx5, __COUNTER__); > diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h > index 47d266c90..d3fc74dc1 100644 > --- a/drivers/net/mlx5/mlx5.h > +++ b/drivers/net/mlx5/mlx5.h > @@ -26,11 +26,13 @@ > #include > #include > #include > +#include > #include > #include > #include >=20 > #include "mlx5_utils.h" > +#include "mlx5_mr.h" > #include "mlx5_rxtx.h" > #include "mlx5_autoconf.h" > #include "mlx5_defs.h" > @@ -50,6 +52,16 @@ enum { > PCI_DEVICE_ID_MELLANOX_CONNECTX5EXVF =3D 0x101a, > }; >=20 > +LIST_HEAD(mlx5_dev_list, priv); > + > +/* Shared memory between primary and secondary processes. */ > +struct mlx5_shared_data { > + struct mlx5_dev_list mem_event_cb_list; > + rte_rwlock_t mem_event_rwlock; > +}; > + > +extern struct mlx5_shared_data *mlx5_shared_data; > + > struct mlx5_xstats_ctrl { > /* Number of device stats. */ > uint16_t stats_n; > @@ -119,7 +131,10 @@ struct mlx5_verbs_alloc_ctx { > const void *obj; /* Pointer to the DPDK object. */ > }; >=20 > +LIST_HEAD(mlx5_mr_list, mlx5_mr); > + > struct priv { > + LIST_ENTRY(priv) mem_event_cb; /* Called by memory event > callback. */ > struct rte_eth_dev_data *dev_data; /* Pointer to device data. */ > struct ibv_context *ctx; /* Verbs context. */ > struct ibv_device_attr_ex device_attr; /* Device properties. */ > @@ -146,6 +161,13 @@ struct priv { > struct mlx5_hrxq_drop *flow_drop_queue; /* Flow drop queue. */ > struct mlx5_flows flows; /* RTE Flow rules. */ > struct mlx5_flows ctrl_flows; /* Control flow rules. */ > + struct { > + uint32_t dev_gen; /* Generation number to flush local > caches. */ > + rte_rwlock_t rwlock; /* MR Lock. */ > + struct mlx5_mr_btree cache; /* Global MR cache table. */ > + struct mlx5_mr_list mr_list; /* Registered MR list. */ > + struct mlx5_mr_list mr_free_list; /* Freed MR list. */ > + } mr; > LIST_HEAD(rxq, mlx5_rxq_ctrl) rxqsctrl; /* DPDK Rx queues. */ > LIST_HEAD(rxqibv, mlx5_rxq_ibv) rxqsibv; /* Verbs Rx queues. */ > LIST_HEAD(hrxq, mlx5_hrxq) hrxqs; /* Verbs Hash Rx queues. */ > diff --git a/drivers/net/mlx5/mlx5_defs.h b/drivers/net/mlx5/mlx5_defs.h > index f9093777d..72e80af26 100644 > --- a/drivers/net/mlx5/mlx5_defs.h > +++ b/drivers/net/mlx5/mlx5_defs.h > @@ -37,6 +37,12 @@ > */ > #define MLX5_TX_COMP_THRESH_INLINE_DIV (1 << 3) >=20 > +/* Size of per-queue MR cache array for linear search. */ > +#define MLX5_MR_CACHE_N 8 > + > +/* Size of MR cache table for binary search. */ > +#define MLX5_MR_BTREE_CACHE_N 256 > + > /* > * If defined, only use software counters. The PMD will never ask the > hardware > * for these, and many of them won't be available. 
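
To double-check my reading of the four lookup levels I listed above, here is
a rough, compilable sketch of the per-address LKey resolution order as I
understand it. All names below are placeholders of mine, not the patch's API,
and the lower levels are stubbed out:

#include <stdint.h>
#include <stddef.h>

#define MLX5_MR_CACHE_N 8
#define LKEY_MISS UINT32_MAX

struct mr_cache_entry { uintptr_t start, end; uint32_t lkey; };

/* L1 per-queue B-tree and L2 global cache / L3 MR list + create, stubbed. */
static uint32_t lookup_btree_bh(uintptr_t addr) { (void)addr; return LKEY_MISS; }
static uint32_t lookup_global_or_create(uintptr_t addr) { (void)addr; return LKEY_MISS; }

/* L0: per-queue linear array, most-recently-used entry checked first. */
static uint32_t
lookup_linear(struct mr_cache_entry *c, uint16_t *mru, uintptr_t addr)
{
	if (addr >= c[*mru].start && addr < c[*mru].end)
		return c[*mru].lkey;
	for (uint16_t i = 0; i < MLX5_MR_CACHE_N && c[i].start != 0; ++i) {
		if (addr >= c[i].start && addr < c[i].end) {
			*mru = i;
			return c[i].lkey;
		}
	}
	return LKEY_MISS;
}

/* Per-mbuf lookup: L0 -> L1 -> L2 -> L3 (new MR is created on total miss). */
static uint32_t
addr2mr(struct mr_cache_entry *c, uint16_t *mru, uintptr_t addr)
{
	uint32_t lkey = lookup_linear(c, mru, addr);	/* L0 */
	if (lkey != LKEY_MISS)
		return lkey;
	lkey = lookup_btree_bh(addr);			/* L1 */
	if (lkey != LKEY_MISS)
		return lkey;
	return lookup_global_or_create(addr);		/* L2 -> L3 */
}

If this is the intended order, please spell it out in the commit log as well.
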
> diff --git a/drivers/net/mlx5/mlx5_ethdev.c > b/drivers/net/mlx5/mlx5_ethdev.c > index 746b94f73..6bb43cf4e 100644 > --- a/drivers/net/mlx5/mlx5_ethdev.c > +++ b/drivers/net/mlx5/mlx5_ethdev.c > @@ -34,6 +34,7 @@ > #include > #include > #include > +#include >=20 > #include "mlx5.h" > #include "mlx5_glue.h" > @@ -413,6 +414,21 @@ mlx5_dev_configure(struct rte_eth_dev *dev) > if (++j =3D=3D rxqs_n) > j =3D 0; > } > + /* > + * Once the device is added to the list of memory event callback, its > + * global MR cache table cannot be expanded on the fly because of > + * deadlock. If it overflows, lookup should be done by searching MR > list > + * linearly, which is slow. > + */ > + if (mlx5_mr_btree_init(&priv->mr.cache, > MLX5_MR_BTREE_CACHE_N * 2, Why multiple by 2? Because it holds all the rxq/txq mrs?=20 > + dev->device->numa_node)) { > + /* rte_errno is already set. */ > + return -rte_errno; > + } > + rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock); > + LIST_INSERT_HEAD(&mlx5_shared_data->mem_event_cb_list, > + priv, mem_event_cb); > + rte_rwlock_write_unlock(&mlx5_shared_data- > >mem_event_rwlock); > return 0; > } Why registration is done only on configure and not on probe after priv init= ialization?=20 >=20 > diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c > index 736c40ae4..e964912bb 100644 > --- a/drivers/net/mlx5/mlx5_mr.c > +++ b/drivers/net/mlx5/mlx5_mr.c > @@ -13,8 +13,1202 @@ >=20 > #include > #include > +#include >=20 > #include "mlx5.h" > +#include "mlx5_mr.h" > #include "mlx5_rxtx.h" > #include "mlx5_glue.h" >=20 > +struct mr_find_contig_memsegs_data { > + uintptr_t addr; > + uintptr_t start; > + uintptr_t end; > + const struct rte_memseg_list *msl; > +}; > + > +struct mr_update_mp_data { > + struct rte_eth_dev *dev; > + struct mlx5_mr_ctrl *mr_ctrl; > + int ret; > +}; > + > +/** > + * Expand B-tree table to a given size. Can't be called with holding > + * memory_hotplug_lock or priv->mr.rwlock due to rte_realloc(). > + * > + * @param bt > + * Pointer to B-tree structure. > + * @param n > + * Number of entries for expansion. > + * > + * @return > + * 0 on success, -1 on failure. > + */ > +static int > +mr_btree_expand(struct mlx5_mr_btree *bt, int n) > +{ > + void *mem; > + int ret =3D 0; > + > + if (n <=3D bt->size) > + return ret; > + /* > + * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is > + * used inside if there's no room to expand. Because this is a quite > + * rare case and a part of very slow path, it is very acceptable. > + * Initially cache_bh[] will be given practically enough space and once > + * it is expanded, expansion wouldn't be needed again ever. > + */ > + mem =3D rte_realloc(bt->table, n * sizeof(struct mlx5_mr_cache), 0); > + if (mem =3D=3D NULL) { > + /* Not an error, B-tree search will be skipped. */ > + DRV_LOG(WARNING, "failed to expand MR B-tree (%p) > table", > + (void *)bt); DRV_LOG should have the port id of the device. For all of the DRV_LOG insta= nces in the patch.=20 Per my understating it falls back to the old bt in case the expansion faile= d, right? Bt searches will still happen.=20 > + ret =3D -1; > + } else { > + DRV_LOG(DEBUG, "expanded MR B-tree table (size=3D%u)", > n); > + bt->table =3D mem; > + bt->size =3D n; > + } > + return ret; > +} > + > +/** > + * Look up LKey from given B-tree lookup table, store the last index and > return > + * searched LKey. > + * > + * @param bt > + * Pointer to B-tree structure. > + * @param[out] idx > + * Pointer to index. 
Even on searh failure, returns index where it sto= ps Searh->search=20 > + * searching so that index can be used when inserting a new entry. > + * @param addr > + * Search key. > + * > + * @return > + * Searched LKey on success, UINT32_MAX on no match. > + */ > +static uint32_t > +mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr) > +{ > + struct mlx5_mr_cache *lkp_tbl; > + uint16_t n; > + uint16_t base =3D 0; > + > + assert(bt !=3D NULL); > + lkp_tbl =3D *bt->table; > + n =3D bt->len; > + /* First entry must be NULL for comparison. */ > + assert(bt->len > 0 || (lkp_tbl[0].start =3D=3D 0 && > + lkp_tbl[0].lkey =3D=3D UINT32_MAX)); > + /* Binary search. */ > + do { > + register uint16_t delta =3D n >> 1; > + > + if (addr < lkp_tbl[base + delta].start) { > + n =3D delta; > + } else { > + base +=3D delta; > + n -=3D delta; > + } > + } while (n > 1); > + assert(addr >=3D lkp_tbl[base].start); > + *idx =3D base; > + if (addr < lkp_tbl[base].end) > + return lkp_tbl[base].lkey; > + /* Not found. */ > + return UINT32_MAX; > +} > + > +/** > + * Insert an entry to B-tree lookup table. > + * > + * @param bt > + * Pointer to B-tree structure. > + * @param entry > + * Pointer to new entry to insert. > + * > + * @return > + * 0 on success, -1 on failure. > + */ > +static int > +mr_btree_insert(struct mlx5_mr_btree *bt, struct mlx5_mr_cache *entry) > +{ > + struct mlx5_mr_cache *lkp_tbl; > + uint16_t idx =3D 0; > + size_t shift; > + > + assert(bt !=3D NULL); > + assert(bt->len <=3D bt->size); > + assert(bt->len > 0); > + lkp_tbl =3D *bt->table; > + /* Find out the slot for insertion. */ > + if (mr_btree_lookup(bt, &idx, entry->start) !=3D UINT32_MAX) { > + DRV_LOG(DEBUG, > + "abort insertion to B-tree(%p):" > + " already exist at idx=3D%u [0x%lx, 0x%lx) lkey=3D0x%x", > + (void *)bt, idx, entry->start, entry->end, entry- > >lkey); > + /* Already exist, return. */ > + return 0; > + } > + /* If table is full, return error. */ > + if (unlikely(bt->len =3D=3D bt->size)) { > + bt->overflow =3D 1; > + return -1; > + } > + /* Insert entry. */ > + ++idx; > + shift =3D (bt->len - idx) * sizeof(struct mlx5_mr_cache); > + if (shift) > + memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift); > + lkp_tbl[idx] =3D *entry; > + bt->len++; > + DRV_LOG(DEBUG, > + "inserted B-tree(%p)[%u], [0x%lx, 0x%lx) lkey=3D0x%x", > + (void *)bt, idx, entry->start, entry->end, entry->lkey); > + return 0; > +} Can you elaborate on how you make sure the btree is always sorted based on = the start addr?=20 > + > +/** > + * Initialize B-tree and allocate memory for lookup table. > + * > + * @param bt > + * Pointer to B-tree structure. > + * @param n > + * Number of entries to allocate. > + * @param socket > + * NUMA socket on which memory must be allocated. > + * > + * @return > + * 0 on success, a negative errno value otherwise and rte_errno is set= . > + */ > +int > +mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket) > +{ > + if (bt =3D=3D NULL) { > + rte_errno =3D EINVAL; > + return -rte_errno; > + } > + memset(bt, 0, sizeof(*bt)); > + bt->table =3D rte_calloc_socket("B-tree table", > + n, sizeof(struct mlx5_mr_cache), > + 0, socket); > + if (bt->table =3D=3D NULL) { > + rte_errno =3D ENOMEM; > + DRV_LOG(ERR, > + "failed to allocate memory for btree cache on socket > %d", > + socket); > + return -rte_errno; > + } > + bt->size =3D n; > + /* First entry must be NULL for binary search. 
*/ > + (*bt->table)[bt->len++] =3D (struct mlx5_mr_cache) { > + .lkey =3D UINT32_MAX, > + }; > + DRV_LOG(DEBUG, "initialized B-tree %p with table %p", > + (void *)bt, (void *)bt->table); > + return 0; > +} > + > +/** > + * Free B-tree resources. > + * > + * @param bt > + * Pointer to B-tree structure. > + */ > +void > +mlx5_mr_btree_free(struct mlx5_mr_btree *bt) > +{ > + if (bt =3D=3D NULL) > + return; > + DRV_LOG(DEBUG, "freeing B-tree %p with table %p", > + (void *)bt, (void *)bt->table); > + rte_free(bt->table); > + memset(bt, 0, sizeof(*bt)); > +} > + > +/** > + * Dump all the entries in a B-tree > + * > + * @param bt > + * Pointer to B-tree structure. > + */ > +static void > +mlx5_mr_btree_dump(struct mlx5_mr_btree *bt) > +{ > + int idx; > + struct mlx5_mr_cache *lkp_tbl; > + > + if (bt =3D=3D NULL) > + return; > + lkp_tbl =3D *bt->table; > + for (idx =3D 0; idx < bt->len; ++idx) { > + struct mlx5_mr_cache *entry =3D &lkp_tbl[idx]; > + > + DRV_LOG(DEBUG, > + "B-tree(%p)[%u], [0x%lx, 0x%lx) lkey=3D0x%x", > + (void *)bt, idx, entry->start, entry->end, entry- > >lkey); > + } > +} > + > +/** > + * Find virtually contiguous memory chunk in a given MR. > + * > + * @param dev > + * Pointer to MR structure. > + * @param[out] entry > + * Pointer to returning MR cache entry. If not found, this will not be > + * updated. > + * @param start_idx > + * Start index of the memseg bitmap. > + * > + * @return > + * Next index to go on lookup. > + */ > +static int > +mr_find_next_chunk(struct mlx5_mr *mr, struct mlx5_mr_cache *entry, > + int base_idx) > +{ > + uintptr_t start =3D 0; > + uintptr_t end =3D 0; > + uint32_t idx =3D 0; > + > + for (idx =3D base_idx; idx < mr->ms_bmp_n; ++idx) { > + if (rte_bitmap_get(mr->ms_bmp, idx)) { > + const struct rte_memseg_list *msl; > + const struct rte_memseg *ms; > + > + msl =3D mr->msl; > + ms =3D rte_fbarray_get(&msl->memseg_arr, > + mr->ms_base_idx + idx); > + assert(msl->page_sz =3D=3D ms->hugepage_sz); > + if (!start) > + start =3D ms->addr_64; > + end =3D ms->addr_64 + ms->hugepage_sz; > + } else if (start) { > + /* Passed the end of a fragment. */ > + break; > + } > + } > + if (start) { > + /* Found one chunk. */ > + entry->start =3D start; > + entry->end =3D end; > + entry->lkey =3D rte_cpu_to_be_32(mr->ibv_mr->lkey); > + } > + return idx; > +} > + > +/** > + * Insert a MR to the global B-tree cache. It may fail due to low-on-mem= ory. > + * Then, this entry will have to be searched by mr_lookup_dev_list() in > + * mlx5_mr_create() on miss. > + * > + * @param dev > + * Pointer to Ethernet device. > + * @param mr > + * Pointer to MR to insert. > + * > + * @return > + * 0 on success, -1 on failure. > + */ > +static int > +mr_insert_dev_cache(struct rte_eth_dev *dev, struct mlx5_mr *mr) > +{ > + struct priv *priv =3D dev->data->dev_private; > + unsigned int n; > + > + DRV_LOG(DEBUG, "port %u inserting MR(%p) to global cache", > + dev->data->port_id, (void *)mr); > + for (n =3D 0; n < mr->ms_bmp_n; ) { > + struct mlx5_mr_cache entry =3D { 0, }; > + > + /* Find a contiguous chunk and advance the index. */ > + n =3D mr_find_next_chunk(mr, &entry, n); > + if (!entry.end) > + break; > + if (mr_btree_insert(&priv->mr.cache, &entry) < 0) { > + /* > + * Overflowed, but the global table cannot be > expanded > + * because of deadlock. > + */ > + return -1; > + } > + } > + return 0; > +} > + > +/** > + * Look up address in the original global MR list. > + * > + * @param dev > + * Pointer to Ethernet device. 
> + * @param[out] entry > + * Pointer to returning MR cache entry. If no match, this will not be > updated. > + * @param addr > + * Search key. > + * > + * @return > + * Found MR on match, NULL otherwise. > + */ > +static struct mlx5_mr * > +mr_lookup_dev_list(struct rte_eth_dev *dev, struct mlx5_mr_cache > *entry, > + uintptr_t addr) > +{ > + struct priv *priv =3D dev->data->dev_private; > + struct mlx5_mr *mr; > + > + /* Iterate all the existing MRs. */ > + LIST_FOREACH(mr, &priv->mr.mr_list, mr) { > + unsigned int n; > + > + if (mr->ms_n =3D=3D 0) > + continue; > + for (n =3D 0; n < mr->ms_bmp_n; ) { > + struct mlx5_mr_cache ret =3D { 0, }; > + > + n =3D mr_find_next_chunk(mr, &ret, n); > + if (addr >=3D ret.start && addr < ret.end) { > + /* Found. */ > + *entry =3D ret; > + return mr; > + } > + } > + } > + return NULL; > +} > + > +/** > + * Look up address on device. > + * > + * @param dev > + * Pointer to Ethernet device. > + * @param[out] entry > + * Pointer to returning MR cache entry. If no match, this will not be > updated. > + * @param addr > + * Search key. > + * > + * @return > + * Searched LKey on success, UINT32_MAX on failure and rte_errno is se= t. > + */ > +static uint32_t > +mr_lookup_dev(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry, > + uintptr_t addr) > +{ > + struct priv *priv =3D dev->data->dev_private; > + uint16_t idx; > + uint32_t lkey =3D UINT32_MAX; > + struct mlx5_mr *mr; > + > + /* > + * If the global cache has overflowed since it failed to expand the > + * B-tree table, it can't have all the exisitng MRs. Then, the address > + * has to be searched by traversing the original MR list instead, which > + * is very slow path. Otherwise, the global cache is all inclusive. > + */ > + if (!unlikely(priv->mr.cache.overflow)) { > + lkey =3D mr_btree_lookup(&priv->mr.cache, &idx, addr); > + if (lkey !=3D UINT32_MAX) > + *entry =3D (*priv->mr.cache.table)[idx]; > + } else { > + /* Falling back to the slowest path. */ > + mr =3D mr_lookup_dev_list(dev, entry, addr); > + if (mr !=3D NULL) > + lkey =3D entry->lkey; > + } > + assert(lkey =3D=3D UINT32_MAX || (addr >=3D entry->start && > + addr < entry->end)); > + return lkey; > +} > + > +/** > + * Free MR resources. MR lock must not be held to avoid a deadlock. > rte_free() > + * can raise memory free event and the callback function will spin on th= e > lock. > + * > + * @param mr > + * Pointer to MR to free. > + */ > +static void > +mr_free(struct mlx5_mr *mr) > +{ > + if (mr =3D=3D NULL) > + return; > + DRV_LOG(DEBUG, "freeing MR(%p):", (void *)mr); > + if (mr->ibv_mr !=3D NULL) > + claim_zero(mlx5_glue->dereg_mr(mr->ibv_mr)); > + if (mr->ms_bmp !=3D NULL) > + rte_bitmap_free(mr->ms_bmp); > + rte_free(mr); > +} > + > +/** > + * Free Memory Region (MR). > + * > + * @param dev > + * Pointer to Ethernet device. > + * @param mr > + * Pointer to MR to free. > + */ > +void > +mlx5_mr_free(struct rte_eth_dev *dev, struct mlx5_mr *mr) > +{ Who calls this function? I didn't saw any.=20 > + struct priv *priv =3D dev->data->dev_private; > + > + /* Detach from the list and free resources later. */ > + rte_rwlock_write_lock(&priv->mr.rwlock); > + LIST_REMOVE(mr, mr); > + rte_rwlock_write_unlock(&priv->mr.rwlock); > + /* > + * rte_free() inside can't be called with holding the lock. This could > + * cause deadlock when calling free callback. > + */ > + mr_free(mr); > + DRV_LOG(DEBUG, "port %u MR(%p) freed", dev->data->port_id, > (void *)mr); > +} > + > +/** > + * Releass resources of detached MR having no online entry. 
> + * > + * @param dev > + * Pointer to Ethernet device. > + */ > +static void > +mlx5_mr_garbage_collect(struct rte_eth_dev *dev) > +{ > + struct priv *priv =3D dev->data->dev_private; > + struct mlx5_mr *mr_next; > + struct mlx5_mr_list free_list =3D LIST_HEAD_INITIALIZER(free_list); > + > + /* Must be called from the primary process. */ > + assert(rte_eal_process_type() =3D=3D RTE_PROC_PRIMARY); Perhaps it is better to have this check not under assert? > + /* > + * MR can't be freed with holding the lock because rte_free() could > call > + * memory free callback function. This will be a deadlock situation. > + */ > + rte_rwlock_write_lock(&priv->mr.rwlock); > + /* Detach the whole free list and release it after unlocking. */ > + free_list =3D priv->mr.mr_free_list; > + LIST_INIT(&priv->mr.mr_free_list); > + rte_rwlock_write_unlock(&priv->mr.rwlock); > + /* Release resources. */ > + mr_next =3D LIST_FIRST(&free_list); > + while (mr_next !=3D NULL) { > + struct mlx5_mr *mr =3D mr_next; > + > + mr_next =3D LIST_NEXT(mr, mr); > + mr_free(mr); > + } > +} > + > +/* Called during rte_memseg_contig_walk() by mlx5_mr_create(). */ > +static int > +mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl, > + const struct rte_memseg *ms, size_t len, void *arg) > +{ > + struct mr_find_contig_memsegs_data *data =3D arg; > + > + if (data->addr < ms->addr_64 || data->addr >=3D ms->addr_64 + len) > + return 0; > + /* Found, save it and stop walking. */ > + data->start =3D ms->addr_64; > + data->end =3D ms->addr_64 + len; > + data->msl =3D msl; > + return 1; > +} > + > +/** > + * Create a new global Memroy Region (MR) for a missing virtual address. > + * Register entire virtually contiguous memory chunk around the address. > + * > + * @param dev > + * Pointer to Ethernet device. > + * @param[out] entry > + * Pointer to returning MR cache entry, found in the global cache or n= ewly > + * created. If failed to create one, this will not be updated. > + * @param addr > + * Target virtual address to register. > + * > + * @return > + * Searched LKey on success, UINT32_MAX on failure and rte_errno is se= t. > + */ > +static uint32_t > +mlx5_mr_create(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry, > + uintptr_t addr) > +{ > + struct priv *priv =3D dev->data->dev_private; > + struct rte_mem_config *mcfg =3D rte_eal_get_configuration()- > >mem_config; > + const struct rte_memseg_list *msl; > + const struct rte_memseg *ms; > + struct mlx5_mr *mr =3D NULL; > + size_t len; > + uint32_t ms_n; > + uint32_t bmp_size; > + void *bmp_mem; > + int ms_idx_shift =3D -1; > + unsigned int n; > + struct mr_find_contig_memsegs_data data =3D { > + .addr =3D addr, > + }; > + struct mr_find_contig_memsegs_data data_re; > + > + DRV_LOG(DEBUG, "port %u creating a MR using address (%p)", > + dev->data->port_id, (void *)addr); > + if (rte_eal_process_type() !=3D RTE_PROC_PRIMARY) { > + DRV_LOG(WARNING, > + "port %u using address (%p) of unregistered > mempool" > + " in secondary process, please create mempool" > + " before rte_eth_dev_start()", > + dev->data->port_id, (void *)addr); > + rte_errno =3D EPERM; > + goto err_nolock; > + } > + /* > + * Release detached MRs if any. This can't be called with holding > either > + * memory_hotplug_lock or priv->mr.rwlock. MRs on the free list > have > + * been detached by the memory free event but it couldn't be > released > + * inside the callback due to deadlock. As a result, releasing resource= s > + * is quite opportunistic. 
> + */ > + mlx5_mr_garbage_collect(dev); > + /* > + * Find out a contiguous virtual address chunk in use, to which the > + * given address belongs, in order to register maximum range. In the > + * best case where mempools are not dynamically recreated and > + * '--socket-mem' is speicified as an EAL option, it is very likely to > + * have only one MR(LKey) per a socket and per a hugepage-size > even > + * though the system memory is highly fragmented. > + */ > + if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, > &data)) { > + DRV_LOG(WARNING, > + "port %u unable to find virtually contigous" > + " chunk for address (%p)." > + " rte_memseg_contig_walk() failed.", > + dev->data->port_id, (void *)addr); > + rte_errno =3D ENXIO; > + goto err_nolock; > + } > +alloc_resources: > + /* Addresses must be page-aligned. */ > + assert(rte_is_aligned((void *)data.start, data.msl->page_sz)); > + assert(rte_is_aligned((void *)data.end, data.msl->page_sz)); Better to have this check outsize of assert.=20 > + msl =3D data.msl; > + ms =3D rte_mem_virt2memseg((void *)data.start, msl); > + len =3D data.end - data.start; > + assert(msl->page_sz =3D=3D ms->hugepage_sz); > + /* Number of memsegs in the range. */ > + ms_n =3D len / msl->page_sz; > + DRV_LOG(DEBUG, > + "port %u extending %p to [0x%lx, 0x%lx), page_sz=3D0x%lx, > ms_n=3D%u", > + dev->data->port_id, (void *)addr, > + data.start, data.end, msl->page_sz, ms_n); > + /* Size of memory for bitmap. */ > + bmp_size =3D rte_bitmap_get_memory_footprint(ms_n); > + mr =3D rte_zmalloc_socket(NULL, > + RTE_ALIGN_CEIL(sizeof(*mr), > + RTE_CACHE_LINE_SIZE) + > + bmp_size, > + RTE_CACHE_LINE_SIZE, msl->socket_id); > + if (mr =3D=3D NULL) { > + DRV_LOG(WARNING, > + "port %u unable to allocate memory for a new MR > of" > + " address (%p).", > + dev->data->port_id, (void *)addr); > + rte_errno =3D ENOMEM; > + goto err_nolock; > + } > + mr->msl =3D msl; > + /* > + * Save the index of the first memseg and initialize memseg bitmap. > To > + * see if a memseg of ms_idx in the memseg-list is still valid, check: > + * rte_bitmap_get(mr->bmp, ms_idx - mr->ms_base_idx) > + */ > + mr->ms_base_idx =3D rte_fbarray_find_idx(&msl->memseg_arr, ms); > + bmp_mem =3D RTE_PTR_ALIGN_CEIL(mr + 1, RTE_CACHE_LINE_SIZE); > + mr->ms_bmp =3D rte_bitmap_init(ms_n, bmp_mem, bmp_size); > + if (mr->ms_bmp =3D=3D NULL) { > + DRV_LOG(WARNING, > + "port %u unable to initialize bitamp for a new MR of" > + " address (%p).", > + dev->data->port_id, (void *)addr); > + rte_errno =3D EINVAL; > + goto err_nolock; > + } > + /* > + * Should recheck whether the extended contiguous chunk is still > valid. > + * Because memory_hotplug_lock can't be held if there's any > memory > + * related calls in a critical path, resource allocation above can't be > + * locked. If the memory has been changed at this point, try again > with > + * just single page. If not, go on with the big chunk atomically from > + * here. > + */ > + rte_rwlock_read_lock(&mcfg->memory_hotplug_lock); > + data_re =3D data; > + if (len > msl->page_sz && > + !rte_memseg_contig_walk(mr_find_contig_memsegs_cb, > &data_re)) { > + DRV_LOG(WARNING, > + "port %u unable to find virtually contigous" > + " chunk for address (%p)." > + " rte_memseg_contig_walk() failed.", > + dev->data->port_id, (void *)addr); > + rte_errno =3D ENXIO; > + goto err_memlock; > + } > + if (data.start !=3D data_re.start || data.end !=3D data_re.end) { > + /* > + * The extended contiguous chunk has been changed. Try > again > + * with single memseg instead. 
> + */ > + data.start =3D RTE_ALIGN_FLOOR(addr, msl->page_sz); > + data.end =3D data.start + msl->page_sz; > + rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock); > + mr_free(mr); > + goto alloc_resources; > + } > + assert(data.msl =3D=3D data_re.msl); > + rte_rwlock_write_lock(&priv->mr.rwlock); > + /* > + * Check the address is really missing. If other thread already created > + * one or it is not found due to overflow, abort and return. > + */ > + if (mr_lookup_dev(dev, entry, addr) !=3D UINT32_MAX) { > + /* > + * Insert to the global cache table. It may fail due to > + * low-on-memory. Then, this entry will have to be searched > + * here again. > + */ > + mr_btree_insert(&priv->mr.cache, entry); > + DRV_LOG(DEBUG, > + "port %u found MR for %p on final lookup, abort", > + dev->data->port_id, (void *)addr); > + rte_rwlock_write_unlock(&priv->mr.rwlock); > + rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock); > + /* > + * Must be unlocked before calling rte_free() because > + * mlx5_mr_mem_event_free_cb() can be called inside. > + */ > + mr_free(mr); > + return entry->lkey; > + } > + /* > + * Trim start and end addresses for verbs MR. Set bits for registering > + * memsegs but exclude already registered ones. Bitmap can be > + * fragmented. > + */ > + for (n =3D 0; n < ms_n; ++n) { > + uintptr_t start; > + struct mlx5_mr_cache ret =3D { 0, }; > + > + start =3D data_re.start + n * msl->page_sz; > + /* Exclude memsegs already registered by other MRs. */ > + if (mr_lookup_dev(dev, &ret, start) =3D=3D UINT32_MAX) { > + /* > + * Start from the first unregistered memseg in the > + * extended range. > + */ > + if (ms_idx_shift =3D=3D -1) { > + mr->ms_base_idx +=3D n; > + data.start =3D start; > + ms_idx_shift =3D n; > + } > + data.end =3D start + msl->page_sz; > + rte_bitmap_set(mr->ms_bmp, n - ms_idx_shift); > + ++mr->ms_n; > + } > + } > + len =3D data.end - data.start; > + mr->ms_bmp_n =3D len / msl->page_sz; > + assert(ms_idx_shift + mr->ms_bmp_n <=3D ms_n); > + /* > + * Finally create a verbs MR for the memory chunk. ibv_reg_mr() can > be > + * called with holding the memory lock because it doesn't use > + * mlx5_alloc_buf_extern() which eventually calls > rte_malloc_socket() > + * through mlx5_alloc_verbs_buf(). > + */ > + mr->ibv_mr =3D mlx5_glue->reg_mr(priv->pd, (void *)data.start, len, > + IBV_ACCESS_LOCAL_WRITE); > + if (mr->ibv_mr =3D=3D NULL) { > + DRV_LOG(WARNING, > + "port %u fail to create a verbs MR for address (%p)", > + dev->data->port_id, (void *)addr); > + rte_errno =3D EINVAL; > + goto err_mrlock; > + } > + assert((uintptr_t)mr->ibv_mr->addr =3D=3D data.start); > + assert(mr->ibv_mr->length =3D=3D len); > + LIST_INSERT_HEAD(&priv->mr.mr_list, mr, mr); > + DRV_LOG(DEBUG, > + "port %u MR CREATED (%p) for %p:\n" > + " [0x%lx, 0x%lx), lkey=3D0x%x base_idx=3D%u ms_n=3D%u, > ms_bmp_n=3D%u", > + dev->data->port_id, (void *)mr, (void *)addr, > + data.start, data.end, rte_cpu_to_be_32(mr->ibv_mr->lkey), > + mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n); > + /* Insert to the global cache table. */ > + mr_insert_dev_cache(dev, mr); > + /* Fill in output data. */ > + mr_lookup_dev(dev, entry, addr); > + /* Lookup can't fail. 
*/ > + assert(entry->lkey !=3D UINT32_MAX); > + rte_rwlock_write_unlock(&priv->mr.rwlock); > + rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock); > + return entry->lkey; > +err_mrlock: > + rte_rwlock_write_unlock(&priv->mr.rwlock); > +err_memlock: > + rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock); > +err_nolock: > + /* > + * In case of error, as this can be called in a datapath, a warning > + * message per an error is preferable instead. Must be unlocked > before > + * calling rte_free() because mlx5_mr_mem_event_free_cb() can be > called > + * inside. > + */ > + mr_free(mr); > + return UINT32_MAX; > +} > + > +/** > + * Rebuild the global B-tree cache of device from the original MR list. > + * > + * @param dev > + * Pointer to Ethernet device. > + */ > +static void > +mr_rebuild_dev_cache(struct rte_eth_dev *dev) > +{ > + struct priv *priv =3D dev->data->dev_private; > + struct mlx5_mr *mr; > + > + DRV_LOG(DEBUG, "port %u rebuild dev cache[]", dev->data- > >port_id); > + /* Flush cache to rebuild. */ > + priv->mr.cache.len =3D 1; > + priv->mr.cache.overflow =3D 0; > + /* Iterate all the existing MRs. */ > + LIST_FOREACH(mr, &priv->mr.mr_list, mr) > + if (mr_insert_dev_cache(dev, mr) < 0) > + return; > +} > + > +/** > + * Callback for memory free event. Iterate freed memsegs and check > whether it > + * belongs to an existing MR. If found, clear the bit from bitmap of MR.= As a > + * result, the MR would be fragmented. If it becomes empty, the MR will = be > freed > + * later by mlx5_mr_garbage_collect(). Even if this callback is called f= rom a > + * secondary process, the garbage collector will be called in primary pr= ocess > + * as the secondary process can't call mlx5_mr_create(). > + * > + * The global cache must be rebuilt if there's any change and this event= has > to > + * be propagated to dataplane threads to flush the local caches. > + * > + * @param dev > + * Pointer to Ethernet device. > + * @param addr > + * Address of freed memory. > + * @param len > + * Size of freed memory. > + */ > +static void > +mlx5_mr_mem_event_free_cb(struct rte_eth_dev *dev, const void *addr, > size_t len) > +{ > + struct priv *priv =3D dev->data->dev_private; > + const struct rte_memseg_list *msl; > + struct mlx5_mr *mr; > + int ms_n; > + int i; > + int rebuild =3D 0; > + > + DRV_LOG(DEBUG, "port %u free callback: addr=3D%p, len=3D%lu", > + dev->data->port_id, addr, len); > + msl =3D rte_mem_virt2memseg_list(addr); > + /* addr and len must be page-aligned. */ > + assert((uintptr_t)addr =3D=3D RTE_ALIGN((uintptr_t)addr, msl- > >page_sz)); > + assert(len =3D=3D RTE_ALIGN(len, msl->page_sz)); > + ms_n =3D len / msl->page_sz; > + rte_rwlock_write_lock(&priv->mr.rwlock); > + /* Clear bits of freed memsegs from MR. */ > + for (i =3D 0; i < ms_n; ++i) { > + const struct rte_memseg *ms; > + struct mlx5_mr_cache entry; > + uintptr_t start; > + int ms_idx; > + uint32_t pos; > + > + /* Find MR having this memseg. 
*/ > + start =3D (uintptr_t)addr + i * msl->page_sz; > + mr =3D mr_lookup_dev_list(dev, &entry, start); > + if (mr =3D=3D NULL) > + continue; > + ms =3D rte_mem_virt2memseg((void *)start, msl); > + assert(ms !=3D NULL); > + assert(msl->page_sz =3D=3D ms->hugepage_sz); > + ms_idx =3D rte_fbarray_find_idx(&msl->memseg_arr, ms); > + pos =3D ms_idx - mr->ms_base_idx; > + assert(rte_bitmap_get(mr->ms_bmp, pos)); > + assert(pos < mr->ms_bmp_n); > + DRV_LOG(DEBUG, "port %u MR(%p): clear bitmap[%u] for > addr %p", > + dev->data->port_id, (void *)mr, pos, (void *)start); > + rte_bitmap_clear(mr->ms_bmp, pos); > + if (--mr->ms_n =3D=3D 0) { > + LIST_REMOVE(mr, mr); > + LIST_INSERT_HEAD(&priv->mr.mr_free_list, mr, mr); > + DRV_LOG(DEBUG, "port %u remove MR(%p) from > list", > + dev->data->port_id, (void *)mr); > + } > + /* > + * MR is fragmented or will be freed. the global cache must > be > + * rebuilt. > + */ > + rebuild =3D 1; > + } > + if (rebuild) { > + mr_rebuild_dev_cache(dev); > + /* > + * Flush local caches by propagating invalidation across cores. > + * rte_smp_wmb() is enough to synchronize this event. If > one of > + * freed memsegs is seen by other core, that means the > memseg > + * has been allocated by allocator, which will come after this > + * free call. Therefore, this store instruction (incrementing > + * generation below) will be guaranteed to be seen by other > core > + * before the core sees the newly allocated memory. > + */ > + ++priv->mr.dev_gen; > + DRV_LOG(DEBUG, "broadcasting local cache flush, gen=3D%d", > + priv->mr.dev_gen); > + rte_smp_wmb(); > + } > + rte_rwlock_write_unlock(&priv->mr.rwlock); > + if (rebuild && rte_log_get_level(mlx5_logtype) =3D=3D RTE_LOG_DEBUG) > + mlx5_mr_dump_dev(dev); > +} > + > +/** > + * Callback for memory event. This can be called from both primary and > secondary > + * process. > + * > + * @param event_type > + * Memory event type. > + * @param addr > + * Address of memory. > + * @param len > + * Size of memory. > + */ > +void > +mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void > *addr, > + size_t len) > +{ > + struct priv *priv; > + struct mlx5_dev_list *dev_list =3D &mlx5_shared_data- > >mem_event_cb_list; > + > + switch (event_type) { > + case RTE_MEM_EVENT_FREE: > + rte_rwlock_write_lock(&mlx5_shared_data- > >mem_event_rwlock); > + /* Iterate all the existing mlx5 devices. */ > + LIST_FOREACH(priv, dev_list, mem_event_cb) > + mlx5_mr_mem_event_free_cb(eth_dev(priv), addr, > len); > + rte_rwlock_write_unlock(&mlx5_shared_data- > >mem_event_rwlock); > + break; > + case RTE_MEM_EVENT_ALLOC: > + default: > + break; > + } > +} > + > +/** > + * Look up address in the global MR cache table. If not found, create a = new > MR. > + * Insert the found/created entry to local bottom-half cache table. > + * > + * @param dev > + * Pointer to Ethernet device. > + * @param mr_ctrl > + * Pointer to per-queue MR control structure. > + * @param[out] entry > + * Pointer to returning MR cache entry, found in the global cache or n= ewly > + * created. If failed to create one, this is not written. > + * @param addr > + * Search key. > + * > + * @return > + * Searched LKey on success, UINT32_MAX on no match. 
> + */ > +static uint32_t > +mlx5_mr_lookup_dev(struct rte_eth_dev *dev, struct mlx5_mr_ctrl > *mr_ctrl, > + struct mlx5_mr_cache *entry, uintptr_t addr) > +{ > + struct priv *priv =3D dev->data->dev_private; > + struct mlx5_mr_btree *bt =3D &mr_ctrl->cache_bh; > + uint16_t idx; > + uint32_t lkey; > + > + /* If local cache table is full, try to double it. */ > + if (unlikely(bt->len =3D=3D bt->size)) > + mr_btree_expand(bt, bt->size << 1); > + /* Look up in the global cache. */ > + rte_rwlock_read_lock(&priv->mr.rwlock); > + lkey =3D mr_btree_lookup(&priv->mr.cache, &idx, addr); > + if (lkey !=3D UINT32_MAX) { > + /* Found. */ > + *entry =3D (*priv->mr.cache.table)[idx]; > + rte_rwlock_read_unlock(&priv->mr.rwlock); > + /* > + * Update local cache. Even if it fails, return the found entry > + * to update top-half cache. Next time, this entry will be > found > + * in the global cache. > + */ > + mr_btree_insert(bt, entry); > + return lkey; > + } > + rte_rwlock_read_unlock(&priv->mr.rwlock); > + /* First time to see the address? Create a new MR. */ > + lkey =3D mlx5_mr_create(dev, entry, addr); Shouldn't we check if the add is not in the global mr list? For the case th= e global cache overflows?=20 > + /* > + * Update the local cache if successfully created a new global MR. > Even > + * if failed to create one, there's no action to take in this datapath > + * code. As returning LKey is invalid, this will eventually make HW > + * fail. > + */ > + if (lkey !=3D UINT32_MAX) > + mr_btree_insert(bt, entry); > + return lkey; > +} > + > +/** > + * Bottom-half of LKey search on datapath. Firstly search in cache_bh[] = and > if > + * misses, search in the global MR cache table and update the new entry = to > + * per-queue local caches. > + * > + * @param dev > + * Pointer to Ethernet device. > + * @param mr_ctrl > + * Pointer to per-queue MR control structure. > + * @param addr > + * Search key. > + * > + * @return > + * Searched LKey on success, UINT32_MAX on no match. > + */ > +static uint32_t > +mlx5_mr_addr2mr_bh(struct rte_eth_dev *dev, struct mlx5_mr_ctrl > *mr_ctrl, > + uintptr_t addr) > +{ > + uint32_t lkey; > + uint16_t bh_idx =3D 0; > + /* Victim in top-half cache to replace with new entry. */ > + struct mlx5_mr_cache *repl =3D &mr_ctrl->cache[mr_ctrl->head]; > + > + /* Binary-search MR translation table. */ > + lkey =3D mr_btree_lookup(&mr_ctrl->cache_bh, &bh_idx, addr); > + /* Update top-half cache. */ > + if (likely(lkey !=3D UINT32_MAX)) { > + *repl =3D (*mr_ctrl->cache_bh.table)[bh_idx]; > + } else { > + /* > + * If missed in local lookup table, search in the global cache > + * and local cache_bh[] will be updated inside if possible. > + * Top-half cache entry will also be updated. > + */ > + lkey =3D mlx5_mr_lookup_dev(dev, mr_ctrl, repl, addr); > + if (unlikely(lkey =3D=3D UINT32_MAX)) > + return UINT32_MAX; > + } > + /* Update the most recently used entry. */ > + mr_ctrl->mru =3D mr_ctrl->head; > + /* Point to the next victim, the oldest. */ > + mr_ctrl->head =3D (mr_ctrl->head + 1) % MLX5_MR_CACHE_N; > + return lkey; > +} > + > +/** > + * Bottom-half of LKey search on Rx. > + * > + * @param rxq > + * Pointer to Rx queue structure. > + * @param addr > + * Search key. > + * > + * @return > + * Searched LKey on success, UINT32_MAX on no match. 
> + */ > +uint32_t > +mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr) > +{ > + struct mlx5_rxq_ctrl *rxq_ctrl =3D > + container_of(rxq, struct mlx5_rxq_ctrl, rxq); > + struct mlx5_mr_ctrl *mr_ctrl =3D &rxq->mr_ctrl; > + struct priv *priv =3D rxq_ctrl->priv; > + > + DRV_LOG(DEBUG, > + "Rx queue %u: miss on top-half, mru=3D%u, head=3D%u, > addr=3D%p", > + rxq_ctrl->idx, mr_ctrl->mru, mr_ctrl->head, (void *)addr); > + return mlx5_mr_addr2mr_bh(eth_dev(priv), mr_ctrl, addr); > +} Shouldn't this code path be in the mlxx5_rxq?=20 > + > +/** > + * Bottom-half of LKey search on Tx. > + * > + * @param txq > + * Pointer to Tx queue structure. > + * @param addr > + * Search key. > + * > + * @return > + * Searched LKey on success, UINT32_MAX on no match. > + */ > +uint32_t > +mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr) > +{ > + struct mlx5_txq_ctrl *txq_ctrl =3D > + container_of(txq, struct mlx5_txq_ctrl, txq); > + struct mlx5_mr_ctrl *mr_ctrl =3D &txq->mr_ctrl; > + struct priv *priv =3D txq_ctrl->priv; > + > + DRV_LOG(DEBUG, > + "Tx queue %u: miss on top-half, mru=3D%u, head=3D%u, > addr=3D%p", > + txq_ctrl->idx, mr_ctrl->mru, mr_ctrl->head, (void *)addr); > + return mlx5_mr_addr2mr_bh(eth_dev(priv), mr_ctrl, addr); > +} > + Same for txq.=20 > +/** > + * Flush all of the local cache entries. > + * > + * @param mr_ctrl > + * Pointer to per-queue MR control structure. > + */ > +void > +mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl) > +{ > + /* Reset the most-recently-used index. */ > + mr_ctrl->mru =3D 0; > + /* Reset the linear search array. */ > + mr_ctrl->head =3D 0; > + memset(mr_ctrl->cache, 0, sizeof(mr_ctrl->cache)); > + /* Reset the B-tree table. */ > + mr_ctrl->cache_bh.len =3D 1; > + mr_ctrl->cache_bh.overflow =3D 0; > + /* Update the generation number. */ > + mr_ctrl->cur_gen =3D *mr_ctrl->dev_gen_ptr; > + DRV_LOG(DEBUG, "mr_ctrl(%p): flushed, cur_gen=3D%d", > + (void *)mr_ctrl, mr_ctrl->cur_gen); > +} > + > +/* Called during rte_mempool_mem_iter() by mlx5_mr_update_mp(). */ > +static void > +mlx5_mr_update_mp_cb(struct rte_mempool *mp __rte_unused, void > *opaque, > + struct rte_mempool_memhdr *memhdr, > + unsigned mem_idx __rte_unused) > +{ > + struct mr_update_mp_data *data =3D opaque; > + uint32_t lkey; > + > + /* Stop iteration if failed in the previous walk. */ > + if (data->ret < 0) > + return; > + /* Register address of the chunk and update local caches. */ > + lkey =3D mlx5_mr_addr2mr_bh(data->dev, data->mr_ctrl, > + (uintptr_t)memhdr->addr); > + if (lkey =3D=3D UINT32_MAX) > + data->ret =3D -1; > +} > + > +/** > + * Register entire memory chunks in a Mempool. > + * > + * @param dev > + * Pointer to Ethernet device. > + * @param mr_ctrl > + * Pointer to per-queue MR control structure. > + * @param mp > + * Pointer to registering Mempool. > + * > + * @return > + * 0 on success, -1 on failure. > + */ > +int > +mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl > *mr_ctrl, > + struct rte_mempool *mp) > +{ > + struct mr_update_mp_data data =3D { > + .dev =3D dev, > + .mr_ctrl =3D mr_ctrl, > + .ret =3D 0, > + }; > + > + rte_mempool_mem_iter(mp, mlx5_mr_update_mp_cb, &data); > + return data.ret; > +} > + > +/** > + * Dump all the created MRs and the global cache entries. > + * > + * @param dev > + * Pointer to Ethernet device. 
> + */ > +void > +mlx5_mr_dump_dev(struct rte_eth_dev *dev) > +{ > + struct priv *priv =3D dev->data->dev_private; > + struct mlx5_mr *mr; > + int mr_n =3D 0; > + int chunk_n =3D 0; > + > + rte_rwlock_read_lock(&priv->mr.rwlock); > + /* Iterate all the existing MRs. */ > + LIST_FOREACH(mr, &priv->mr.mr_list, mr) { > + unsigned int n; > + > + DRV_LOG(DEBUG, > + "port %u MR[%u], LKey =3D 0x%x, ms_n =3D %u, > ms_bmp_n =3D %u", > + dev->data->port_id, mr_n++, > + rte_cpu_to_be_32(mr->ibv_mr->lkey), > + mr->ms_n, mr->ms_bmp_n); > + if (mr->ms_n =3D=3D 0) > + continue; > + for (n =3D 0; n < mr->ms_bmp_n; ) { > + struct mlx5_mr_cache ret =3D { 0, }; > + > + n =3D mr_find_next_chunk(mr, &ret, n); > + if (!ret.end) > + break; > + DRV_LOG(DEBUG, " chunk[%u], [0x%lx, 0x%lx)", > + chunk_n++, ret.start, ret.end); > + } > + } > + DRV_LOG(DEBUG, "port %u dumping global cache", dev->data- > >port_id); > + mlx5_mr_btree_dump(&priv->mr.cache); > + rte_rwlock_read_unlock(&priv->mr.rwlock); > +} > + > +/** > + * Release all the created MRs and resources. Remove device from memory > callback > + * list. > + * > + * @param dev > + * Pointer to Ethernet device. > + */ > +void > +mlx5_mr_release(struct rte_eth_dev *dev) > +{ > + struct priv *priv =3D dev->data->dev_private; > + struct mlx5_mr *mr_next =3D LIST_FIRST(&priv->mr.mr_list); > + > + /* Remove from memory callback device list. */ > + rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock); > + LIST_REMOVE(priv, mem_event_cb); > + rte_rwlock_write_unlock(&mlx5_shared_data- > >mem_event_rwlock); > + if (rte_log_get_level(mlx5_logtype) =3D=3D RTE_LOG_DEBUG) > + mlx5_mr_dump_dev(dev); > + rte_rwlock_write_lock(&priv->mr.rwlock); > + /* Detach from MR list and move to free list. */ > + while (mr_next !=3D NULL) { > + struct mlx5_mr *mr =3D mr_next; > + > + mr_next =3D LIST_NEXT(mr, mr); > + LIST_REMOVE(mr, mr); > + LIST_INSERT_HEAD(&priv->mr.mr_free_list, mr, mr); > + } > + LIST_INIT(&priv->mr.mr_list); > + /* Free global cache. */ > + mlx5_mr_btree_free(&priv->mr.cache); > + rte_rwlock_write_unlock(&priv->mr.rwlock); > + /* Free all remaining MRs. */ > + mlx5_mr_garbage_collect(dev); > +} > diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h > new file mode 100644 > index 000000000..a0a0ef755 > --- /dev/null > +++ b/drivers/net/mlx5/mlx5_mr.h > @@ -0,0 +1,121 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright 2018 6WIND S.A. > + * Copyright 2018 Mellanox Technologies, Ltd > + */ > + > +#ifndef RTE_PMD_MLX5_MR_H_ > +#define RTE_PMD_MLX5_MR_H_ > + > +#include > +#include > +#include > + > +/* Verbs header. */ > +/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */ > +#ifdef PEDANTIC > +#pragma GCC diagnostic ignored "-Wpedantic" > +#endif > +#include > +#include > +#ifdef PEDANTIC > +#pragma GCC diagnostic error "-Wpedantic" > +#endif > + > +#include > +#include > +#include > +#include > + > +/* Memory Region object. */ > +struct mlx5_mr { > + LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */ > + struct ibv_mr *ibv_mr; /* Verbs Memory Region. */ > + const struct rte_memseg_list *msl; > + int ms_base_idx; /* Start index of msl->memseg_arr[]. */ > + int ms_n; /* Number of memsegs in use. */ > + uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */ > + struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonged to > MR. */ > +}; > + > +/* Cache entry for Memory Region. */ > +struct mlx5_mr_cache { > + uintptr_t start; /* Start address of MR. */ > + uintptr_t end; /* End address of MR. 
*/ > + uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */ > +} __rte_packed; > + > +/* MR Cache table for Binary search. */ > +struct mlx5_mr_btree { > + uint16_t len; /* Number of entries. */ > + uint16_t size; /* Total number of entries. */ > + int overflow; /* Mark failure of table expansion. */ > + struct mlx5_mr_cache (*table)[]; > +} __rte_packed; > + > +/* Per-queue MR control descriptor. */ > +struct mlx5_mr_ctrl { > + uint32_t *dev_gen_ptr; /* Generation number of device to poll. */ > + uint32_t cur_gen; /* Generation number saved to flush caches. */ > + uint16_t mru; /* Index of last hit entry in top-half cache. */ > + uint16_t head; /* Index of the oldest entry in top-half cache. */ > + struct mlx5_mr_cache cache[MLX5_MR_CACHE_N]; /* Cache for > top-half. */ > + struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */ > +} __rte_packed; > + > +/* First entry must be NULL for comparison. */ > +#define MR_N(n) ((n) - 1) > + > +/* Whether there's only one entry in MR lookup table. */ > +#define IS_SINGLE_MR(n) (MR_N(n) =3D=3D 1) MLX5_IS_SINGLE_MR > + > +extern struct mlx5_dev_list mlx5_mem_event_cb_list; > +extern rte_rwlock_t mlx5_mem_event_rwlock; > + > +void mlx5_mr_free(struct rte_eth_dev *dev, struct mlx5_mr *mr); > +int mlx5_mr_btree_init(struct mlx5_mr_btree *bt, int n, int socket); > +void mlx5_mr_btree_free(struct mlx5_mr_btree *bt); > +void mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const > void *addr, > + size_t len); > +int mlx5_mr_update_mp(struct rte_eth_dev *dev, struct mlx5_mr_ctrl > *mr_ctrl, > + struct rte_mempool *mp); > +void mlx5_mr_dump_dev(struct rte_eth_dev *dev); > +void mlx5_mr_release(struct rte_eth_dev *dev); > + > +/** > + * Look up LKey from given lookup table by linear search. Firstly look u= p the > + * last-hit entry. If miss, the entire array is searched. If found, upda= te the > + * last-hit index and return LKey. > + * > + * @param lkp_tbl > + * Pointer to lookup table. > + * @param[in,out] cached_idx > + * Pointer to last-hit index. > + * @param n > + * Size of lookup table. > + * @param addr > + * Search key. > + * > + * @return > + * Searched LKey on success, UINT32_MAX on no match. > + */ > +static __rte_always_inline uint32_t > +mlx5_mr_lookup_cache(struct mlx5_mr_cache *lkp_tbl, uint16_t > *cached_idx, > + uint16_t n, uintptr_t addr) > +{ > + uint16_t idx; > + > + if (likely(addr >=3D lkp_tbl[*cached_idx].start && > + addr < lkp_tbl[*cached_idx].end)) > + return lkp_tbl[*cached_idx].lkey; > + for (idx =3D 0; idx < n && lkp_tbl[idx].start !=3D 0; ++idx) { > + if (addr >=3D lkp_tbl[idx].start && > + addr < lkp_tbl[idx].end) { > + /* Found. */ > + *cached_idx =3D idx; > + return lkp_tbl[idx].lkey; > + } > + } > + return UINT32_MAX; > +} > + > +#endif /* RTE_PMD_MLX5_MR_H_ */ > diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c > index d4fe1fed7..22e2f9673 100644 > --- a/drivers/net/mlx5/mlx5_rxq.c > +++ b/drivers/net/mlx5/mlx5_rxq.c > @@ -789,7 +789,7 @@ mlx5_rxq_ibv_new(struct rte_eth_dev *dev, > uint16_t idx) > .addr =3D rte_cpu_to_be_64(rte_pktmbuf_mtod(buf, > uintptr_t)), > .byte_count =3D rte_cpu_to_be_32(DATA_LEN(buf)), > - .lkey =3D UINT32_MAX, > + .lkey =3D mlx5_rx_mb2mr(rxq_data, buf), > }; > } > rxq_data->rq_db =3D rwq.dbrec; > @@ -967,6 +967,11 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t > idx, uint16_t desc, > rte_errno =3D ENOMEM; > return NULL; > } > + if (mlx5_mr_btree_init(&tmpl->rxq.mr_ctrl.cache_bh, > + MLX5_MR_BTREE_CACHE_N, socket)) { > + /* rte_errno is already set. 
*/ > + goto error; > + } > tmpl->socket =3D socket; > if (dev->data->dev_conf.intr_conf.rxq) > tmpl->irq =3D 1; > @@ -1120,6 +1125,7 @@ mlx5_rxq_release(struct rte_eth_dev *dev, > uint16_t idx) > DRV_LOG(DEBUG, "port %u Rx queue %u: refcnt %d", > dev->data->port_id, rxq_ctrl->idx, > rte_atomic32_read(&rxq_ctrl->refcnt)); > + mlx5_mr_btree_free(&rxq_ctrl->rxq.mr_ctrl.cache_bh); > LIST_REMOVE(rxq_ctrl, next); > rte_free(rxq_ctrl); > (*priv->rxqs)[idx] =3D NULL; > diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c > index 56c243495..8a863c157 100644 > --- a/drivers/net/mlx5/mlx5_rxtx.c > +++ b/drivers/net/mlx5/mlx5_rxtx.c > @@ -1965,6 +1965,9 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf > **pkts, uint16_t pkts_n) > * changes. > */ > wqe->addr =3D rte_cpu_to_be_64(rte_pktmbuf_mtod(rep, > uintptr_t)); > + /* If there's only one MR, no need to replace LKey in WQE. */ > + if (unlikely(!IS_SINGLE_MR(rxq->mr_ctrl.cache_bh.len))) > + wqe->lkey =3D mlx5_rx_mb2mr(rxq, rep); > if (len > DATA_LEN(seg)) { > len -=3D DATA_LEN(seg); > ++NB_SEGS(pkt); > diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h > index e8cad51aa..74581cf9b 100644 > --- a/drivers/net/mlx5/mlx5_rxtx.h > +++ b/drivers/net/mlx5/mlx5_rxtx.h > @@ -29,6 +29,7 @@ >=20 > #include "mlx5_utils.h" > #include "mlx5.h" > +#include "mlx5_mr.h" > #include "mlx5_autoconf.h" > #include "mlx5_defs.h" > #include "mlx5_prm.h" > @@ -81,6 +82,7 @@ struct mlx5_rxq_data { > uint16_t rq_ci; > uint16_t rq_pi; > uint16_t cq_ci; > + struct mlx5_mr_ctrl mr_ctrl; /* MR control descriptor. */ > volatile struct mlx5_wqe_data_seg(*wqes)[]; > volatile struct mlx5_cqe(*cqes)[]; > struct rxq_zip zip; /* Compressed context. */ > @@ -109,8 +111,8 @@ struct mlx5_rxq_ibv { > struct mlx5_rxq_ctrl { > LIST_ENTRY(mlx5_rxq_ctrl) next; /* Pointer to the next element. */ > rte_atomic32_t refcnt; /* Reference counter. */ > - struct priv *priv; /* Back pointer to private data. */ > struct mlx5_rxq_ibv *ibv; /* Verbs elements. */ > + struct priv *priv; /* Back pointer to private data. */ > struct mlx5_rxq_data rxq; /* Data path structure. */ > unsigned int socket; /* CPU socket ID for allocations. */ > uint32_t tunnel_types[16]; /* Tunnel type counter. */ > @@ -165,6 +167,7 @@ struct mlx5_txq_data { > uint16_t inline_max_packet_sz; /* Max packet size for inlining. */ > uint32_t qp_num_8s; /* QP number shifted by 8. */ > uint64_t offloads; /* Offloads for Tx Queue. */ > + struct mlx5_mr_ctrl mr_ctrl; /* MR control descriptor. */ > volatile struct mlx5_cqe (*cqes)[]; /* Completion queue. */ > volatile void *wqes; /* Work queue (use volatile to write into). */ > volatile uint32_t *qp_db; /* Work queue doorbell. */ > @@ -187,11 +190,11 @@ struct mlx5_txq_ibv { > struct mlx5_txq_ctrl { > LIST_ENTRY(mlx5_txq_ctrl) next; /* Pointer to the next element. */ > rte_atomic32_t refcnt; /* Reference counter. */ > - struct priv *priv; /* Back pointer to private data. */ > unsigned int socket; /* CPU socket ID for allocations. */ > unsigned int max_inline_data; /* Max inline data. */ > unsigned int max_tso_header; /* Max TSO header size. */ > struct mlx5_txq_ibv *ibv; /* Verbs queue object. */ > + struct priv *priv; /* Back pointer to private data. */ > struct mlx5_txq_data txq; /* Data path structure. */ > off_t uar_mmap_offset; /* UAR mmap offset for non-primary > process. */ > volatile void *bf_reg_orig; /* Blueflame register from verbs. 
> diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> index e8cad51aa..74581cf9b 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.h
> +++ b/drivers/net/mlx5/mlx5_rxtx.h
> @@ -29,6 +29,7 @@
>
> #include "mlx5_utils.h"
> #include "mlx5.h"
> +#include "mlx5_mr.h"
> #include "mlx5_autoconf.h"
> #include "mlx5_defs.h"
> #include "mlx5_prm.h"
> @@ -81,6 +82,7 @@ struct mlx5_rxq_data {
> 	uint16_t rq_ci;
> 	uint16_t rq_pi;
> 	uint16_t cq_ci;
> +	struct mlx5_mr_ctrl mr_ctrl; /* MR control descriptor. */
> 	volatile struct mlx5_wqe_data_seg(*wqes)[];
> 	volatile struct mlx5_cqe(*cqes)[];
> 	struct rxq_zip zip; /* Compressed context. */
> @@ -109,8 +111,8 @@ struct mlx5_rxq_ibv {
> struct mlx5_rxq_ctrl {
> 	LIST_ENTRY(mlx5_rxq_ctrl) next; /* Pointer to the next element. */
> 	rte_atomic32_t refcnt; /* Reference counter. */
> -	struct priv *priv; /* Back pointer to private data. */
> 	struct mlx5_rxq_ibv *ibv; /* Verbs elements. */
> +	struct priv *priv; /* Back pointer to private data. */
> 	struct mlx5_rxq_data rxq; /* Data path structure. */
> 	unsigned int socket; /* CPU socket ID for allocations. */
> 	uint32_t tunnel_types[16]; /* Tunnel type counter. */
> @@ -165,6 +167,7 @@ struct mlx5_txq_data {
> 	uint16_t inline_max_packet_sz; /* Max packet size for inlining. */
> 	uint32_t qp_num_8s; /* QP number shifted by 8. */
> 	uint64_t offloads; /* Offloads for Tx Queue. */
> +	struct mlx5_mr_ctrl mr_ctrl; /* MR control descriptor. */
> 	volatile struct mlx5_cqe (*cqes)[]; /* Completion queue. */
> 	volatile void *wqes; /* Work queue (use volatile to write into). */
> 	volatile uint32_t *qp_db; /* Work queue doorbell. */
> @@ -187,11 +190,11 @@ struct mlx5_txq_ibv {
> struct mlx5_txq_ctrl {
> 	LIST_ENTRY(mlx5_txq_ctrl) next; /* Pointer to the next element. */
> 	rte_atomic32_t refcnt; /* Reference counter. */
> -	struct priv *priv; /* Back pointer to private data. */
> 	unsigned int socket; /* CPU socket ID for allocations. */
> 	unsigned int max_inline_data; /* Max inline data. */
> 	unsigned int max_tso_header; /* Max TSO header size. */
> 	struct mlx5_txq_ibv *ibv; /* Verbs queue object. */
> +	struct priv *priv; /* Back pointer to private data. */
> 	struct mlx5_txq_data txq; /* Data path structure. */
> 	off_t uar_mmap_offset; /* UAR mmap offset for non-primary process. */
> 	volatile void *bf_reg_orig; /* Blueflame register from verbs. */
> @@ -308,6 +311,12 @@ uint16_t mlx5_tx_burst_vec(void *dpdk_txq, struct rte_mbuf **pkts,
> uint16_t mlx5_rx_burst_vec(void *dpdk_txq, struct rte_mbuf **pkts,
> 			   uint16_t pkts_n);
>
> +/* mlx5_mr.c */
> +
> +void mlx5_mr_flush_local_cache(struct mlx5_mr_ctrl *mr_ctrl);
> +uint32_t mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr);
> +uint32_t mlx5_tx_addr2mr_bh(struct mlx5_txq_data *txq, uintptr_t addr);
> +
> #ifndef NDEBUG
> /**
>  * Verify or set magic value in CQE.
> @@ -493,14 +502,66 @@ mlx5_tx_complete(struct mlx5_txq_data *txq)
> 	*txq->cq_db = rte_cpu_to_be_32(cq_ci);
> }
>
> +/**
> + * Query LKey from a packet buffer for Rx. No need to flush local caches for Rx
> + * as mempool is pre-configured and static.
> + *
> + * @param rxq
> + *   Pointer to Rx queue structure.
> + * @param addr
> + *   Address to search.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> static __rte_always_inline uint32_t
> -mlx5_tx_mb2mr(struct mlx5_txq_data *txq, struct rte_mbuf *mb)
> +mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
> {
> -	(void)txq;
> -	(void)mb;
> -	return UINT32_MAX;
> +	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
> +	uint32_t lkey;
> +
> +	/* Linear search on MR cache array. */
> +	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
> +				    MLX5_MR_CACHE_N, addr);
> +	if (likely(lkey != UINT32_MAX))
> +		return lkey;
> +	/* Take slower bottom-half (Binary Search) on miss. */
> +	return mlx5_rx_addr2mr_bh(rxq, addr);
> }
>
> +#define mlx5_rx_mb2mr(rxq, mb) mlx5_rx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
> +
> +/**
> + * Query LKey from a packet buffer for Tx. If not found, add the mempool.
> + *
> + * @param txq
> + *   Pointer to Tx queue structure.
> + * @param addr
> + *   Address to search.
> + *
> + * @return
> + *   Searched LKey on success, UINT32_MAX on no match.
> + */
> +static __rte_always_inline uint32_t
> +mlx5_tx_addr2mr(struct mlx5_txq_data *txq, uintptr_t addr)
> +{
> +	struct mlx5_mr_ctrl *mr_ctrl = &txq->mr_ctrl;
> +	uint32_t lkey;
> +
> +	/* Check generation bit to see if there's any change on existing MRs. */
> +	if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen))
> +		mlx5_mr_flush_local_cache(mr_ctrl);
> +	/* Linear search on MR cache array. */
> +	lkey = mlx5_mr_lookup_cache(mr_ctrl->cache, &mr_ctrl->mru,
> +				    MLX5_MR_CACHE_N, addr);
> +	if (likely(lkey != UINT32_MAX))
> +		return lkey;
> +	/* Take slower bottom-half (binary search) on miss. */
> +	return mlx5_tx_addr2mr_bh(txq, addr);
> +}
> +
> +#define mlx5_tx_mb2mr(rxq, mb) mlx5_tx_addr2mr(rxq, (uintptr_t)((mb)->buf_addr))
> +
> /**
>  * Ring TX queue doorbell and flush the update if requested.
>  *
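
The *dev_gen_ptr/cur_gen check in mlx5_tx_addr2mr() is exactly the control-thread/datapath synchronization I am worried about, so let me spell out the handshake as I understand it. This is a stand-in sketch with hypothetical names, not the PMD code; in the patch the flush is mlx5_mr_flush_local_cache() and the generation counter is the device-level one that dev_gen_ptr points to:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins for the per-device and per-queue state. */
struct dev_state {
	uint32_t dev_gen;	/* Bumped by the control thread on MR change. */
};

struct queue_cache {
	uint32_t *dev_gen_ptr;	/* Points at dev_state.dev_gen. */
	uint32_t cur_gen;	/* Generation the local cache was built for. */
	uint32_t entries[8];	/* Toy stand-in for the per-queue caches. */
};

/* Control thread: after creating or freeing an MR, publish the change.
 * The PMD presumably pairs this with the proper barriers/locking. */
static void
control_mr_changed(struct dev_state *dev)
{
	dev->dev_gen++;
}

/* Datapath: before trusting the local cache, compare generations and flush
 * on mismatch, like the check at the top of mlx5_tx_addr2mr(). */
static void
datapath_lookup_prologue(struct queue_cache *q)
{
	if (*q->dev_gen_ptr != q->cur_gen) {
		memset(q->entries, 0, sizeof(q->entries)); /* Flush. */
		q->cur_gen = *q->dev_gen_ptr;
	}
}

int
main(void)
{
	struct dev_state dev = { .dev_gen = 0 };
	struct queue_cache q = { .dev_gen_ptr = &dev.dev_gen };

	datapath_lookup_prologue(&q);	/* No change: cache kept. */
	control_mr_changed(&dev);	/* MR created/freed elsewhere. */
	datapath_lookup_prologue(&q);	/* Mismatch: cache flushed. */
	printf("cur_gen=%" PRIu32 "\n", q.cur_gen);
	return 0;
}

Please document who increments the generation number, under which lock, and why a stale LKey can never be returned in the window between the increment and the flush.
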
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
> index 56c5a1b0c..76678a820 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
> @@ -99,9 +99,13 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq, uint16_t n)
> 		rxq->stats.rx_nombuf += n;
> 		return;
> 	}
> -	for (i = 0; i < n; ++i)
> +	for (i = 0; i < n; ++i) {
> 		wq[i].addr = rte_cpu_to_be_64((uintptr_t)elts[i]->buf_addr +
> 					      RTE_PKTMBUF_HEADROOM);
> +		/* If there's only one MR, no need to replace LKey in WQE. */
> +		if (unlikely(!IS_SINGLE_MR(rxq->mr_ctrl.cache_bh.len)))
> +			wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
> +	}
> 	rxq->rq_ci += n;
> 	/* Prevent overflowing into consumed mbufs. */
> 	elts_idx = rxq->rq_ci & q_mask;
> diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
> index 3db6c3f35..36b7c9e2f 100644
> --- a/drivers/net/mlx5/mlx5_trigger.c
> +++ b/drivers/net/mlx5/mlx5_trigger.c
> @@ -104,9 +104,18 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
>
> 	for (i = 0; i != priv->rxqs_n; ++i) {
> 		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_get(dev, i);
> +		struct rte_mempool *mp;
>
> 		if (!rxq_ctrl)
> 			continue;
> +		/* Pre-register Rx mempool. */
> +		mp = rxq_ctrl->rxq.mp;
> +		DRV_LOG(DEBUG,
> +			"port %u Rx queue %u registering"
> +			" mp %s having %u chunks",
> +			dev->data->port_id, rxq_ctrl->idx,
> +			mp->name, mp->nb_mem_chunks);
> +		mlx5_mr_update_mp(dev, &rxq_ctrl->rxq.mr_ctrl, mp);
> 		ret = rxq_alloc_elts(rxq_ctrl);
> 		if (ret)
> 			goto error;
> @@ -154,6 +163,8 @@ mlx5_dev_start(struct rte_eth_dev *dev)
> 			dev->data->port_id, strerror(rte_errno));
> 		goto error;
> 	}
> +	if (rte_log_get_level(mlx5_logtype) == RTE_LOG_DEBUG)
> +		mlx5_mr_dump_dev(dev);
> 	ret = mlx5_rx_intr_vec_enable(dev);
> 	if (ret) {
> 		DRV_LOG(ERR, "port %u Rx interrupt vector creation failed",
> diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
> index a71f3d0f0..9ce6f2098 100644
> --- a/drivers/net/mlx5/mlx5_txq.c
> +++ b/drivers/net/mlx5/mlx5_txq.c
> @@ -804,6 +804,13 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
> 		rte_errno = ENOMEM;
> 		return NULL;
> 	}
> +	if (mlx5_mr_btree_init(&tmpl->txq.mr_ctrl.cache_bh,
> +			       MLX5_MR_BTREE_CACHE_N, socket)) {
> +		/* rte_errno is already set. */
> +		goto error;
> +	}
> +	/* Save pointer of global generation number to check memory event. */
> +	tmpl->txq.mr_ctrl.dev_gen_ptr = &priv->mr.dev_gen;
> 	assert(desc > MLX5_TX_COMP_THRESH);
> 	tmpl->txq.offloads = conf->offloads;
> 	tmpl->priv = priv;
> @@ -823,6 +830,9 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
> 		idx, rte_atomic32_read(&tmpl->refcnt));
> 	LIST_INSERT_HEAD(&priv->txqsctrl, tmpl, next);
> 	return tmpl;
> +error:
> +	rte_free(tmpl);
> +	return NULL;
> }
>
> /**
> @@ -882,6 +892,7 @@ mlx5_txq_release(struct rte_eth_dev *dev, uint16_t idx)
> 		dev->data->port_id, txq->idx,
> 		rte_atomic32_read(&txq->refcnt));
> 	txq_free_elts(txq);
> +	mlx5_mr_btree_free(&txq->txq.mr_ctrl.cache_bh);
> 	LIST_REMOVE(txq, next);
> 	rte_free(txq);
> 	(*priv->txqs)[idx] = NULL;
> --
> 2.11.0
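
One more note on the Rx side: mlx5_rxq_start() now pre-registers the queue's mempool before traffic starts, and the debug log prints nb_mem_chunks because the mempool can be split over several memory chunks, each of which has to be covered by its own MR. I assume mlx5_mr_update_mp() walks those chunks roughly like the sketch below (hypothetical callback and function names, the real logic is in mlx5_mr.c); if so, please state in the documentation that this walk happens on the control path only:

#include <stdio.h>
#include <rte_mempool.h>

/* Hypothetical per-chunk hook: each memory chunk of the mempool is a
 * separate [addr, addr + len) range to cover with an MR. */
static void
mr_register_chunk_cb(struct rte_mempool *mp, void *opaque,
		     struct rte_mempool_memhdr *memhdr, unsigned int mem_idx)
{
	(void)opaque;
	printf("mp %s chunk %u: addr=%p len=%zu\n",
	       mp->name, mem_idx, memhdr->addr, memhdr->len);
	/* Here the PMD would create or look up an MR covering this range. */
}

/* Walk every chunk of the mempool, as I assume mlx5_mr_update_mp() does. */
static void
pre_register_mempool(struct rte_mempool *mp)
{
	rte_mempool_mem_iter(mp, mr_register_chunk_cb, NULL);
}
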