DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask
@ 2018-08-31 12:50 Alejandro Lucero
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 1/5] mem: add function for checking memsegs IOVAs addresses Alejandro Lucero
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Alejandro Lucero @ 2018-08-31 12:50 UTC (permalink / raw)
  To: dev; +Cc: stable

I sent a patchset about this to be applied on 17.11 stable. The memory
code has had main changes since that version, so here it is the patchset
adjusted to current master repo.

This patchset adds, mainly, a check for ensuring IOVAs are within a
restricted range due to addressing limitations with some devices. There
are two known cases: NFP and IOMMU VT-d emulation.

With this check IOVAs out of range are detected and PMDs can abort
initialization. For the VT-d case, IOVA VA mode is allowed as long as
IOVAs are within the supported range, avoiding to forbid IOVA VA by
default.

For the addressing limitations known cases, there are just 40(NFP) or
39(VT-d) bits for handling IOVAs. When using IOVA PA, those limitations
imply 1TB(NFP) or 512M(VT-d) as upper limits, which is likely enough for
most systems. With machines using more memory, the added check will
ensure IOVAs within the range.

With IOVA VA, and because the way the Linux kernel serves mmap calls
in 64 bits systems, 39 or 40 bits are not enough. It is possible to
give an address hint with a lower starting address than the default one
used by the kernel, and then ensuring the mmap uses that hint or hint plus
some offset. With 64 bits systems, the process virtual address space is
large enoguh for doing the hugepages mmaping within the supported range
when those addressing limitations exist. This patchset also adds a change
for using such a hint making the use of IOVA VA a more than likely
possibility when there are those addressing limitations.

The check is not done by default but just when it is required. This
patchset adds the check for NFP initialization and for setting the IOVA
mode is an emulated VT-d is detected. Also, because the recent patchset
adding dynamic memory allocation, the check is also invoked for ensuring
the new memsegs are within the required range.

This patchset could be applied to stable 18.05.

v2:
 - fix problem with max dma mask definition

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH v2 1/5] mem: add function for checking memsegs IOVAs addresses
  2018-08-31 12:50 [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask Alejandro Lucero
@ 2018-08-31 12:50 ` Alejandro Lucero
  2018-10-03 12:43   ` Burakov, Anatoly
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 2/5] mem: use address hint for mapping hugepages Alejandro Lucero
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Alejandro Lucero @ 2018-08-31 12:50 UTC (permalink / raw)
  To: dev; +Cc: stable

A device can suffer addressing limitations. This functions checks
memsegs have iovas within the supported range based on dma mask.

PMD should use this during initialization if supported devices
suffer addressing limitations, returning an error if this function
returns memsegs out of range.

Another potential usage is for emulated IOMMU hardware with addressing
limitations.

It is necessary to save the most restricted dma mask for checking
memory allocated dynamically after initialization.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
---
 lib/librte_eal/common/eal_common_memory.c         | 56 +++++++++++++++++++++++
 lib/librte_eal/common/include/rte_eal_memconfig.h |  3 ++
 lib/librte_eal/common/include/rte_memory.h        |  3 ++
 lib/librte_eal/common/malloc_heap.c               | 12 +++++
 lib/librte_eal/linuxapp/eal/eal.c                 |  2 +
 lib/librte_eal/rte_eal_version.map                |  1 +
 6 files changed, 77 insertions(+)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index fbfb1b0..bdd8f44 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -383,6 +383,62 @@ struct virtiova {
 	rte_memseg_walk(dump_memseg, f);
 }
 
+static int
+check_iova(const struct rte_memseg_list *msl __rte_unused,
+		const struct rte_memseg *ms, void *arg)
+{
+	uint64_t *mask = arg;
+	rte_iova_t iova;
+
+	/* higher address within segment */
+	iova = (ms->iova + ms->len) - 1;
+	if (!(iova & *mask))
+		return 0;
+
+	RTE_LOG(INFO, EAL, "memseg iova %"PRIx64", len %zx, out of range\n",
+			   ms->iova, ms->len);
+
+	RTE_LOG(INFO, EAL, "\tusing dma mask %"PRIx64"\n", *mask);
+	/* Stop the walk and change mask */
+	*mask = 0;
+	return 1;
+}
+
+#if defined(RTE_ARCH_64)
+#define MAX_DMA_MASK_BITS 63
+#else
+#define MAX_DMA_MASK_BITS 31
+#endif
+
+/* check memseg iovas are within the required range based on dma mask */
+int __rte_experimental
+rte_eal_check_dma_mask(uint8_t maskbits)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	uint64_t mask;
+
+	/* sanity check */
+	if (maskbits > MAX_DMA_MASK_BITS) {
+		RTE_LOG(INFO, EAL, "wrong dma mask size %u (Max: %u)\n",
+				   maskbits, MAX_DMA_MASK_BITS);
+		return -1;
+	}
+
+	/* keep the more restricted maskbit */
+	if (!mcfg->dma_maskbits || maskbits < mcfg->dma_maskbits)
+		mcfg->dma_maskbits = maskbits;
+
+	/* create dma mask */
+	mask = ~((1ULL << maskbits) - 1);
+
+	rte_memseg_walk(check_iova, &mask);
+
+	if (!mask)
+		return -1;
+
+	return 0;
+}
+
 /* return the number of memory channels */
 unsigned rte_memory_get_nchannel(void)
 {
diff --git a/lib/librte_eal/common/include/rte_eal_memconfig.h b/lib/librte_eal/common/include/rte_eal_memconfig.h
index aff0688..aea44cb 100644
--- a/lib/librte_eal/common/include/rte_eal_memconfig.h
+++ b/lib/librte_eal/common/include/rte_eal_memconfig.h
@@ -77,6 +77,9 @@ struct rte_mem_config {
 	 * exact same address the primary process maps it.
 	 */
 	uint64_t mem_cfg_addr;
+
+	/* keeps the more restricted dma mask */
+	uint8_t dma_maskbits;
 } __attribute__((__packed__));
 
 
diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index c4b7f4c..cd439e3 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -357,6 +357,9 @@ typedef int (*rte_memseg_list_walk_t)(const struct rte_memseg_list *msl,
  */
 unsigned rte_memory_get_nrank(void);
 
+/* check memsegs iovas are within a range based on dma mask */
+int rte_eal_check_dma_mask(uint8_t maskbits);
+
 /**
  * Drivers based on uio will not load unless physical
  * addresses are obtainable. It is only possible to get
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index 12aaf2d..255d717 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -259,11 +259,13 @@ struct malloc_elem *
 		int socket, unsigned int flags, size_t align, size_t bound,
 		bool contig, struct rte_memseg **ms, int n_segs)
 {
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
 	struct rte_memseg_list *msl;
 	struct malloc_elem *elem = NULL;
 	size_t alloc_sz;
 	int allocd_pages;
 	void *ret, *map_addr;
+	uint64_t mask;
 
 	alloc_sz = (size_t)pg_sz * n_segs;
 
@@ -291,6 +293,16 @@ struct malloc_elem *
 		goto fail;
 	}
 
+	if (mcfg->dma_maskbits) {
+		mask = ~((1ULL << mcfg->dma_maskbits) - 1);
+		if (rte_eal_check_dma_mask(mask)) {
+			RTE_LOG(DEBUG, EAL,
+				"%s(): couldn't allocate memory due to DMA mask\n",
+				__func__);
+			goto fail;
+		}
+	}
+
 	/* add newly minted memsegs to malloc heap */
 	elem = malloc_heap_add_memory(heap, msl, map_addr, alloc_sz);
 
diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c
index e59ac65..616723e 100644
--- a/lib/librte_eal/linuxapp/eal/eal.c
+++ b/lib/librte_eal/linuxapp/eal/eal.c
@@ -263,6 +263,8 @@ enum rte_iova_mode
 	 * processes could later map the config into this exact location */
 	rte_config.mem_config->mem_cfg_addr = (uintptr_t) rte_mem_cfg_addr;
 
+	rte_config.mem_config->dma_maskbits = 0;
+
 }
 
 /* attach to an existing shared memory config */
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index 344a43d..85e6212 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -284,6 +284,7 @@ EXPERIMENTAL {
 	rte_devargs_parsef;
 	rte_devargs_remove;
 	rte_devargs_type_count;
+	rte_eal_check_dma_mask;
 	rte_eal_cleanup;
 	rte_eal_hotplug_add;
 	rte_eal_hotplug_remove;
-- 
1.9.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH v2 2/5] mem: use address hint for mapping hugepages
  2018-08-31 12:50 [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask Alejandro Lucero
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 1/5] mem: add function for checking memsegs IOVAs addresses Alejandro Lucero
@ 2018-08-31 12:50 ` Alejandro Lucero
  2018-10-03 12:50   ` Burakov, Anatoly
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 3/5] bus/pci: use IOVAs check when setting IOVA mode Alejandro Lucero
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Alejandro Lucero @ 2018-08-31 12:50 UTC (permalink / raw)
  To: dev; +Cc: stable

Linux kernel uses a really high address as starting address for
serving mmaps calls. If there exist addressing limitations and
IOVA mode is VA, this starting address is likely too high for
those devices. However, it is possible to use a lower address in
the process virtual address space as with 64 bits there is a lot
of available space.

This patch adds an address hint as starting address for 64 bits
systems.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
---
 lib/librte_eal/common/eal_common_memory.c | 35 ++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index bdd8f44..97378b1 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -37,6 +37,23 @@
 static void *next_baseaddr;
 static uint64_t system_page_sz;
 
+#ifdef RTE_ARCH_64
+/*
+ * Linux kernel uses a really high address as starting address for serving
+ * mmaps calls. If there exists addressing limitations and IOVA mode is VA,
+ * this starting address is likely too high for those devices. However, it
+ * is possible to use a lower address in the process virtual address space
+ * as with 64 bits there is a lot of available space.
+ *
+ * Current known limitations are 39 or 40 bits. Setting the starting address
+ * at 4GB implies there are 508GB or 1020GB for mapping the available
+ * hugepages. This is likely enough for most systems, although a device with
+ * addressing limitations should call rte_eal_check_dma_mask for ensuring all
+ * memory is within supported range.
+ */
+static uint64_t baseaddr = 0x100000000;
+#endif
+
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
 		size_t page_sz, int flags, int mmap_flags)
@@ -60,6 +77,11 @@
 			rte_eal_process_type() == RTE_PROC_PRIMARY)
 		next_baseaddr = (void *) internal_config.base_virtaddr;
 
+#ifdef RTE_ARCH_64
+	if (next_baseaddr == NULL && internal_config.base_virtaddr == 0 &&
+			rte_eal_process_type() == RTE_PROC_PRIMARY)
+		next_baseaddr = (void *) baseaddr;
+#endif
 	if (requested_addr == NULL && next_baseaddr != NULL) {
 		requested_addr = next_baseaddr;
 		requested_addr = RTE_PTR_ALIGN(requested_addr, page_sz);
@@ -89,9 +111,20 @@
 
 		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_READ,
 				mmap_flags, -1, 0);
+
 		if (mapped_addr == MAP_FAILED && allow_shrink)
 			*size -= page_sz;
-	} while (allow_shrink && mapped_addr == MAP_FAILED && *size > 0);
+
+		if (mapped_addr != MAP_FAILED && addr_is_hint &&
+		    mapped_addr != requested_addr) {
+			/* hint was not used. Try with another offset */
+			munmap(mapped_addr, map_sz);
+			mapped_addr = MAP_FAILED;
+			next_baseaddr = RTE_PTR_ADD(next_baseaddr, 0x100000000);
+			requested_addr = next_baseaddr;
+		}
+	} while ((allow_shrink || addr_is_hint) &&
+		 mapped_addr == MAP_FAILED && *size > 0);
 
 	/* align resulting address - if map failed, we will ignore the value
 	 * anyway, so no need to add additional checks.
-- 
1.9.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH v2 3/5] bus/pci: use IOVAs check when setting IOVA mode
  2018-08-31 12:50 [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask Alejandro Lucero
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 1/5] mem: add function for checking memsegs IOVAs addresses Alejandro Lucero
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 2/5] mem: use address hint for mapping hugepages Alejandro Lucero
@ 2018-08-31 12:50 ` Alejandro Lucero
  2018-10-03 12:55   ` Burakov, Anatoly
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 4/5] net/nfp: check hugepages IOVAs based on DMA mask Alejandro Lucero
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Alejandro Lucero @ 2018-08-31 12:50 UTC (permalink / raw)
  To: dev; +Cc: stable

Although VT-d emulation currently only supports 39 bits, it could
be iovas being within that supported range. This patch allows
IOVA mode in such a case.

Indeed, memory initialization code can be modified for using lower
virtual addresses than those used by the kernel for 64 bits processes
by default, and therefore memsegs iovas can use 39 bits or less for
most system. And this is likely 100% true for VMs.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
---
 drivers/bus/pci/linux/pci.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index 04648ac..215dc10 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -588,10 +588,11 @@
 	fclose(fp);
 
 	mgaw = ((vtd_cap_reg & VTD_CAP_MGAW_MASK) >> VTD_CAP_MGAW_SHIFT) + 1;
-	if (mgaw < X86_VA_WIDTH)
-		return false;
 
-	return true;
+	if (!rte_eal_check_dma_mask(mgaw))
+		return true;
+	else
+		return false;
 }
 #elif defined(RTE_ARCH_PPC_64)
 static bool
@@ -615,13 +616,17 @@
 {
 	struct rte_pci_device *dev = NULL;
 	struct rte_pci_driver *drv = NULL;
+	int iommu_dma_mask_check_done = 0;
 
 	FOREACH_DRIVER_ON_PCIBUS(drv) {
 		FOREACH_DEVICE_ON_PCIBUS(dev) {
 			if (!rte_pci_match(drv, dev))
 				continue;
-			if (!pci_one_device_iommu_support_va(dev))
-				return false;
+			if (!iommu_dma_mask_check_done) {
+				if (!pci_one_device_iommu_support_va(dev))
+					return false;
+				iommu_dma_mask_check_done  = 1;
+			}
 		}
 	}
 	return true;
-- 
1.9.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH v2 4/5] net/nfp: check hugepages IOVAs based on DMA mask
  2018-08-31 12:50 [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask Alejandro Lucero
                   ` (2 preceding siblings ...)
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 3/5] bus/pci: use IOVAs check when setting IOVA mode Alejandro Lucero
@ 2018-08-31 12:50 ` Alejandro Lucero
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 5/5] net/nfp: support IOVA VA mode Alejandro Lucero
  2018-10-02 16:33 ` [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask Alejandro Lucero
  5 siblings, 0 replies; 22+ messages in thread
From: Alejandro Lucero @ 2018-08-31 12:50 UTC (permalink / raw)
  To: dev; +Cc: stable

NFP devices can not handle DMA addresses requiring more than
40 bits. This patch uses rte_dev_check_dma_mask with 40 bits
and avoids device initialization if memory out of NFP range.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
---
 drivers/net/nfp/nfp_net.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/nfp/nfp_net.c b/drivers/net/nfp/nfp_net.c
index 6e5e305..aedca31 100644
--- a/drivers/net/nfp/nfp_net.c
+++ b/drivers/net/nfp/nfp_net.c
@@ -2688,6 +2688,14 @@ uint32_t nfp_net_txq_full(struct nfp_net_txq *txq)
 
 	pci_dev = RTE_ETH_DEV_TO_PCI(eth_dev);
 
+	/* NFP can not handle DMA addresses requiring more than 40 bits */
+	if (rte_eal_check_dma_mask(40) < 0) {
+		RTE_LOG(INFO, PMD, "device %s can not be used:",
+				   pci_dev->device.name);
+		RTE_LOG(INFO, PMD, "\trestricted dma mask to 40 bits!\n");
+		return -ENODEV;
+	};
+
 	if ((pci_dev->id.device_id == PCI_DEVICE_ID_NFP4000_PF_NIC) ||
 	    (pci_dev->id.device_id == PCI_DEVICE_ID_NFP6000_PF_NIC)) {
 		port = get_pf_port_number(eth_dev->data->name);
-- 
1.9.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH v2 5/5] net/nfp: support IOVA VA mode
  2018-08-31 12:50 [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask Alejandro Lucero
                   ` (3 preceding siblings ...)
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 4/5] net/nfp: check hugepages IOVAs based on DMA mask Alejandro Lucero
@ 2018-08-31 12:50 ` Alejandro Lucero
  2018-10-02 16:33 ` [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask Alejandro Lucero
  5 siblings, 0 replies; 22+ messages in thread
From: Alejandro Lucero @ 2018-08-31 12:50 UTC (permalink / raw)
  To: dev; +Cc: stable

NFP can handle IOVA as VA. It requires to check those IOVAs
being in the supported range what is done during initialization.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
---
 drivers/net/nfp/nfp_net.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/nfp/nfp_net.c b/drivers/net/nfp/nfp_net.c
index aedca31..1a7d58c 100644
--- a/drivers/net/nfp/nfp_net.c
+++ b/drivers/net/nfp/nfp_net.c
@@ -3273,14 +3273,16 @@ static int eth_nfp_pci_remove(struct rte_pci_device *pci_dev)
 
 static struct rte_pci_driver rte_nfp_net_pf_pmd = {
 	.id_table = pci_id_nfp_pf_net_map,
-	.drv_flags = RTE_PCI_DRV_NEED_MAPPING | RTE_PCI_DRV_INTR_LSC,
+	.drv_flags = RTE_PCI_DRV_NEED_MAPPING | RTE_PCI_DRV_INTR_LSC |
+		     RTE_PCI_DRV_IOVA_AS_VA,
 	.probe = nfp_pf_pci_probe,
 	.remove = eth_nfp_pci_remove,
 };
 
 static struct rte_pci_driver rte_nfp_net_vf_pmd = {
 	.id_table = pci_id_nfp_vf_net_map,
-	.drv_flags = RTE_PCI_DRV_NEED_MAPPING | RTE_PCI_DRV_INTR_LSC,
+	.drv_flags = RTE_PCI_DRV_NEED_MAPPING | RTE_PCI_DRV_INTR_LSC |
+		     RTE_PCI_DRV_IOVA_AS_VA,
 	.probe = eth_nfp_pci_probe,
 	.remove = eth_nfp_pci_remove,
 };
-- 
1.9.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask
  2018-08-31 12:50 [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask Alejandro Lucero
                   ` (4 preceding siblings ...)
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 5/5] net/nfp: support IOVA VA mode Alejandro Lucero
@ 2018-10-02 16:33 ` Alejandro Lucero
  2018-10-02 21:21   ` Thomas Monjalon
  5 siblings, 1 reply; 22+ messages in thread
From: Alejandro Lucero @ 2018-10-02 16:33 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Burakov, Anatoly

As I was told, I sent this patchset early September (actually it was still
October) for making it for 18.11.

I wonder if this will be applied when doing the integration phase or it
will need to wait until 19.02.

On Fri, Aug 31, 2018 at 1:51 PM Alejandro Lucero <
alejandro.lucero@netronome.com> wrote:

> I sent a patchset about this to be applied on 17.11 stable. The memory
> code has had main changes since that version, so here it is the patchset
> adjusted to current master repo.
>
> This patchset adds, mainly, a check for ensuring IOVAs are within a
> restricted range due to addressing limitations with some devices. There
> are two known cases: NFP and IOMMU VT-d emulation.
>
> With this check IOVAs out of range are detected and PMDs can abort
> initialization. For the VT-d case, IOVA VA mode is allowed as long as
> IOVAs are within the supported range, avoiding to forbid IOVA VA by
> default.
>
> For the addressing limitations known cases, there are just 40(NFP) or
> 39(VT-d) bits for handling IOVAs. When using IOVA PA, those limitations
> imply 1TB(NFP) or 512M(VT-d) as upper limits, which is likely enough for
> most systems. With machines using more memory, the added check will
> ensure IOVAs within the range.
>
> With IOVA VA, and because the way the Linux kernel serves mmap calls
> in 64 bits systems, 39 or 40 bits are not enough. It is possible to
> give an address hint with a lower starting address than the default one
> used by the kernel, and then ensuring the mmap uses that hint or hint plus
> some offset. With 64 bits systems, the process virtual address space is
> large enoguh for doing the hugepages mmaping within the supported range
> when those addressing limitations exist. This patchset also adds a change
> for using such a hint making the use of IOVA VA a more than likely
> possibility when there are those addressing limitations.
>
> The check is not done by default but just when it is required. This
> patchset adds the check for NFP initialization and for setting the IOVA
> mode is an emulated VT-d is detected. Also, because the recent patchset
> adding dynamic memory allocation, the check is also invoked for ensuring
> the new memsegs are within the required range.
>
> This patchset could be applied to stable 18.05.
>
> v2:
>  - fix problem with max dma mask definition
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask
  2018-10-02 16:33 ` [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask Alejandro Lucero
@ 2018-10-02 21:21   ` Thomas Monjalon
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Monjalon @ 2018-10-02 21:21 UTC (permalink / raw)
  To: Alejandro Lucero, Burakov, Anatoly; +Cc: dev

02/10/2018 18:33, Alejandro Lucero:
> As I was told, I sent this patchset early September (actually it was still
> October) for making it for 18.11.
> 
> I wonder if this will be applied when doing the integration phase or it
> will need to wait until 19.02.

Such patchset cannot be applied after -rc1.
Anatoly, please can you review it to help its integration in 18.11-rc1?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/5] mem: add function for checking memsegs IOVAs addresses
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 1/5] mem: add function for checking memsegs IOVAs addresses Alejandro Lucero
@ 2018-10-03 12:43   ` Burakov, Anatoly
       [not found]     ` <CAD+H991m6qauwX+P=muKe6bAjNLUrcBaGbxFXkMV60OVNvRgPg@mail.gmail.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Burakov, Anatoly @ 2018-10-03 12:43 UTC (permalink / raw)
  To: Alejandro Lucero, dev; +Cc: stable

On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> A device can suffer addressing limitations. This functions checks
> memsegs have iovas within the supported range based on dma mask.
> 
> PMD should use this during initialization if supported devices
> suffer addressing limitations, returning an error if this function
> returns memsegs out of range.
> 
> Another potential usage is for emulated IOMMU hardware with addressing
> limitations.
> 
> It is necessary to save the most restricted dma mask for checking
> memory allocated dynamically after initialization.
> 
> Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
> ---
>   lib/librte_eal/common/eal_common_memory.c         | 56 +++++++++++++++++++++++
>   lib/librte_eal/common/include/rte_eal_memconfig.h |  3 ++
>   lib/librte_eal/common/include/rte_memory.h        |  3 ++
>   lib/librte_eal/common/malloc_heap.c               | 12 +++++
>   lib/librte_eal/linuxapp/eal/eal.c                 |  2 +
>   lib/librte_eal/rte_eal_version.map                |  1 +
>   6 files changed, 77 insertions(+)
> 
> diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
> index fbfb1b0..bdd8f44 100644
> --- a/lib/librte_eal/common/eal_common_memory.c
> +++ b/lib/librte_eal/common/eal_common_memory.c
> @@ -383,6 +383,62 @@ struct virtiova {
>   	rte_memseg_walk(dump_memseg, f);
>   }
>   
> +static int
> +check_iova(const struct rte_memseg_list *msl __rte_unused,
> +		const struct rte_memseg *ms, void *arg)
> +{
> +	uint64_t *mask = arg;
> +	rte_iova_t iova;
> +
> +	/* higher address within segment */
> +	iova = (ms->iova + ms->len) - 1;
> +	if (!(iova & *mask))
> +		return 0;
> +
> +	RTE_LOG(INFO, EAL, "memseg iova %"PRIx64", len %zx, out of range\n",
> +			   ms->iova, ms->len);
> +
> +	RTE_LOG(INFO, EAL, "\tusing dma mask %"PRIx64"\n", *mask);

IMO putting these as INFO is overkill. I'd prefer not to spam the output 
unless it's really important. Can this go under DEBUG?

Also, the message is misleading. You stop before you have a chance to 
check other masks, which may restrict them even further. You're 
outputting the message about using DMA mask XXX but this may not be the 
final DMA mask.

> +	/* Stop the walk and change mask */
> +	*mask = 0;
> +	return 1;
> +}
> +
> +#if defined(RTE_ARCH_64)
> +#define MAX_DMA_MASK_BITS 63
> +#else
> +#define MAX_DMA_MASK_BITS 31
> +#endif
> +
> +/* check memseg iovas are within the required range based on dma mask */
> +int __rte_experimental
> +rte_eal_check_dma_mask(uint8_t maskbits)
> +{
> +	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> +	uint64_t mask;
> +
> +	/* sanity check */
> +	if (maskbits > MAX_DMA_MASK_BITS) {
> +		RTE_LOG(INFO, EAL, "wrong dma mask size %u (Max: %u)\n",
> +				   maskbits, MAX_DMA_MASK_BITS);

Should be ERR, not INFO.

> +		return -1;
> +	}
> +
> +	/* keep the more restricted maskbit */
> +	if (!mcfg->dma_maskbits || maskbits < mcfg->dma_maskbits)
> +		mcfg->dma_maskbits = maskbits;

Do we need to modify mcfg->dma_maskbits before we know if we're going to 
fail? Suggest using a local variable maybe?

Also, i think it's a good case for ternary:

bits = mcfg->dma_maskbits == 0 ?
	maskbits :
	RTE_MIN(maskbits, mcfg->dma_maskbits);

IMO the intention looks much clearer.

> +
> +	/* create dma mask */
> +	mask = ~((1ULL << maskbits) - 1);
> +
> +	rte_memseg_walk(check_iova, &mask);
> +
> +	if (!mask)
> +		return -1;
> +
> +	return 0;
> +}
> +
>   /* return the number of memory channels */
>   unsigned rte_memory_get_nchannel(void)
>   {
> diff --git a/lib/librte_eal/common/include/rte_eal_memconfig.h b/lib/librte_eal/common/include/rte_eal_memconfig.h
> index aff0688..aea44cb 100644
> --- a/lib/librte_eal/common/include/rte_eal_memconfig.h
> +++ b/lib/librte_eal/common/include/rte_eal_memconfig.h
> @@ -77,6 +77,9 @@ struct rte_mem_config {
>   	 * exact same address the primary process maps it.
>   	 */
>   	uint64_t mem_cfg_addr;
> +
> +	/* keeps the more restricted dma mask */
> +	uint8_t dma_maskbits;

This needs to be documented as an ABI break in the 18.11 release notes.


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/5] mem: use address hint for mapping hugepages
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 2/5] mem: use address hint for mapping hugepages Alejandro Lucero
@ 2018-10-03 12:50   ` Burakov, Anatoly
  2018-10-04 11:43     ` Alejandro Lucero
  0 siblings, 1 reply; 22+ messages in thread
From: Burakov, Anatoly @ 2018-10-03 12:50 UTC (permalink / raw)
  To: Alejandro Lucero, dev; +Cc: stable

On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> Linux kernel uses a really high address as starting address for
> serving mmaps calls. If there exist addressing limitations and
> IOVA mode is VA, this starting address is likely too high for
> those devices. However, it is possible to use a lower address in
> the process virtual address space as with 64 bits there is a lot
> of available space.
> 
> This patch adds an address hint as starting address for 64 bits
> systems.
> 
> Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
> ---

<snip>

>   
>   		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_READ,
>   				mmap_flags, -1, 0);
> +
>   		if (mapped_addr == MAP_FAILED && allow_shrink)

Unintended whitespace change?

>   			*size -= page_sz;
> -	} while (allow_shrink && mapped_addr == MAP_FAILED && *size > 0);
> +
> +		if (mapped_addr != MAP_FAILED && addr_is_hint &&
> +		    mapped_addr != requested_addr) {
> +			/* hint was not used. Try with another offset */
> +			munmap(mapped_addr, map_sz);
> +			mapped_addr = MAP_FAILED;
> +			next_baseaddr = RTE_PTR_ADD(next_baseaddr, 0x100000000);

Why not increment by page size? Sure, it could take some more time to 
allocate, but will result in less wasted memory.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/5] bus/pci: use IOVAs check when setting IOVA mode
  2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 3/5] bus/pci: use IOVAs check when setting IOVA mode Alejandro Lucero
@ 2018-10-03 12:55   ` Burakov, Anatoly
  2018-10-04 13:35     ` Alejandro Lucero
  0 siblings, 1 reply; 22+ messages in thread
From: Burakov, Anatoly @ 2018-10-03 12:55 UTC (permalink / raw)
  To: Alejandro Lucero, dev; +Cc: stable

On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> Although VT-d emulation currently only supports 39 bits, it could
> be iovas being within that supported range. This patch allows
> IOVA mode in such a case.
> 
> Indeed, memory initialization code can be modified for using lower
> virtual addresses than those used by the kernel for 64 bits processes
> by default, and therefore memsegs iovas can use 39 bits or less for
> most system. And this is likely 100% true for VMs.
> 
> Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
> ---
>   drivers/bus/pci/linux/pci.c | 15 ++++++++++-----
>   1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
> index 04648ac..215dc10 100644
> --- a/drivers/bus/pci/linux/pci.c
> +++ b/drivers/bus/pci/linux/pci.c
> @@ -588,10 +588,11 @@
>   	fclose(fp);
>   
>   	mgaw = ((vtd_cap_reg & VTD_CAP_MGAW_MASK) >> VTD_CAP_MGAW_SHIFT) + 1;
> -	if (mgaw < X86_VA_WIDTH)
> -		return false;
>   
> -	return true;
> +	if (!rte_eal_check_dma_mask(mgaw))
> +		return true;
> +	else
> +		return false;

return rte_eal_check_dma_mask(mgaw) == 0; ?

>   }
>   #elif defined(RTE_ARCH_PPC_64)
>   static bool
> @@ -615,13 +616,17 @@
>   {
>   	struct rte_pci_device *dev = NULL;
>   	struct rte_pci_driver *drv = NULL;
> +	int iommu_dma_mask_check_done = 0;
>   
>   	FOREACH_DRIVER_ON_PCIBUS(drv) {
>   		FOREACH_DEVICE_ON_PCIBUS(dev) {
>   			if (!rte_pci_match(drv, dev))
>   				continue;
> -			if (!pci_one_device_iommu_support_va(dev))
> -				return false;
> +			if (!iommu_dma_mask_check_done) {
> +				if (!pci_one_device_iommu_support_va(dev))
> +					return false;
> +				iommu_dma_mask_check_done  = 1;
> +			}
>   		}

The commit message doesn't explain why are we only checking a single 
device. Indeed, i am not 100% clear as to why, so some explanation in 
the commit message and preferably a comment in code would be more than 
welcome :)

>   	}
>   	return true;
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/5] mem: use address hint for mapping hugepages
  2018-10-03 12:50   ` Burakov, Anatoly
@ 2018-10-04 11:43     ` Alejandro Lucero
  2018-10-04 12:08       ` Burakov, Anatoly
  0 siblings, 1 reply; 22+ messages in thread
From: Alejandro Lucero @ 2018-10-04 11:43 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: dev, dpdk stable

On Wed, Oct 3, 2018 at 1:50 PM Burakov, Anatoly <anatoly.burakov@intel.com>
wrote:

> On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> > Linux kernel uses a really high address as starting address for
> > serving mmaps calls. If there exist addressing limitations and
> > IOVA mode is VA, this starting address is likely too high for
> > those devices. However, it is possible to use a lower address in
> > the process virtual address space as with 64 bits there is a lot
> > of available space.
> >
> > This patch adds an address hint as starting address for 64 bits
> > systems.
> >
> > Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
> > ---
>
> <snip>
>
> >
> >               mapped_addr = mmap(requested_addr, (size_t)map_sz,
> PROT_READ,
> >                               mmap_flags, -1, 0);
> > +
> >               if (mapped_addr == MAP_FAILED && allow_shrink)
>
> Unintended whitespace change?
>
>
Yes. I'll fix it.


> >                       *size -= page_sz;
> > -     } while (allow_shrink && mapped_addr == MAP_FAILED && *size > 0);
> > +
> > +             if (mapped_addr != MAP_FAILED && addr_is_hint &&
> > +                 mapped_addr != requested_addr) {
> > +                     /* hint was not used. Try with another offset */
> > +                     munmap(mapped_addr, map_sz);
> > +                     mapped_addr = MAP_FAILED;
> > +                     next_baseaddr = RTE_PTR_ADD(next_baseaddr,
> 0x100000000);
>
> Why not increment by page size? Sure, it could take some more time to
> allocate, but will result in less wasted memory.
>
>
I though the same or even using smaller increments than hugepage size.
Increment the address in such amount does not mean we are wasting memory
but just leaving space if some mmap fails. I think it is better to leave as
much as space as possible just in case the data allocated in the conflicted
area would need to grow in the future.


> --
> Thanks,
> Anatoly
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/5] mem: use address hint for mapping hugepages
  2018-10-04 11:43     ` Alejandro Lucero
@ 2018-10-04 12:08       ` Burakov, Anatoly
  2018-10-04 13:15         ` Alejandro Lucero
  0 siblings, 1 reply; 22+ messages in thread
From: Burakov, Anatoly @ 2018-10-04 12:08 UTC (permalink / raw)
  To: Alejandro Lucero; +Cc: dev, dpdk stable

On 04-Oct-18 12:43 PM, Alejandro Lucero wrote:
> 
> 
> On Wed, Oct 3, 2018 at 1:50 PM Burakov, Anatoly 
> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote:
> 
>     On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
>      > Linux kernel uses a really high address as starting address for
>      > serving mmaps calls. If there exist addressing limitations and
>      > IOVA mode is VA, this starting address is likely too high for
>      > those devices. However, it is possible to use a lower address in
>      > the process virtual address space as with 64 bits there is a lot
>      > of available space.
>      >
>      > This patch adds an address hint as starting address for 64 bits
>      > systems.
>      >
>      > Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com
>     <mailto:alejandro.lucero@netronome.com>>
>      > ---
> 
>     <snip>
> 
>      >
>      >               mapped_addr = mmap(requested_addr, (size_t)map_sz,
>     PROT_READ,
>      >                               mmap_flags, -1, 0);
>      > +
>      >               if (mapped_addr == MAP_FAILED && allow_shrink)
> 
>     Unintended whitespace change?
> 
> 
> Yes. I'll fix it.
> 
>      >                       *size -= page_sz;
>      > -     } while (allow_shrink && mapped_addr == MAP_FAILED && *size
>      > 0);
>      > +
>      > +             if (mapped_addr != MAP_FAILED && addr_is_hint &&
>      > +                 mapped_addr != requested_addr) {
>      > +                     /* hint was not used. Try with another
>     offset */
>      > +                     munmap(mapped_addr, map_sz);
>      > +                     mapped_addr = MAP_FAILED;
>      > +                     next_baseaddr = RTE_PTR_ADD(next_baseaddr,
>     0x100000000);
> 
>     Why not increment by page size? Sure, it could take some more time to
>     allocate, but will result in less wasted memory.
> 
> 
> I though the same or even using smaller increments than hugepage size. 
> Increment the address in such amount does not mean we are wasting memory 
> but just leaving space if some mmap fails. I think it is better to leave 
> as much as space as possible just in case the data allocated in the 
> conflicted area would need to grow in the future.

Not sure i follow. Could you give an example of a scenario where leaving 
huge chunks of memory free would be preferable to just adding page size 
and starting from page-size-aligned address next time we allocate?

> 
>     -- 
>     Thanks,
>     Anatoly
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] Fwd: [PATCH v2 1/5] mem: add function for checking memsegs IOVAs addresses
       [not found]     ` <CAD+H991m6qauwX+P=muKe6bAjNLUrcBaGbxFXkMV60OVNvRgPg@mail.gmail.com>
@ 2018-10-04 12:59       ` Alejandro Lucero
  2018-10-04 15:39         ` Burakov, Anatoly
  0 siblings, 1 reply; 22+ messages in thread
From: Alejandro Lucero @ 2018-10-04 12:59 UTC (permalink / raw)
  To: dev

I sent this email only to Anatoly. Sending it again to mailing list.

On Wed, Oct 3, 2018 at 1:43 PM Burakov, Anatoly <anatoly.burakov@intel.com>
wrote:

> On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> > A device can suffer addressing limitations. This functions checks
> > memsegs have iovas within the supported range based on dma mask.
> >
> > PMD should use this during initialization if supported devices
> > suffer addressing limitations, returning an error if this function
> > returns memsegs out of range.
> >
> > Another potential usage is for emulated IOMMU hardware with addressing
> > limitations.
> >
> > It is necessary to save the most restricted dma mask for checking
> > memory allocated dynamically after initialization.
> >
> > Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
> > ---
> >   lib/librte_eal/common/eal_common_memory.c         | 56
> +++++++++++++++++++++++
> >   lib/librte_eal/common/include/rte_eal_memconfig.h |  3 ++
> >   lib/librte_eal/common/include/rte_memory.h        |  3 ++
> >   lib/librte_eal/common/malloc_heap.c               | 12 +++++
> >   lib/librte_eal/linuxapp/eal/eal.c                 |  2 +
> >   lib/librte_eal/rte_eal_version.map                |  1 +
> >   6 files changed, 77 insertions(+)
> >
> > diff --git a/lib/librte_eal/common/eal_common_memory.c
> b/lib/librte_eal/common/eal_common_memory.c
> > index fbfb1b0..bdd8f44 100644
> > --- a/lib/librte_eal/common/eal_common_memory.c
> > +++ b/lib/librte_eal/common/eal_common_memory.c
> > @@ -383,6 +383,62 @@ struct virtiova {
> >       rte_memseg_walk(dump_memseg, f);
> >   }
> >
> > +static int
> > +check_iova(const struct rte_memseg_list *msl __rte_unused,
> > +             const struct rte_memseg *ms, void *arg)
> > +{
> > +     uint64_t *mask = arg;
> > +     rte_iova_t iova;
> > +
> > +     /* higher address within segment */
> > +     iova = (ms->iova + ms->len) - 1;
> > +     if (!(iova & *mask))
> > +             return 0;
> > +
> > +     RTE_LOG(INFO, EAL, "memseg iova %"PRIx64", len %zx, out of
> range\n",
> > +                        ms->iova, ms->len);
> > +
> > +     RTE_LOG(INFO, EAL, "\tusing dma mask %"PRIx64"\n", *mask);
>
> IMO putting these as INFO is overkill. I'd prefer not to spam the output
> unless it's really important. Can this go under DEBUG?
>
>
This checks comes from a device or from the alloc_pages_on_heap when
expanding memory. If the check discovers an address out of mask, a device
can not be used or the new memory can not be allocated. I think having this
info will help to understand why the device initialization or the memory
allocation are failing.


> Also, the message is misleading. You stop before you have a chance to
> check other masks, which may restrict them even further. You're
> outputting the message about using DMA mask XXX but this may not be the
> final DMA mask.
>

Well, this is the first triggering, and it is enough for reporting the
problem and avoiding the device or the new memory to be used.

Note that the mask is per device, and for the memory allocation case, it is
the most restrictive dma mask. So there are no other masks to try.



>
> > +     /* Stop the walk and change mask */
> > +     *mask = 0;
> > +     return 1;
> > +}
> > +
> > +#if defined(RTE_ARCH_64)
> > +#define MAX_DMA_MASK_BITS 63
> > +#else
> > +#define MAX_DMA_MASK_BITS 31
> > +#endif
> > +
> > +/* check memseg iovas are within the required range based on dma mask */
> > +int __rte_experimental
> > +rte_eal_check_dma_mask(uint8_t maskbits)
> > +{
> > +     struct rte_mem_config *mcfg =
> rte_eal_get_configuration()->mem_config;
> > +     uint64_t mask;
> > +
> > +     /* sanity check */
> > +     if (maskbits > MAX_DMA_MASK_BITS) {
> > +             RTE_LOG(INFO, EAL, "wrong dma mask size %u (Max: %u)\n",
> > +                                maskbits, MAX_DMA_MASK_BITS);
>
> Should be ERR, not INFO.
>
>
Right. I will change it.


> > +             return -1;
> > +     }
> > +
> > +     /* keep the more restricted maskbit */
> > +     if (!mcfg->dma_maskbits || maskbits < mcfg->dma_maskbits)
> > +             mcfg->dma_maskbits = maskbits;
>
> Do we need to modify mcfg->dma_maskbits before we know if we're going to
> fail? Suggest using a local variable maybe?
>
>
Yes, that's true. If the check fails, the device will not be used therefore
we do not need to keep that dma mask at all.
I will change the order here.
Thanks!


> Also, i think it's a good case for ternary:
>
> bits = mcfg->dma_maskbits == 0 ?
>         maskbits :
>         RTE_MIN(maskbits, mcfg->dma_maskbits);
>
> IMO the intention looks much clearer.
>
>
Agree.


> > +
> > +     /* create dma mask */
> > +     mask = ~((1ULL << maskbits) - 1);
> > +
> > +     rte_memseg_walk(check_iova, &mask);
> > +
> > +     if (!mask)
> > +             return -1;
> > +
> > +     return 0;
> > +}
> > +
> >   /* return the number of memory channels */
> >   unsigned rte_memory_get_nchannel(void)
> >   {
> > diff --git a/lib/librte_eal/common/include/rte_eal_memconfig.h
> b/lib/librte_eal/common/include/rte_eal_memconfig.h
> > index aff0688..aea44cb 100644
> > --- a/lib/librte_eal/common/include/rte_eal_memconfig.h
> > +++ b/lib/librte_eal/common/include/rte_eal_memconfig.h
> > @@ -77,6 +77,9 @@ struct rte_mem_config {
> >        * exact same address the primary process maps it.
> >        */
> >       uint64_t mem_cfg_addr;
> > +
> > +     /* keeps the more restricted dma mask */
> > +     uint8_t dma_maskbits;
>
> This needs to be documented as an ABI break in the 18.11 release notes.
>
>
Ok. I'll add that in the next version.
Thanks


>
> --
> Thanks,
> Anatoly
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/5] mem: use address hint for mapping hugepages
  2018-10-04 12:08       ` Burakov, Anatoly
@ 2018-10-04 13:15         ` Alejandro Lucero
  2018-10-04 15:43           ` Burakov, Anatoly
  0 siblings, 1 reply; 22+ messages in thread
From: Alejandro Lucero @ 2018-10-04 13:15 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: dev, dpdk stable

On Thu, Oct 4, 2018 at 1:08 PM Burakov, Anatoly <anatoly.burakov@intel.com>
wrote:

> On 04-Oct-18 12:43 PM, Alejandro Lucero wrote:
> >
> >
> > On Wed, Oct 3, 2018 at 1:50 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote:
> >
> >     On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> >      > Linux kernel uses a really high address as starting address for
> >      > serving mmaps calls. If there exist addressing limitations and
> >      > IOVA mode is VA, this starting address is likely too high for
> >      > those devices. However, it is possible to use a lower address in
> >      > the process virtual address space as with 64 bits there is a lot
> >      > of available space.
> >      >
> >      > This patch adds an address hint as starting address for 64 bits
> >      > systems.
> >      >
> >      > Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com
> >     <mailto:alejandro.lucero@netronome.com>>
> >      > ---
> >
> >     <snip>
> >
> >      >
> >      >               mapped_addr = mmap(requested_addr, (size_t)map_sz,
> >     PROT_READ,
> >      >                               mmap_flags, -1, 0);
> >      > +
> >      >               if (mapped_addr == MAP_FAILED && allow_shrink)
> >
> >     Unintended whitespace change?
> >
> >
> > Yes. I'll fix it.
> >
> >      >                       *size -= page_sz;
> >      > -     } while (allow_shrink && mapped_addr == MAP_FAILED && *size
> >      > 0);
> >      > +
> >      > +             if (mapped_addr != MAP_FAILED && addr_is_hint &&
> >      > +                 mapped_addr != requested_addr) {
> >      > +                     /* hint was not used. Try with another
> >     offset */
> >      > +                     munmap(mapped_addr, map_sz);
> >      > +                     mapped_addr = MAP_FAILED;
> >      > +                     next_baseaddr = RTE_PTR_ADD(next_baseaddr,
> >     0x100000000);
> >
> >     Why not increment by page size? Sure, it could take some more time to
> >     allocate, but will result in less wasted memory.
> >
> >
> > I though the same or even using smaller increments than hugepage size.
> > Increment the address in such amount does not mean we are wasting memory
> > but just leaving space if some mmap fails. I think it is better to leave
> > as much as space as possible just in case the data allocated in the
> > conflicted area would need to grow in the future.
>
> Not sure i follow. Could you give an example of a scenario where leaving
> huge chunks of memory free would be preferable to just adding page size
> and starting from page-size-aligned address next time we allocate?
>
>
Usually there is nothing at 4GB address in 64 bit processes, usually the
text section being the first process region mapped and currently at far
higher than 4GB. If there is something mapped there before executing the
EAL hugepage/memory initialization code, not sure what it will be for, but
maybe it needs to grow using contiguous virtual addresses. As I say, no
idea what this could be used for, but the shorter the space when trying
again in this code, the less likely that flexibility could be there.

Maybe making the increment smaller just makes sense for virtual address
space randomization for security reasons.

Anyway, there is a lot of space with 64 bits, and, IMHO, this should not be
a problem while the increment is negligible against 64 bits address space
size, and 4GB are so negligible in this case as 4 bytes are to 4GB.


> >
> >     --
> >     Thanks,
> >     Anatoly
> >
>
>
> --
> Thanks,
> Anatoly
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/5] bus/pci: use IOVAs check when setting IOVA mode
  2018-10-03 12:55   ` Burakov, Anatoly
@ 2018-10-04 13:35     ` Alejandro Lucero
  2018-10-04 15:49       ` Burakov, Anatoly
  0 siblings, 1 reply; 22+ messages in thread
From: Alejandro Lucero @ 2018-10-04 13:35 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Maxime Coquelin

On Wed, Oct 3, 2018 at 1:56 PM Burakov, Anatoly <anatoly.burakov@intel.com>
wrote:

> On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> > Although VT-d emulation currently only supports 39 bits, it could
> > be iovas being within that supported range. This patch allows
> > IOVA mode in such a case.
> >
> > Indeed, memory initialization code can be modified for using lower
> > virtual addresses than those used by the kernel for 64 bits processes
> > by default, and therefore memsegs iovas can use 39 bits or less for
> > most system. And this is likely 100% true for VMs.
> >
> > Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
> > ---
> >   drivers/bus/pci/linux/pci.c | 15 ++++++++++-----
> >   1 file changed, 10 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
> > index 04648ac..215dc10 100644
> > --- a/drivers/bus/pci/linux/pci.c
> > +++ b/drivers/bus/pci/linux/pci.c
> > @@ -588,10 +588,11 @@
> >       fclose(fp);
> >
> >       mgaw = ((vtd_cap_reg & VTD_CAP_MGAW_MASK) >> VTD_CAP_MGAW_SHIFT) +
> 1;
> > -     if (mgaw < X86_VA_WIDTH)
> > -             return false;
> >
> > -     return true;
> > +     if (!rte_eal_check_dma_mask(mgaw))
> > +             return true;
> > +     else
> > +             return false;
>
> return rte_eal_check_dma_mask(mgaw) == 0; ?
>

I guess that works and is more elegant.
Thanks.


>
> >   }
> >   #elif defined(RTE_ARCH_PPC_64)
> >   static bool
> > @@ -615,13 +616,17 @@
> >   {
> >       struct rte_pci_device *dev = NULL;
> >       struct rte_pci_driver *drv = NULL;
> > +     int iommu_dma_mask_check_done = 0;
> >
> >       FOREACH_DRIVER_ON_PCIBUS(drv) {
> >               FOREACH_DEVICE_ON_PCIBUS(dev) {
> >                       if (!rte_pci_match(drv, dev))
> >                               continue;
> > -                     if (!pci_one_device_iommu_support_va(dev))
> > -                             return false;
> > +                     if (!iommu_dma_mask_check_done) {
> > +                             if (!pci_one_device_iommu_support_va(dev))
> > +                                     return false;
> > +                             iommu_dma_mask_check_done  = 1;
> > +                     }
> >               }
>
> The commit message doesn't explain why are we only checking a single
> device. Indeed, i am not 100% clear as to why, so some explanation in
> the commit message and preferably a comment in code would be more than
> welcome :)
>
>
Because the pci_one_device_iommu_support_va function does always the same
whatever the device is used in the call.
The code uses the device for looking at /sys/bus/pci/devices/   but then it
uses a link to iommu which will be the same for all
the devices. Note that some can refer to dmar0 and others to dmar1, but the
IOMMU capabilities are the same.

The limitation here is not a PCI device but the IOMMU hardware itself. The
first call to pci_one_device_iommu_support_va will check
if all the hugepages addresses are within the supported DMA range by the
IOMMU hw. If it fails, that is.

Now that I'm explaining this, I notice it is the same for any case. If the
check is good, no more checks are needed. This assumes there is just one
IOMMU hardware or if more than one (I have NUMA systems with one IOMMU unit
per socket) they are all the same hardware version. Adding Maxime in the
thread for confirming this and asking him about my previous statement.


> >       }
> >       return true;
> >
>
>
> --
> Thanks,
> Anatoly
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] Fwd: [PATCH v2 1/5] mem: add function for checking memsegs IOVAs addresses
  2018-10-04 12:59       ` [dpdk-dev] Fwd: " Alejandro Lucero
@ 2018-10-04 15:39         ` Burakov, Anatoly
  2018-10-04 17:41           ` Alejandro Lucero
  0 siblings, 1 reply; 22+ messages in thread
From: Burakov, Anatoly @ 2018-10-04 15:39 UTC (permalink / raw)
  To: Alejandro Lucero, dev

On 04-Oct-18 1:59 PM, Alejandro Lucero wrote:
> I sent this email only to Anatoly. Sending it again to mailing list.
> 
> On Wed, Oct 3, 2018 at 1:43 PM Burakov, Anatoly <anatoly.burakov@intel.com>
> wrote:
> 
>> On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
>>> A device can suffer addressing limitations. This functions checks
>>> memsegs have iovas within the supported range based on dma mask.
>>>
>>> PMD should use this during initialization if supported devices
>>> suffer addressing limitations, returning an error if this function
>>> returns memsegs out of range.
>>>
>>> Another potential usage is for emulated IOMMU hardware with addressing
>>> limitations.
>>>
>>> It is necessary to save the most restricted dma mask for checking
>>> memory allocated dynamically after initialization.
>>>
>>> Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
>>> ---
>>>    lib/librte_eal/common/eal_common_memory.c         | 56
>> +++++++++++++++++++++++
>>>    lib/librte_eal/common/include/rte_eal_memconfig.h |  3 ++
>>>    lib/librte_eal/common/include/rte_memory.h        |  3 ++
>>>    lib/librte_eal/common/malloc_heap.c               | 12 +++++
>>>    lib/librte_eal/linuxapp/eal/eal.c                 |  2 +
>>>    lib/librte_eal/rte_eal_version.map                |  1 +
>>>    6 files changed, 77 insertions(+)
>>>
>>> diff --git a/lib/librte_eal/common/eal_common_memory.c
>> b/lib/librte_eal/common/eal_common_memory.c
>>> index fbfb1b0..bdd8f44 100644
>>> --- a/lib/librte_eal/common/eal_common_memory.c
>>> +++ b/lib/librte_eal/common/eal_common_memory.c
>>> @@ -383,6 +383,62 @@ struct virtiova {
>>>        rte_memseg_walk(dump_memseg, f);
>>>    }
>>>
>>> +static int
>>> +check_iova(const struct rte_memseg_list *msl __rte_unused,
>>> +             const struct rte_memseg *ms, void *arg)
>>> +{
>>> +     uint64_t *mask = arg;
>>> +     rte_iova_t iova;
>>> +
>>> +     /* higher address within segment */
>>> +     iova = (ms->iova + ms->len) - 1;
>>> +     if (!(iova & *mask))
>>> +             return 0;
>>> +
>>> +     RTE_LOG(INFO, EAL, "memseg iova %"PRIx64", len %zx, out of
>> range\n",
>>> +                        ms->iova, ms->len);
>>> +
>>> +     RTE_LOG(INFO, EAL, "\tusing dma mask %"PRIx64"\n", *mask);
>>
>> IMO putting these as INFO is overkill. I'd prefer not to spam the output
>> unless it's really important. Can this go under DEBUG?
>>
>>
> This checks comes from a device or from the alloc_pages_on_heap when
> expanding memory. If the check discovers an address out of mask, a device
> can not be used or the new memory can not be allocated. I think having this
> info will help to understand why the device initialization or the memory
> allocation are failing.
> 

If this text is only displayed whenever there's an error, the log output 
should be ERR, not INFO. If the error may or may not happen depending on 
who called this function, then this information is not important enough 
to display to the user (it should be displayed in the error handler of 
the caller), and DEBUG should suffice.

> 
>> Also, the message is misleading. You stop before you have a chance to
>> check other masks, which may restrict them even further. You're
>> outputting the message about using DMA mask XXX but this may not be the
>> final DMA mask.
>>
> 
> Well, this is the first triggering, and it is enough for reporting the
> problem and avoiding the device or the new memory to be used.
> 
> Note that the mask is per device, and for the memory allocation case, it is
> the most restrictive dma mask. So there are no other masks to try.

Fair enough.

> 
> 
> 
>>
>>> +     /* Stop the walk and change mask */
>>> +     *mask = 0;
>>> +     return 1;

No need for out-of-band communication, _walk() function will return 1 if 
walk was stopped prematurely. Just check return value of walk().

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/5] mem: use address hint for mapping hugepages
  2018-10-04 13:15         ` Alejandro Lucero
@ 2018-10-04 15:43           ` Burakov, Anatoly
  2018-10-04 17:58             ` Alejandro Lucero
  0 siblings, 1 reply; 22+ messages in thread
From: Burakov, Anatoly @ 2018-10-04 15:43 UTC (permalink / raw)
  To: Alejandro Lucero; +Cc: dev, dpdk stable

On 04-Oct-18 2:15 PM, Alejandro Lucero wrote:
> 
> 
> On Thu, Oct 4, 2018 at 1:08 PM Burakov, Anatoly 
> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote:
> 
>     On 04-Oct-18 12:43 PM, Alejandro Lucero wrote:
>      >
>      >
>      > On Wed, Oct 3, 2018 at 1:50 PM Burakov, Anatoly
>      > <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>
>     <mailto:anatoly.burakov@intel.com
>     <mailto:anatoly.burakov@intel.com>>> wrote:
>      >
>      >     On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
>      >      > Linux kernel uses a really high address as starting
>     address for
>      >      > serving mmaps calls. If there exist addressing limitations and
>      >      > IOVA mode is VA, this starting address is likely too high for
>      >      > those devices. However, it is possible to use a lower
>     address in
>      >      > the process virtual address space as with 64 bits there is
>     a lot
>      >      > of available space.
>      >      >
>      >      > This patch adds an address hint as starting address for 64
>     bits
>      >      > systems.
>      >      >
>      >      > Signed-off-by: Alejandro Lucero
>     <alejandro.lucero@netronome.com <mailto:alejandro.lucero@netronome.com>
>      >     <mailto:alejandro.lucero@netronome.com
>     <mailto:alejandro.lucero@netronome.com>>>
>      >      > ---
>      >
>      >     <snip>
>      >
>      >      >
>      >      >               mapped_addr = mmap(requested_addr,
>     (size_t)map_sz,
>      >     PROT_READ,
>      >      >                               mmap_flags, -1, 0);
>      >      > +
>      >      >               if (mapped_addr == MAP_FAILED && allow_shrink)
>      >
>      >     Unintended whitespace change?
>      >
>      >
>      > Yes. I'll fix it.
>      >
>      >      >                       *size -= page_sz;
>      >      > -     } while (allow_shrink && mapped_addr == MAP_FAILED
>     && *size
>      >      > 0);
>      >      > +
>      >      > +             if (mapped_addr != MAP_FAILED && addr_is_hint &&
>      >      > +                 mapped_addr != requested_addr) {
>      >      > +                     /* hint was not used. Try with another
>      >     offset */
>      >      > +                     munmap(mapped_addr, map_sz);
>      >      > +                     mapped_addr = MAP_FAILED;
>      >      > +                     next_baseaddr =
>     RTE_PTR_ADD(next_baseaddr,
>      >     0x100000000);
>      >
>      >     Why not increment by page size? Sure, it could take some more
>     time to
>      >     allocate, but will result in less wasted memory.
>      >
>      >
>      > I though the same or even using smaller increments than hugepage
>     size.
>      > Increment the address in such amount does not mean we are wasting
>     memory
>      > but just leaving space if some mmap fails. I think it is better
>     to leave
>      > as much as space as possible just in case the data allocated in the
>      > conflicted area would need to grow in the future.
> 
>     Not sure i follow. Could you give an example of a scenario where
>     leaving
>     huge chunks of memory free would be preferable to just adding page size
>     and starting from page-size-aligned address next time we allocate?
> 
> 
> Usually there is nothing at 4GB address in 64 bit processes, usually the 
> text section being the first process region mapped and currently at far 
> higher than 4GB. If there is something mapped there before executing the 
> EAL hugepage/memory initialization code, not sure what it will be for, 
> but maybe it needs to grow using contiguous virtual addresses. As I say, 
> no idea what this could be used for, but the shorter the space when 
> trying again in this code, the less likely that flexibility could be there.

But you're already leaving holes there, what difference does it make? I 
mean, it's not important, i'm just not sure why the arbitrary 
0x100000000 increment instead of page size. Most of the calls into this 
function are from init code, and with init code we're usually calling 
this function quite a few times in succession (especially during memseg 
list allocations), so you are skipping space that could've been used for 
that.

(btw if you are to use this constant, it should be a macro, not a raw 
constant)

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/5] bus/pci: use IOVAs check when setting IOVA mode
  2018-10-04 13:35     ` Alejandro Lucero
@ 2018-10-04 15:49       ` Burakov, Anatoly
  2018-10-04 17:59         ` Alejandro Lucero
  0 siblings, 1 reply; 22+ messages in thread
From: Burakov, Anatoly @ 2018-10-04 15:49 UTC (permalink / raw)
  To: Alejandro Lucero, dev, Maxime Coquelin

On 04-Oct-18 2:35 PM, Alejandro Lucero wrote:
> 
> 
> On Wed, Oct 3, 2018 at 1:56 PM Burakov, Anatoly 
> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote:
> 
>     On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
>      > Although VT-d emulation currently only supports 39 bits, it could
>      > be iovas being within that supported range. This patch allows
>      > IOVA mode in such a case.
>      >
>      > Indeed, memory initialization code can be modified for using lower
>      > virtual addresses than those used by the kernel for 64 bits processes
>      > by default, and therefore memsegs iovas can use 39 bits or less for
>      > most system. And this is likely 100% true for VMs.
>      >
>      > Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com
>     <mailto:alejandro.lucero@netronome.com>>
>      > ---
>      >   drivers/bus/pci/linux/pci.c | 15 ++++++++++-----
>      >   1 file changed, 10 insertions(+), 5 deletions(-)
>      >
>      > diff --git a/drivers/bus/pci/linux/pci.c
>     b/drivers/bus/pci/linux/pci.c
>      > index 04648ac..215dc10 100644
>      > --- a/drivers/bus/pci/linux/pci.c
>      > +++ b/drivers/bus/pci/linux/pci.c
>      > @@ -588,10 +588,11 @@
>      >       fclose(fp);
>      >
>      >       mgaw = ((vtd_cap_reg & VTD_CAP_MGAW_MASK) >>
>     VTD_CAP_MGAW_SHIFT) + 1;
>      > -     if (mgaw < X86_VA_WIDTH)
>      > -             return false;
>      >
>      > -     return true;
>      > +     if (!rte_eal_check_dma_mask(mgaw))
>      > +             return true;
>      > +     else
>      > +             return false;
> 
>     return rte_eal_check_dma_mask(mgaw) == 0; ?
> 
> 
> I guess that works and is more elegant.
> Thanks.
> 
> 
>      >   }
>      >   #elif defined(RTE_ARCH_PPC_64)
>      >   static bool
>      > @@ -615,13 +616,17 @@
>      >   {
>      >       struct rte_pci_device *dev = NULL;
>      >       struct rte_pci_driver *drv = NULL;
>      > +     int iommu_dma_mask_check_done = 0;
>      >
>      >       FOREACH_DRIVER_ON_PCIBUS(drv) {
>      >               FOREACH_DEVICE_ON_PCIBUS(dev) {
>      >                       if (!rte_pci_match(drv, dev))
>      >                               continue;
>      > -                     if (!pci_one_device_iommu_support_va(dev))
>      > -                             return false;
>      > +                     if (!iommu_dma_mask_check_done) {
>      > +                             if
>     (!pci_one_device_iommu_support_va(dev))
>      > +                                     return false;
>      > +                             iommu_dma_mask_check_done  = 1;
>      > +                     }
>      >               }
> 
>     The commit message doesn't explain why are we only checking a single
>     device. Indeed, i am not 100% clear as to why, so some explanation in
>     the commit message and preferably a comment in code would be more than
>     welcome :)
> 
> 
> Because the pci_one_device_iommu_support_va function does always the 
> same whatever the device is used in the call.

So, this code was always wrong and needlessly checked each device when 
it could've checked it a single time? OK, that makes it a bit clearer. 
Still, needs to be documented in comments/commit message :) The commit 
message IMO looks quite irrelevant to what happens in the commit. It 
almost feels like this commit should be split in two - first change the 
mgaw check, and then fix the PCI bus code to not check needlessly.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] Fwd: [PATCH v2 1/5] mem: add function for checking memsegs IOVAs addresses
  2018-10-04 15:39         ` Burakov, Anatoly
@ 2018-10-04 17:41           ` Alejandro Lucero
  0 siblings, 0 replies; 22+ messages in thread
From: Alejandro Lucero @ 2018-10-04 17:41 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: dev

On Thu, Oct 4, 2018 at 4:39 PM Burakov, Anatoly <anatoly.burakov@intel.com>
wrote:

> On 04-Oct-18 1:59 PM, Alejandro Lucero wrote:
> > I sent this email only to Anatoly. Sending it again to mailing list.
> >
> > On Wed, Oct 3, 2018 at 1:43 PM Burakov, Anatoly <
> anatoly.burakov@intel.com>
> > wrote:
> >
> >> On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> >>> A device can suffer addressing limitations. This functions checks
> >>> memsegs have iovas within the supported range based on dma mask.
> >>>
> >>> PMD should use this during initialization if supported devices
> >>> suffer addressing limitations, returning an error if this function
> >>> returns memsegs out of range.
> >>>
> >>> Another potential usage is for emulated IOMMU hardware with addressing
> >>> limitations.
> >>>
> >>> It is necessary to save the most restricted dma mask for checking
> >>> memory allocated dynamically after initialization.
> >>>
> >>> Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
> >>> ---
> >>>    lib/librte_eal/common/eal_common_memory.c         | 56
> >> +++++++++++++++++++++++
> >>>    lib/librte_eal/common/include/rte_eal_memconfig.h |  3 ++
> >>>    lib/librte_eal/common/include/rte_memory.h        |  3 ++
> >>>    lib/librte_eal/common/malloc_heap.c               | 12 +++++
> >>>    lib/librte_eal/linuxapp/eal/eal.c                 |  2 +
> >>>    lib/librte_eal/rte_eal_version.map                |  1 +
> >>>    6 files changed, 77 insertions(+)
> >>>
> >>> diff --git a/lib/librte_eal/common/eal_common_memory.c
> >> b/lib/librte_eal/common/eal_common_memory.c
> >>> index fbfb1b0..bdd8f44 100644
> >>> --- a/lib/librte_eal/common/eal_common_memory.c
> >>> +++ b/lib/librte_eal/common/eal_common_memory.c
> >>> @@ -383,6 +383,62 @@ struct virtiova {
> >>>        rte_memseg_walk(dump_memseg, f);
> >>>    }
> >>>
> >>> +static int
> >>> +check_iova(const struct rte_memseg_list *msl __rte_unused,
> >>> +             const struct rte_memseg *ms, void *arg)
> >>> +{
> >>> +     uint64_t *mask = arg;
> >>> +     rte_iova_t iova;
> >>> +
> >>> +     /* higher address within segment */
> >>> +     iova = (ms->iova + ms->len) - 1;
> >>> +     if (!(iova & *mask))
> >>> +             return 0;
> >>> +
> >>> +     RTE_LOG(INFO, EAL, "memseg iova %"PRIx64", len %zx, out of
> >> range\n",
> >>> +                        ms->iova, ms->len);
> >>> +
> >>> +     RTE_LOG(INFO, EAL, "\tusing dma mask %"PRIx64"\n", *mask);
> >>
> >> IMO putting these as INFO is overkill. I'd prefer not to spam the output
> >> unless it's really important. Can this go under DEBUG?
> >>
> >>
> > This checks comes from a device or from the alloc_pages_on_heap when
> > expanding memory. If the check discovers an address out of mask, a device
> > can not be used or the new memory can not be allocated. I think having
> this
> > info will help to understand why the device initialization or the memory
> > allocation are failing.
> >
>
> If this text is only displayed whenever there's an error, the log output
> should be ERR, not INFO. If the error may or may not happen depending on
> who called this function, then this information is not important enough
> to display to the user (it should be displayed in the error handler of
> the caller), and DEBUG should suffice.
>
>
Ok. Makes sense. I will change it.
Thanks


> >
> >> Also, the message is misleading. You stop before you have a chance to
> >> check other masks, which may restrict them even further. You're
> >> outputting the message about using DMA mask XXX but this may not be the
> >> final DMA mask.
> >>
> >
> > Well, this is the first triggering, and it is enough for reporting the
> > problem and avoiding the device or the new memory to be used.
> >
> > Note that the mask is per device, and for the memory allocation case, it
> is
> > the most restrictive dma mask. So there are no other masks to try.
>
> Fair enough.
>
> >
> >
> >
> >>
> >>> +     /* Stop the walk and change mask */
> >>> +     *mask = 0;
> >>> +     return 1;
>
> No need for out-of-band communication, _walk() function will return 1 if
> walk was stopped prematurely. Just check return value of walk().
>
>
Yes, that's right. I will use the return from the walk function instead.
Thanks



> --
> Thanks,
> Anatoly
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/5] mem: use address hint for mapping hugepages
  2018-10-04 15:43           ` Burakov, Anatoly
@ 2018-10-04 17:58             ` Alejandro Lucero
  0 siblings, 0 replies; 22+ messages in thread
From: Alejandro Lucero @ 2018-10-04 17:58 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: dev, dpdk stable

On Thu, Oct 4, 2018 at 4:43 PM Burakov, Anatoly <anatoly.burakov@intel.com>
wrote:

> On 04-Oct-18 2:15 PM, Alejandro Lucero wrote:
> >
> >
> > On Thu, Oct 4, 2018 at 1:08 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote:
> >
> >     On 04-Oct-18 12:43 PM, Alejandro Lucero wrote:
> >      >
> >      >
> >      > On Wed, Oct 3, 2018 at 1:50 PM Burakov, Anatoly
> >      > <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>
> >     <mailto:anatoly.burakov@intel.com
> >     <mailto:anatoly.burakov@intel.com>>> wrote:
> >      >
> >      >     On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> >      >      > Linux kernel uses a really high address as starting
> >     address for
> >      >      > serving mmaps calls. If there exist addressing limitations
> and
> >      >      > IOVA mode is VA, this starting address is likely too high
> for
> >      >      > those devices. However, it is possible to use a lower
> >     address in
> >      >      > the process virtual address space as with 64 bits there is
> >     a lot
> >      >      > of available space.
> >      >      >
> >      >      > This patch adds an address hint as starting address for 64
> >     bits
> >      >      > systems.
> >      >      >
> >      >      > Signed-off-by: Alejandro Lucero
> >     <alejandro.lucero@netronome.com <mailto:
> alejandro.lucero@netronome.com>
> >      >     <mailto:alejandro.lucero@netronome.com
> >     <mailto:alejandro.lucero@netronome.com>>>
> >      >      > ---
> >      >
> >      >     <snip>
> >      >
> >      >      >
> >      >      >               mapped_addr = mmap(requested_addr,
> >     (size_t)map_sz,
> >      >     PROT_READ,
> >      >      >                               mmap_flags, -1, 0);
> >      >      > +
> >      >      >               if (mapped_addr == MAP_FAILED &&
> allow_shrink)
> >      >
> >      >     Unintended whitespace change?
> >      >
> >      >
> >      > Yes. I'll fix it.
> >      >
> >      >      >                       *size -= page_sz;
> >      >      > -     } while (allow_shrink && mapped_addr == MAP_FAILED
> >     && *size
> >      >      > 0);
> >      >      > +
> >      >      > +             if (mapped_addr != MAP_FAILED &&
> addr_is_hint &&
> >      >      > +                 mapped_addr != requested_addr) {
> >      >      > +                     /* hint was not used. Try with
> another
> >      >     offset */
> >      >      > +                     munmap(mapped_addr, map_sz);
> >      >      > +                     mapped_addr = MAP_FAILED;
> >      >      > +                     next_baseaddr =
> >     RTE_PTR_ADD(next_baseaddr,
> >      >     0x100000000);
> >      >
> >      >     Why not increment by page size? Sure, it could take some more
> >     time to
> >      >     allocate, but will result in less wasted memory.
> >      >
> >      >
> >      > I though the same or even using smaller increments than hugepage
> >     size.
> >      > Increment the address in such amount does not mean we are wasting
> >     memory
> >      > but just leaving space if some mmap fails. I think it is better
> >     to leave
> >      > as much as space as possible just in case the data allocated in
> the
> >      > conflicted area would need to grow in the future.
> >
> >     Not sure i follow. Could you give an example of a scenario where
> >     leaving
> >     huge chunks of memory free would be preferable to just adding page
> size
> >     and starting from page-size-aligned address next time we allocate?
> >
> >
> > Usually there is nothing at 4GB address in 64 bit processes, usually the
> > text section being the first process region mapped and currently at far
> > higher than 4GB. If there is something mapped there before executing the
> > EAL hugepage/memory initialization code, not sure what it will be for,
> > but maybe it needs to grow using contiguous virtual addresses. As I say,
> > no idea what this could be used for, but the shorter the space when
> > trying again in this code, the less likely that flexibility could be
> there.
>
> But you're already leaving holes there, what difference does it make? I
> mean, it's not important, i'm just not sure why the arbitrary
> 0x100000000 increment instead of page size. Most of the calls into this
> function are from init code, and with init code we're usually calling
> this function quite a few times in succession (especially during memseg
> list allocations), so you are skipping space that could've been used for
> that.
>
>
Note that the increment is pagesize if there is no problem and the 4GB
increment is just used if that specific address fails.
I'm not against change this to always use hugepage size instead and it
seems my previous comment did not convince you. So I'll change that because
I can not sustain my case without any real data. :-)



> (btw if you are to use this constant, it should be a macro, not a raw
> constant)
>
> --
> Thanks,
> Anatoly
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/5] bus/pci: use IOVAs check when setting IOVA mode
  2018-10-04 15:49       ` Burakov, Anatoly
@ 2018-10-04 17:59         ` Alejandro Lucero
  0 siblings, 0 replies; 22+ messages in thread
From: Alejandro Lucero @ 2018-10-04 17:59 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: dev, Maxime Coquelin

On Thu, Oct 4, 2018 at 4:49 PM Burakov, Anatoly <anatoly.burakov@intel.com>
wrote:

> On 04-Oct-18 2:35 PM, Alejandro Lucero wrote:
> >
> >
> > On Wed, Oct 3, 2018 at 1:56 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote:
> >
> >     On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> >      > Although VT-d emulation currently only supports 39 bits, it could
> >      > be iovas being within that supported range. This patch allows
> >      > IOVA mode in such a case.
> >      >
> >      > Indeed, memory initialization code can be modified for using lower
> >      > virtual addresses than those used by the kernel for 64 bits
> processes
> >      > by default, and therefore memsegs iovas can use 39 bits or less
> for
> >      > most system. And this is likely 100% true for VMs.
> >      >
> >      > Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com
> >     <mailto:alejandro.lucero@netronome.com>>
> >      > ---
> >      >   drivers/bus/pci/linux/pci.c | 15 ++++++++++-----
> >      >   1 file changed, 10 insertions(+), 5 deletions(-)
> >      >
> >      > diff --git a/drivers/bus/pci/linux/pci.c
> >     b/drivers/bus/pci/linux/pci.c
> >      > index 04648ac..215dc10 100644
> >      > --- a/drivers/bus/pci/linux/pci.c
> >      > +++ b/drivers/bus/pci/linux/pci.c
> >      > @@ -588,10 +588,11 @@
> >      >       fclose(fp);
> >      >
> >      >       mgaw = ((vtd_cap_reg & VTD_CAP_MGAW_MASK) >>
> >     VTD_CAP_MGAW_SHIFT) + 1;
> >      > -     if (mgaw < X86_VA_WIDTH)
> >      > -             return false;
> >      >
> >      > -     return true;
> >      > +     if (!rte_eal_check_dma_mask(mgaw))
> >      > +             return true;
> >      > +     else
> >      > +             return false;
> >
> >     return rte_eal_check_dma_mask(mgaw) == 0; ?
> >
> >
> > I guess that works and is more elegant.
> > Thanks.
> >
> >
> >      >   }
> >      >   #elif defined(RTE_ARCH_PPC_64)
> >      >   static bool
> >      > @@ -615,13 +616,17 @@
> >      >   {
> >      >       struct rte_pci_device *dev = NULL;
> >      >       struct rte_pci_driver *drv = NULL;
> >      > +     int iommu_dma_mask_check_done = 0;
> >      >
> >      >       FOREACH_DRIVER_ON_PCIBUS(drv) {
> >      >               FOREACH_DEVICE_ON_PCIBUS(dev) {
> >      >                       if (!rte_pci_match(drv, dev))
> >      >                               continue;
> >      > -                     if (!pci_one_device_iommu_support_va(dev))
> >      > -                             return false;
> >      > +                     if (!iommu_dma_mask_check_done) {
> >      > +                             if
> >     (!pci_one_device_iommu_support_va(dev))
> >      > +                                     return false;
> >      > +                             iommu_dma_mask_check_done  = 1;
> >      > +                     }
> >      >               }
> >
> >     The commit message doesn't explain why are we only checking a single
> >     device. Indeed, i am not 100% clear as to why, so some explanation in
> >     the commit message and preferably a comment in code would be more
> than
> >     welcome :)
> >
> >
> > Because the pci_one_device_iommu_support_va function does always the
> > same whatever the device is used in the call.
>
> So, this code was always wrong and needlessly checked each device when
> it could've checked it a single time? OK, that makes it a bit clearer.
> Still, needs to be documented in comments/commit message :) The commit
> message IMO looks quite irrelevant to what happens in the commit. It
> almost feels like this commit should be split in two - first change the
> mgaw check, and then fix the PCI bus code to not check needlessly.
>
>
Ok. Maybe that's better. I will do that in next version.
Thanks


> --
> Thanks,
> Anatoly
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-10-04 17:59 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-31 12:50 [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask Alejandro Lucero
2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 1/5] mem: add function for checking memsegs IOVAs addresses Alejandro Lucero
2018-10-03 12:43   ` Burakov, Anatoly
     [not found]     ` <CAD+H991m6qauwX+P=muKe6bAjNLUrcBaGbxFXkMV60OVNvRgPg@mail.gmail.com>
2018-10-04 12:59       ` [dpdk-dev] Fwd: " Alejandro Lucero
2018-10-04 15:39         ` Burakov, Anatoly
2018-10-04 17:41           ` Alejandro Lucero
2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 2/5] mem: use address hint for mapping hugepages Alejandro Lucero
2018-10-03 12:50   ` Burakov, Anatoly
2018-10-04 11:43     ` Alejandro Lucero
2018-10-04 12:08       ` Burakov, Anatoly
2018-10-04 13:15         ` Alejandro Lucero
2018-10-04 15:43           ` Burakov, Anatoly
2018-10-04 17:58             ` Alejandro Lucero
2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 3/5] bus/pci: use IOVAs check when setting IOVA mode Alejandro Lucero
2018-10-03 12:55   ` Burakov, Anatoly
2018-10-04 13:35     ` Alejandro Lucero
2018-10-04 15:49       ` Burakov, Anatoly
2018-10-04 17:59         ` Alejandro Lucero
2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 4/5] net/nfp: check hugepages IOVAs based on DMA mask Alejandro Lucero
2018-08-31 12:50 ` [dpdk-dev] [PATCH v2 5/5] net/nfp: support IOVA VA mode Alejandro Lucero
2018-10-02 16:33 ` [dpdk-dev] [PATCH v2 0/5] use IOVAs check based on DMA mask Alejandro Lucero
2018-10-02 21:21   ` Thomas Monjalon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).