DPDK patches and discussions
From: Bruce Richardson <bruce.richardson@intel.com>
To: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>
Cc: Gowrishankar <gowrishankar.m@linux.vnet.ibm.com>,
	dev@dpdk.org, chaozhu@linux.vnet.ibm.com,
	David Marchand <david.marchand@6wind.com>
Subject: Re: [dpdk-dev] [PATCH] eal/ppc: fix secondary process to map hugepages in correct order
Date: Tue, 22 Mar 2016 17:10:46 +0000	[thread overview]
Message-ID: <20160322171046.GG20448@bricha3-MOBL3> (raw)
In-Reply-To: <56F17454.3010907@intel.com>

On Tue, Mar 22, 2016 at 04:35:32PM +0000, Sergio Gonzalez Monroy wrote:
> First of all, forgive my ignorance regarding ppc64 if the questions are
> naive, but after having a look at the existing ppc64 code and now at this
> patch, why are we doing this reverse mapping at all?
> 
> I guess the question revolves around the comment in eal_memory.c:
> 1316     /* On PPC64 architecture, the mmap always start from higher
> 1317      * virtual address to lower address. Here, both the physical
> 1318      * address and virtual address are in descending order */
> 
> From looking at the code, for ppc64 we do the qsort in reverse order and
> thereafter everything looks to be done to account for that reverse sorting.
> 
> CC: Chao Zhu and David Marchand as original author and reviewer of the code.
> 
> Sergio
>

Just to add my 2c here. At one point, with some i686 installs I believe - I
don't remember the specific OS/kernel - we found that the mmap calls were
returning the highest free address first and then working downwards, much like
what seems to be described here. To fix this we changed the mmap code from
assuming that addresses are mapped upwards to instead explicitly requesting a
large free block of memory (an mmap of /dev/zero) to find a free address-space
range of the correct size, and then explicitly mmapping each individual page to
the appropriate place in that free range. With this scheme it didn't matter
whether the OS tried to mmap the pages from the highest or lowest address,
because we always told the OS where to put each page (and we knew the slot was
free from the earlier block mmap).

Would this scheme not also work for PPC in a similar way? (Again, forgive my
unfamiliarity with PPC! :-) )
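
Roughly, the idea looks like the sketch below. To be clear, this is just a
sketch of the scheme, not the actual EAL code: the hugepage file path, the
page count and the 2MB page size are assumptions for illustration, and error
handling is stripped down.

/*
 * Minimal sketch of the "reserve, then place" approach described above.
 * File path, page count and page size are made-up assumptions.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define NUM_PAGES 4
#define PAGE_SZ   (2UL * 1024 * 1024)   /* assume 2MB hugepages */

int main(void)
{
	size_t total = NUM_PAGES * PAGE_SZ;

	/* Step 1: reserve one contiguous virtual address range big enough
	 * for all pages.  The original code used an mmap of /dev/zero; an
	 * anonymous PROT_NONE mapping gives the same reservation here.
	 * Extra space is reserved so the start can be hugepage-aligned,
	 * which hugetlbfs mappings require. */
	void *reserve = mmap(NULL, total + PAGE_SZ, PROT_NONE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (reserve == MAP_FAILED)
		return EXIT_FAILURE;
	void *base = (void *)(((uintptr_t)reserve + PAGE_SZ - 1) &
			~(PAGE_SZ - 1));

	/* Step 2: map each hugepage file at an explicit offset inside that
	 * reserved range.  MAP_FIXED tells the kernel exactly where each
	 * page goes, so it no longer matters whether the kernel hands out
	 * addresses top-down or bottom-up. */
	for (unsigned i = 0; i < NUM_PAGES; i++) {
		char path[64];
		/* hypothetical hugetlbfs mount and file naming */
		snprintf(path, sizeof(path), "/mnt/huge/rtemap_%u", i);

		int fd = open(path, O_RDWR);
		if (fd < 0)
			return EXIT_FAILURE;

		void *want = (char *)base + i * PAGE_SZ;
		void *addr = mmap(want, PAGE_SZ, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_FIXED, fd, 0);
		close(fd);
		if (addr == MAP_FAILED || addr != want)
			return EXIT_FAILURE;
	}

	printf("mapped %d hugepages contiguously starting at %p\n",
			NUM_PAGES, base);
	return EXIT_SUCCESS;
}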

/Bruce

> 
> On 07/03/2016 14:13, Gowrishankar wrote:
> >From: Gowri Shankar <gowrishankar.m@linux.vnet.ibm.com>
> >
> >For the address space of a secondary process to map hugepages from every
> >segment of the primary process, the hugepage_file entries have to be mapped
> >in reverse order from the list that the primary process recorded for each
> >segment. This is because, on ppc64, hugepages are sorted by decreasing
> >address.
> >
> >Signed-off-by: Gowrishankar <gowrishankar.m@linux.vnet.ibm.com>
> >---
> >  lib/librte_eal/linuxapp/eal/eal_memory.c |   26 ++++++++++++++++----------
> >  1 file changed, 16 insertions(+), 10 deletions(-)
> >
> >diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
> >index 5b9132c..6aea5d0 100644
> >--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> >+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> >@@ -1400,7 +1400,7 @@ rte_eal_hugepage_attach(void)
> >  {
> >  	const struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> >  	const struct hugepage_file *hp = NULL;
> >-	unsigned num_hp = 0;
> >+	unsigned num_hp = 0, mapped_hp = 0;
> >  	unsigned i, s = 0; /* s used to track the segment number */
> >  	off_t size;
> >  	int fd, fd_zero = -1, fd_hugepage = -1;
> >@@ -1486,14 +1486,12 @@ rte_eal_hugepage_attach(void)
> >  		goto error;
> >  	}
> >-	num_hp = size / sizeof(struct hugepage_file);
> >-	RTE_LOG(DEBUG, EAL, "Analysing %u files\n", num_hp);
> >-
> >  	s = 0;
> >  	while (s < RTE_MAX_MEMSEG && mcfg->memseg[s].len > 0){
> >  		void *addr, *base_addr;
> >  		uintptr_t offset = 0;
> >  		size_t mapping_size;
> >+		unsigned int index;
> >  #ifdef RTE_LIBRTE_IVSHMEM
> >  		/*
> >  		 * if segment has ioremap address set, it's an IVSHMEM segment and
> >@@ -1504,6 +1502,8 @@ rte_eal_hugepage_attach(void)
> >  			continue;
> >  		}
> >  #endif
> >+		num_hp = mcfg->memseg[s].len / mcfg->memseg[s].hugepage_sz;
> >+		RTE_LOG(DEBUG, EAL, "Analysing %u files in segment %u\n", num_hp, s);
> >  		/*
> >  		 * free previously mapped memory so we can map the
> >  		 * hugepages into the space
> >@@ -1514,18 +1514,23 @@ rte_eal_hugepage_attach(void)
> >  		/* find the hugepages for this segment and map them
> >  		 * we don't need to worry about order, as the server sorted the
> >  		 * entries before it did the second mmap of them */
> >+#ifdef RTE_ARCH_PPC_64
> >+		for (i = num_hp-1; i < num_hp && offset < mcfg->memseg[s].len; i--){
> >+#else
> >  		for (i = 0; i < num_hp && offset < mcfg->memseg[s].len; i++){
> >-			if (hp[i].memseg_id == (int)s){
> >-				fd = open(hp[i].filepath, O_RDWR);
> >+#endif
> >+			index = i + mapped_hp;
> >+			if (hp[index].memseg_id == (int)s){
> >+				fd = open(hp[index].filepath, O_RDWR);
> >  				if (fd < 0) {
> >  					RTE_LOG(ERR, EAL, "Could not open %s\n",
> >-						hp[i].filepath);
> >+						hp[index].filepath);
> >  					goto error;
> >  				}
> >  #ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
> >-				mapping_size = hp[i].size * hp[i].repeated;
> >+				mapping_size = hp[index].size * hp[index].repeated;
> >  #else
> >-				mapping_size = hp[i].size;
> >+				mapping_size = hp[index].size;
> >  #endif
> >  				addr = mmap(RTE_PTR_ADD(base_addr, offset),
> >  						mapping_size, PROT_READ | PROT_WRITE,
> >@@ -1534,7 +1539,7 @@ rte_eal_hugepage_attach(void)
> >  				if (addr == MAP_FAILED ||
> >  						addr != RTE_PTR_ADD(base_addr, offset)) {
> >  					RTE_LOG(ERR, EAL, "Could not mmap %s\n",
> >-						hp[i].filepath);
> >+						hp[index].filepath);
> >  					goto error;
> >  				}
> >  				offset+=mapping_size;
> >@@ -1543,6 +1548,7 @@ rte_eal_hugepage_attach(void)
> >  		RTE_LOG(DEBUG, EAL, "Mapped segment %u of size 0x%llx\n", s,
> >  				(unsigned long long)mcfg->memseg[s].len);
> >  		s++;
> >+		mapped_hp += num_hp;
> >  	}
> >  	/* unmap the hugepage config file, since we are done using it */
> >  	munmap((void *)(uintptr_t)hp, size);
> 


Thread overview: 13+ messages
2016-03-07 14:13 Gowrishankar
2016-03-17  5:05 ` gowrishankar
2016-03-22 11:36   ` Thomas Monjalon
2016-03-22 12:11     ` Sergio Gonzalez Monroy
2016-03-22 16:35 ` Sergio Gonzalez Monroy
2016-03-22 17:10   ` Bruce Richardson [this message]
2016-05-20  3:03     ` Chao Zhu
2016-05-20  8:01       ` Sergio Gonzalez Monroy
2016-05-20  8:41         ` Chao Zhu
2016-05-20 10:25           ` Sergio Gonzalez Monroy
2017-02-15  8:51             ` Thomas Monjalon
2017-02-16  7:22               ` Chao Zhu
2018-04-15 12:28                 ` Thomas Monjalon
