From: "Chao Zhu" <chaozhu@linux.vnet.ibm.com>
To: "'Jonas Pfefferle1'" <JPF@zurich.ibm.com>
Cc: "'Burakov, Anatoly'" <anatoly.burakov@intel.com>,
<bruce.richardson@intel.com>, <dev@dpdk.org>
Subject: Re: [dpdk-dev] Huge mapping secondary process linux
Date: Thu, 9 Nov 2017 11:08:36 +0800
Message-ID: <000a01d35908$0a362f50$1ea28df0$@linux.vnet.ibm.com>
In-Reply-To: <OFDDEE091E.3E3A55F9-ONC12581D1.0037299E-C12581D1.00385FED@notes.na.collabserv.com>
From: Jonas Pfefferle1 [mailto:JPF@zurich.ibm.com]
Sent: 7 November 2017 18:16
To: Chao Zhu <chaozhu@linux.vnet.ibm.com>
Cc: 'Burakov, Anatoly' <anatoly.burakov@intel.com>; bruce.richardson@intel.com; dev@dpdk.org
Subject: RE: [dpdk-dev] Huge mapping secondary process linux
"Chao Zhu" <chaozhu@linux.vnet.ibm.com> wrote on 11/07/2017 09:25:26 AM:
> From: "Chao Zhu" <chaozhu@linux.vnet.ibm.com>
> To: "'Jonas Pfefferle1'" <JPF@zurich.ibm.com>, "'Burakov, Anatoly'" <anatoly.burakov@intel.com>
> Cc: <bruce.richardson@intel.com>, <dev@dpdk.org>
> Date: 11/07/2017 11:00 AM
> Subject: RE: [dpdk-dev] Huge mapping secondary process linux
>
>
>
> From: Jonas Pfefferle1 [mailto:JPF@zurich.ibm.com]
> Sent: 28 October 2017 3:23
> To: Burakov, Anatoly <anatoly.burakov@intel.com>
> Cc: bruce.richardson@intel.com; chaozhu@linux.vnet.ibm.com; dev@dpdk.org
> Subject: Re: [dpdk-dev] Huge mapping secondary process linux
>
> "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote on 27/10/2017 18:00:27:
>
> > From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
> > To: Jonas Pfefferle1 <JPF@zurich.ibm.com>
> > Cc: bruce.richardson@intel.com, chaozhu@linux.vnet.ibm.com, dev@dpdk.org
> > Date: 27/10/2017 18:00
> > Subject: Re: [dpdk-dev] Huge mapping secondary process linux
> >
> > On 27-Oct-17 4:16 PM, Jonas Pfefferle1 wrote:
> > > "dev" <dev-bounces@dpdk.org> wrote on 10/27/2017 04:58:01 PM:
> > >
> > > > From: "Jonas Pfefferle1" <JPF@zurich.ibm.com>
> > > > To: "Burakov, Anatoly" <anatoly.burakov@intel.com>
> > > > Cc: bruce.richardson@intel.com, chaozhu@linux.vnet.ibm.com, dev@dpdk.org
> > > > Date: 10/27/2017 04:58 PM
> > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux
> > > > Sent by: "dev" <dev-bounces@dpdk.org>
> > > >
> > > >
> > > > "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote on 10/27/2017 04:44:52 PM:
> > > >
> > > > > From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
> > > > > To: Jonas Pfefferle1 <JPF@zurich.ibm.com>
> > > > > Cc: bruce.richardson@intel.com, chaozhu@linux.vnet.ibm.com, dev@dpdk.org
> > > > > Date: 10/27/2017 04:45 PM
> > > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux
> > > > >
> > > > > On 27-Oct-17 3:28 PM, Jonas Pfefferle1 wrote:
> > > > > > "Burakov, Anatoly" <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com> > wrote on 10/27/2017
> > > > > > 04:06:44 PM:
> > > > > >
> > > > > > Â > From: "Burakov, Anatoly" <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com> >
> > > > > > Â > To: Jonas Pfefferle1 <JPF@zurich.ibm.com <mailto:JPF@zurich.ibm.com> >, dev@dpdk.org <mailto:dev@dpdk.org>
> > > > > > Â > Cc: chaozhu@linux.vnet.ibm.com <mailto:chaozhu@linux.vnet.ibm.com> , bruce.richardson@intel.com <mailto:bruce.richardson@intel.com>
> > > > > > Â > Date: 10/27/2017 04:06 PM
> > > > > > Â > Subject: Re: [dpdk-dev] Huge mapping secondary process linux
> > > > > > Â >
> > > > > > Â > On 27-Oct-17 1:43 PM, Jonas Pfefferle1 wrote:
> > > > > > > > >
> > > > > > > > > Hi @all,
> > > > > > > > >
> > > > > > > > > I'm trying to make sense of the hugepage memory mappings in
> > > > > > > > > librte_eal/linuxapp/eal/eal_memory.c:
> > > > > > > > > * In rte_eal_hugepage_attach (line 1347), when we try to do a private
> > > > > > > > >   mapping on /dev/zero (line 1393), why do we not use MAP_FIXED if we
> > > > > > > > >   need the addresses to be identical to those of the primary process?
> > > > > > > > > * On POWER we have this weird business going on where we use MAP_HUGETLB
> > > > > > > > >   because, according to this commit:
> > > > > > > > >
> > > > > > > > > commit 284ae3e9ff9a92575c28c858efd2c85c8de6d440
> > > > > > > > > Author: Chao Zhu <chaozhu@linux.vnet.ibm.com>
> > > > > > > > > Date:   Thu Apr 6 15:36:09 2017 +0530
> > > > > > > > >
> > > > > > > > >     eal/ppc: fix mmap for memory initialization
> > > > > > > > >
> > > > > > > > >     On IBM POWER platform, when mapping /dev/zero file to hugepage memory
> > > > > > > > >     space, mmap will not respect the requested address hint. This will cause
> > > > > > > > >     the memory initialization for the second process fails. This patch adds
> > > > > > > > >     the required mmap flags to make it work. Beside this, users need to set
> > > > > > > > >     the nr_overcommit_hugepages to expand the VA range. When
> > > > > > > > >     doing the initialization, users need to set both nr_hugepages and
> > > > > > > > >     nr_overcommit_hugepages to the same value, like 64, 128, etc.
> > > > > > > > >
> > > > > > > > > mmap address hints are not respected. Looking at the mmap code in the
> > > > > > > > > kernel, this is not entirely true; however, under some circumstances the
> > > > > > > > > hint can be ignored (
> > > > > > > > > https://urldefense.proofpoint.com/v2/url?u=http-3A__elixir.free-2Delectrons.com_linux_latest_source_arch_powerpc_mm_mmap.c-23L103&d=DwICaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=rOdXhRsgn8Iur7bDE0vgwvo6TC8OpoDN-pXjigIjRW0&m=cttQcHlAYixhsYS3lz-BAdEeg4dpbwGdPnj2R3I8Do0&s=Gp0TIjUtIed05Jgb7XnlocpCYZdFXZXiH0LqIWiNMhA&e=
> > > > > > > > > ). However, I believe we can remove the extra case for PPC if we use
> > > > > > > > > MAP_FIXED when doing the secondary process mappings, because we need them
> > > > > > > > > to be identical anyway. We could also use MAP_FIXED when doing the primary
> > > > > > > > > process mappings, or in get_virtual_area, if we want to have any guarantees
> > > > > > > > > when specifying a base address. Any thoughts?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Jonas
> > > > > > > > >
> > > > > > > > hi Jonas,
> > > > > > > >
> > > > > > > > MAP_FIXED is not used because it's dangerous, it unmaps anything that is
> > > > > > > > already mapped into that space. We would rather know that we can't map
> > > > > > > > something than unwittingly unmap something that was mapped before.
> > > > > >
> > > > > > > Ok, I see. Maybe we can add a check to the primary process's memory
> > > > > > > mappings whether the hint has been respected or not? At least warn if it
> > > > > > > hasn't.
> > > > >
> > > > > Hi Jonas,
> > > > >
> > > > > > I'm unfamiliar with POWER platform, so i'm afraid you'd have to explain
> > > > > > a bit more what you mean by "hint has been respected" :)
> > > >
> > > > Hi Anatoly,
> > > >
> > > > What I meant was the mmap address hint:
> > > >
> > > > "If addr is not NULL, then the kernel takes it as a hint
> > > > Â about where to place the mapping; on Linux, the mapping will be
> > > > Â created at a nearby page boundary."
> > > >
> > > > This is actually not true on POWER. It can happen that the address
> > > hint is
> > > > ignored and you get any address back that fits your mapping.
> > > >
> > > > Thanks,
> > > > Jonas
> > >
> > > > Actually looking through the kernel code this is also not guaranteed on x86.
> > > > (https://urldefense.proofpoint.com/v2/url?u=http-3A__elixir.free-2Delectrons.com_linux_latest_source_arch_x86_kernel_sys-5Fx86-5F64.c-23L165&d=DwID-g&c=jf_iaSHvJObTbx-siA1ZOg&r=rOdXhRsgn8Iur7bDE0vgwvo6TC8OpoDN-pXjigIjRW0&m=iqakzG7nSXLfvDHyS9IV5E9DWPnNcv19zcsl3MKMdvI&s=VqzZpcTaCUMmNieZ3WyUw-jsnNP-hAcW487Mumv6xPw&e=)
> > >
> > > So in any case the address hint can be ignored by the kernel and you get
> > > any address that fits your mapping.
> > > My suggestion is to check when we do the initial mapping in
> > > get_virtual_area if the hint was respected or not, i.e. if the returned
> > > address == PAGE_ALIGN(address_hint).
> > >
> >
> > I'm not sure i see the issue here. So, just to make sure i understand
> > things correctly:
> >
> > Whenever we don't request a specific base address through base_address
> > EAL parameter, none of this matters - we always ask for memory in
> > arbitrary memory locations, correct?
> >
> > It's also not an issue with secondary processes because we do check
> > returned mmap address to see whether it's the same as we requested, correct?
> >
> > It's only whenever we *do* specify a base_address, we provide an address
> > hint to mmap to, but we don't check if the address we got from mmap is
> > one in the vicinity of our requested base address, correct? We don't
> > check, and the kernel can ignore address hint, so we're not guaranteed
> > to respect the base_address flag.
> >
> > I'm not sure this is a serious issue, because as far as i'm concerned,
> > this flag is advisory - we only promise to *attempt* to map things at
> > that particular address, not that it will succeed. If the kernel simply
> > cannot find an address to satisfy our address hint, or ignores it for
> > other reasons - well, tough, nothing we can do about that. I'm not sure
> > putting a check like this, where we can't even predict an "expected"
> > address is a good idea.
> >
> > Am i getting this right?
>
> The problem is that when we specify a base address, we want it to be used. If it
> is not respected, we basically end up in the same situation as if we had never
> specified it. This very likely leads to not being able to run a secondary
> process, because we will not be able to map the addresses from our primary
> process, and that is why we introduced the base address parameter in the
> first place.
>
> >
> > --
> > Thanks,
> > Anatoly
> >
> The reason why I put the patch there is that when mapping hugepages
> on POWER, the kernel will never respect the address hints when doing
> mmap unless we expand the address space or unmap all the hugepages.
> This is a big difference compared with x86, and it affects the
> mapping of the secondary process. I agree that the hint is
> advisory. I just want to see if there are better solutions.
This is not true. I looked through the kernel code and the address
hint is treated almost the same on both platforms:
PPC: https://elixir.free-electrons.com/linux/latest/source/arch/powerpc/mm/mmap.c#L143
Line 169/170
x86: https://elixir.free-electrons.com/linux/latest/source/arch/x86/kernel/sys_x86_64.c#L165
Line 189/190
The only thing that might differ is the virtual address layout
(e.g. due to different page sizes), and that might lead to the same
value for base-virtaddr not working on both x86 and POWER.
However, I tested with different address hints, and you can easily
find addresses where the address hint is indeed respected.
That is also why I sent in a patch to remove the HUGETLB flags on
the mmap.
Thanks,
Jonas
You can take a look at this. https://bugzilla.linux.ibm.com/show_bug.cgi?id=141628
It’s quite interesting.
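
For reference, the check discussed in this thread (compare the address returned by mmap with the page-aligned hint, and in the secondary-process case give up rather than use MAP_FIXED) might be sketched roughly as below. This is only an illustration under stated assumptions, not the code in eal_memory.c: the helper names page_align_up, map_with_hint and map_exact_or_fail are hypothetical, and the sketch maps ordinary anonymous memory instead of /dev/zero or hugepages.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round addr up to the next multiple of the system page size. */
static uintptr_t page_align_up(uintptr_t addr)
{
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    return (addr + page - 1) & ~(page - 1);
}

/*
 * Hypothetical helper (not a DPDK function): map len bytes of anonymous
 * memory, passing hint to mmap as an address hint WITHOUT MAP_FIXED.
 * *honored is set to true only if the mapping landed exactly at the
 * page-aligned hint. Returns the mapping, or MAP_FAILED on error.
 */
static void *map_with_hint(void *hint, size_t len, bool *honored)
{
    assert(len > 0 && honored != NULL);
    void *addr = mmap(hint, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    *honored = (addr != MAP_FAILED) &&
               ((uintptr_t)addr == page_align_up((uintptr_t)hint));
    return addr;
}

/*
 * Hypothetical secondary-process style attach: the mapping is only useful
 * at exactly the wanted address. If the kernel placed it elsewhere, undo
 * the mapping and report failure instead of clobbering existing mappings
 * the way MAP_FIXED would.
 */
static void *map_exact_or_fail(void *wanted, size_t len)
{
    bool honored;
    void *addr = map_with_hint(wanted, len, &honored);
    if (addr == MAP_FAILED)
        return NULL;
    if (!honored) {
        munmap(addr, len);
        return NULL;
    }
    return addr;
}
```

A primary process could call map_with_hint with the requested base address and log a warning when honored comes back false, while a secondary process would treat a NULL return from map_exact_or_fail as a fatal attach error.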
Thread overview: 14+ messages
2017-10-27 12:43 Jonas Pfefferle1
2017-10-27 14:06 ` Burakov, Anatoly
2017-10-27 14:28 ` Jonas Pfefferle1
2017-10-27 14:44 ` Burakov, Anatoly
2017-10-27 14:58 ` Jonas Pfefferle1
2017-10-27 15:16 ` Jonas Pfefferle1
2017-10-27 16:00 ` Burakov, Anatoly
2017-10-27 19:22 ` Jonas Pfefferle1
2017-11-07 8:25 ` Chao Zhu
2017-11-07 10:15 ` Jonas Pfefferle1
2017-11-09 3:08 ` Chao Zhu [this message]
2017-11-09 9:54 ` Jonas Pfefferle1
2017-10-27 15:48 ` Tan, Jianfeng
2017-10-27 16:06 ` Burakov, Anatoly