From: "Chao Zhu"
To: "'Jonas Pfefferle1'"
Cc: "'Burakov, Anatoly'"
References: <921d836f-87dc-b017-2186-e70905f61612@intel.com> <003c01d357a1$f82ac4e0$e8804ea0$@linux.vnet.ibm.com>
Date: Thu, 9 Nov 2017 11:08:36 +0800
Message-Id: <000a01d35908$0a362f50$1ea28df0$@linux.vnet.ibm.com>
Subject: Re: [dpdk-dev] Huge mapping secondary process linux
List-Id: DPDK patches and discussions

From: Jonas Pfefferle1 [mailto:JPF@zurich.ibm.com]
Sent: 7 November 2017 18:16
To: Chao Zhu
Cc: 'Burakov, Anatoly'; bruce.richardson@intel.com; dev@dpdk.org
Subject: RE: [dpdk-dev] Huge mapping secondary process linux

"Chao Zhu" wrote on 11/07/2017 09:25:26 AM:

> From: "Chao Zhu"
> To: "'Jonas Pfefferle1'", "'Burakov, Anatoly'"
> Date: 11/07/2017 11:00 AM
> Subject: RE: [dpdk-dev] Huge mapping secondary process linux
>
> From: Jonas Pfefferle1 [mailto:JPF@zurich.ibm.com]
> Sent: 28 October 2017 3:23
> To: Burakov, Anatoly
> Cc: bruce.richardson@intel.com; chaozhu@linux.vnet.ibm.com; dev@dpdk.org
> Subject: Re: [dpdk-dev] Huge mapping secondary process linux
>
> "Burakov, Anatoly" wrote on 27/10/2017 18:00:27:
>
> > From: "Burakov, Anatoly"
> > To: Jonas Pfefferle1
> > Cc: bruce.richardson@intel.com, chaozhu@linux.vnet.ibm.com, dev@dpdk.org
> > Date: 27/10/2017 18:00
> > Subject: Re: [dpdk-dev] Huge mapping secondary process linux
> >
> > On 27-Oct-17 4:16 PM, Jonas Pfefferle1 wrote:
> > > "dev" wrote on 10/27/2017 04:58:01 PM:
> > >
> > > > From: "Jonas Pfefferle1"
> > > > To: "Burakov, Anatoly"
> > > > Cc: bruce.richardson@intel.com, chaozhu@linux.vnet.ibm.com, dev@dpdk.org
> > > > Date: 10/27/2017 04:58 PM
> > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux
> > > > Sent by: "dev"
> > > >
> > > > "Burakov, Anatoly" wrote on 10/27/2017 04:44:52 PM:
> > > >
> > > > > From: "Burakov, Anatoly"
> > > > > To: Jonas Pfefferle1
> > > > > Cc: bruce.richardson@intel.com, chaozhu@linux.vnet.ibm.com, dev@dpdk.org
> > > > > Date: 10/27/2017 04:45 PM
> > > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux
> > > > >
> > > > > On 27-Oct-17 3:28 PM, Jonas Pfefferle1 wrote:
> > > > > > "Burakov, Anatoly" wrote on 10/27/2017 04:06:44 PM:
> > > > > >
> > > > > > > From: "Burakov, Anatoly"
> > > > > > > To: Jonas Pfefferle1, dev@dpdk.org
> > > > > > > Cc: chaozhu@linux.vnet.ibm.com, bruce.richardson@intel.com
> > > > > > > Date: 10/27/2017 04:06 PM
> > > > > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux
> > > > > > >
> > > > > > > On 27-Oct-17 1:43 PM, Jonas Pfefferle1 wrote:
> > > > > > > >
> > > > > > > > Hi @all,
> > > > > > > >
> > > > > > > > I'm trying to make sense of the hugepage memory mappings in
> > > > > > > > librte_eal/linuxapp/eal/eal_memory.c:
> > > > > > > > * In rte_eal_hugepage_attach (line 1347) when we try to do a private
> > > > > > > >   mapping on /dev/zero (line 1393) why do we not use MAP_FIXED if we
> > > > > > > >   need the addresses to be identical with the primary process?
> > > > > > > > * On POWER we have this weird business going on where we use
> > > > > > > >   MAP_HUGETLB because according to this commit:
> > > > > > > >
> > > > > > > > commit 284ae3e9ff9a92575c28c858efd2c85c8de6d440
> > > > > > > > Author: Chao Zhu
> > > > > > > > Date:   Thu Apr 6 15:36:09 2017 +0530
> > > > > > > >
> > > > > > > >     eal/ppc: fix mmap for memory initialization
> > > > > > > >
> > > > > > > >     On IBM POWER platform, when mapping /dev/zero file to hugepage memory
> > > > > > > >     space, mmap will not respect the requested address hint. This will
> > > > > > > >     cause the memory initialization for the second process fails. This
> > > > > > > >     patch adds the required mmap flags to make it work. Beside this, users
> > > > > > > >     need to set the nr_overcommit_hugepages to expand the VA range. When
> > > > > > > >     doing the initialization, users need to set both nr_hugepages and
> > > > > > > >     nr_overcommit_hugepages to the same value, like 64, 128, etc.
> > > > > > > >
> > > > > > > > mmap address hints are not respected.
> > > > > > > > Looking at the mmap code in the kernel this is not true entirely
> > > > > > > > however under some circumstances the hint can be ignored
> > > > > > > > (http://elixir.free-electrons.com/linux/latest/source/arch/powerpc/mm/mmap.c#L103).
> > > > > > > > However I believe we can remove the extra case for PPC if we use
> > > > > > > > MAP_FIXED when doing the secondary process mappings because we need
> > > > > > > > them to be identical anyway. We could also use MAP_FIXED when doing
> > > > > > > > the primary process mappings resp. get_virtual_area if we want to
> > > > > > > > have any guarantees when specifying a base address. Any thoughts?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jonas
> > > > > > >
> > > > > > > hi Jonas,
> > > > > > >
> > > > > > > MAP_FIXED is not used because it's dangerous, it unmaps anything that is
> > > > > > > already mapped into that space. We would rather know that we can't map
> > > > > > > something than unwittingly unmap something that was mapped before.
> > > > > >
> > > > > > Ok, I see. Maybe we can add a check to the primary process's memory
> > > > > > mappings whether the hint has been respected or not? At least warn if it
> > > > > > hasn't.
> > > > >
> > > > > Hi Jonas,
> > > > >
> > > > > I'm unfamiliar with POWER platform, so i'm afraid you'd have to explain
> > > > > a bit more what you mean by "hint has been respected" :)
> > > >
> > > > Hi Anatoly,
> > > >
> > > > What I meant was the mmap address hint:
> > > >
> > > > "If addr is not NULL, then the kernel takes it as a hint
> > > >  about where to place the mapping; on Linux, the mapping will be
> > > >  created at a nearby page boundary."
> > > >
> > > > This is actually not true on POWER. It can happen that the address hint is
> > > > ignored and you get any address back that fits your mapping.
> > > >
> > > > Thanks,
> > > > Jonas
> > >
> > > Actually looking through the kernel code this is also not guaranteed on x86.
> > > (http://elixir.free-electrons.com/linux/latest/source/arch/x86/kernel/sys_x86_64.c#L165)
> > >
> > > So in any case the address hint can be ignored by the kernel and you get
> > > any address that fits your mapping.
> > > My suggestion is to check when we do the initial mapping in
> > > get_virtual_area if the hint was respected or not, i.e. if the returned
> > > address == PAGE_ALIGN(address_hint).
> > >
> >
> > I'm not sure i see the issue here. So, just to make sure i understand
> > things correctly:
> >
> > Whenever we don't request a specific base address through base_address
> > EAL parameter, none of this matters - we always ask for memory in
> > arbitrary memory locations, correct?
> >
> > It's also not an issue with secondary processes because we do check
> > returned mmap address to see whether it's the same as we requested, correct?
> >
> > It's only whenever we *do* specify a base_address, we provide an address
> > hint to mmap to, but we don't check if the address we got from mmap is
> > one in the vicinity of our requested base address, correct? We don't
> > check, and the kernel can ignore the address hint, so we're not guaranteed
> > to respect the base_address flag.
> >
> > I'm not sure this is a serious issue, because as far as i'm concerned,
> > this flag is advisory - we only promise to *attempt* to map things at
> > that particular address, not that it will succeed. If the kernel simply
> > cannot find an address to satisfy our address hint, or ignores it for
> > other reasons - well, tough, nothing we can do about that. I'm not sure
> > putting a check like this, where we can't even predict an "expected"
> > address is a good idea.
> >
> > Am i getting this right?
>
> The problem is when we specify a base address we want it to be used. If it is
> not respected we basically end up as if we had never specified it.
> This very likely leads to not being able to run a secondary process because
> we will not be able to map the addresses from our primary process
> and that is why we introduced the base address parameter in the first place.
>
> >
> > --
> > Thanks,
> > Anatoly
> >
>
> The reason why I put the patch there is that when mapping hugepage
> on POWER, the kernel will never respect the address hints when doing
> mmap unless we expand the address space or unmap all the hugepages.
> This is a big difference when compared with x86. And it affects the
> mapping of the secondary process. I agree that the hints are
> advisory. Just want to see if there are better solutions.

This is not true.
I looked through the kernel code and the address hint is treated
almost the same on both platforms:

PPC: https://elixir.free-electrons.com/linux/latest/source/arch/powerpc/mm/mmap.c#L143
     Line 169/170
x86: https://elixir.free-electrons.com/linux/latest/source/arch/x86/kernel/sys_x86_64.c#L165
     Line 189/190

The only thing that might differ is the virtual address layout (e.g. due
to different page size etc) and that might lead to the same value for
base-virtaddr not working on both x86 and POWER. However I tested with
different address hints and you can easily find addresses where the
address hint is indeed respected.
That is also why I sent in a patch to remove the HUGETLB flags on the mmap.

Thanks,
Jonas

You can take a look at this.
https://bugzilla.linux.ibm.com/show_bug.cgi?id=141628
It's quite interesting.
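
For reference, the check proposed earlier in the thread (compare the address returned by mmap with PAGE_ALIGN of the hint, and warn on a mismatch) could look roughly like the sketch below. It is only an illustration under stated assumptions: page_align_up and reserve_with_hint are hypothetical names, not DPDK functions, and the sketch uses a plain anonymous mapping with the base page size rather than the hugepage-backed mapping that eal_memory.c sets up.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round addr up to the next multiple of page_size (must be a power of two). */
static uintptr_t page_align_up(uintptr_t addr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (addr + page_size - 1) & ~(page_size - 1);
}

/*
 * Hypothetical helper, not part of DPDK: map len bytes with hint as the
 * requested address, without MAP_FIXED, so nothing already mapped is
 * replaced. *respected is set to true only if a non-NULL hint was given
 * and the kernel placed the mapping at the page-aligned hint.
 * Returns MAP_FAILED if mmap itself fails.
 */
static void *reserve_with_hint(void *hint, size_t len, bool *respected)
{
    /* Assumption: base page size; a hugepage mapping would use its own size. */
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    void *addr = mmap(hint, len, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    *respected = false;
    if (addr == MAP_FAILED)
        return MAP_FAILED;

    if (hint != NULL) {
        *respected =
            (uintptr_t)addr == page_align_up((uintptr_t)hint, page_size);
        if (!*respected)
            fprintf(stderr, "warning: requested %p but mmap returned %p\n",
                    hint, addr);
    }
    return addr;
}
```

A caller would pass the base address it wants as hint and decide, based on respected, whether to continue, warn, or unmap and retry with another address. Because MAP_FIXED is not used, the concern raised above about silently unmapping an existing mapping does not arise.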