From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28])
    by dpdk.org (Postfix) with ESMTP id 6B76A2BF4
    for ; Fri, 29 Mar 2019 14:36:12 +0100 (CET)
Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12])
    (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
    (No client certificate requested)
    by mx1.redhat.com (Postfix) with ESMTPS id 6489389C33;
    Fri, 29 Mar 2019 13:36:11 +0000 (UTC)
Received: from [10.36.112.59] (ovpn-112-59.ams2.redhat.com [10.36.112.59])
    by smtp.corp.redhat.com (Postfix) with ESMTPS id 15C37619F5;
    Fri, 29 Mar 2019 13:35:59 +0000 (UTC)
To: "Burakov, Anatoly" , Thomas Monjalon
Cc: David Marchand , dev , John McNamara , Marko Kovacevic , iain.barker@oracle.com, edwin.leung@oracle.com
References: <07f664c33ddedaa5dcfe82ecb97d931e68b7e33a.1550855529.git.anatoly.burakov@intel.com> <1682850.JO3elT0QtZ@xps> <3255576.YcZt162MTL@xps>
From: Maxime Coquelin
Message-ID:
Date: Fri, 29 Mar 2019 14:35:58 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.1
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.27]); Fri, 29 Mar 2019 13:36:11 +0000 (UTC)
Subject: Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 29 Mar 2019 13:36:12 -0000

On 3/29/19 2:24 PM, Burakov, Anatoly wrote:
> On 29-Mar-19 12:40 PM, Thomas Monjalon wrote:
>> 29/03/2019 13:05, Burakov, Anatoly:
>>> On 29-Mar-19 11:34 AM, Thomas Monjalon wrote:
>>>> 29/03/2019 11:33, Burakov, Anatoly:
>>>>> On 29-Mar-19 9:50 AM, David Marchand wrote:
>>>>>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov wrote:
>>>>>>
>>>>>>       Due to internal glibc limitations [1], DPDK may exhaust internal
>>>>>>       file descriptor limits when using smaller page sizes, which
>>>>>>       results in inability to use system calls such as select() by
>>>>>>       user applications.
>>>>>>
>>>>>>       While the problem can be worked around using the
>>>>>>       --single-file-segments option, it does not work if --legacy-mem
>>>>>>       mode is also used. Add a (yet another) EAL flag to disable
>>>>>>       storing fd's internally. This will sacrifice compatibility with
>>>>>>       Virtio with vhost-backend, but at least select() and friends
>>>>>>       will work.
>>>>>>
>>>>>>       [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html
>>>>>>
>>>>>>
>>>>>> Sorry, I am a bit lost and I never took the time to look into the new
>>>>>> memory allocation system.
>>>>>> This gives the impression that we are accumulating workarounds,
>>>>>> between legacy-mem, single-file-segments, now no-seg-fds.
>>>>>
>>>>> Yep. I don't like this any more than you do, but I think there are
>>>>> users of all of these, so we can't just drop them willy-nilly. My
>>>>> great hope was that by now everyone would move on to use VFIO so
>>>>> legacy mem wouldn't be needed (the only reason it exists is to provide
>>>>> compatibility for use cases where lots of IOVA-contiguous memory is
>>>>> required, and VFIO cannot be used), but apparently that is too much to
>>>>> ask :/
>>>>>
>>>>>>
>>>>>> IIUC, everything revolves around the need for per-page locks.
>>>>>> Can you summarize why we need them?
>>>>>
>>>>> The short answer is multiprocess. We have to be able to map and unmap
>>>>> pages individually, and for that we need to be sure that we can, in
>>>>> fact, remove a page because no one else uses it. We also need to store
>>>>> fd's because virtio with vhost-user backend needs them to work,
>>>>> because it relies on sharing memory between processes using fd's.

I guess you mean virtio-user.
Have you looked at how QEMU shares guest memory with an external
process such as a vhost-user backend? It works quite well with 2MB pages,
even with large VMs.

>>>>
>>>> It's a pity to add an option to work around a limitation of a corner
>>>> case.
>>>> It adds complexity that we will have to support forever,
>>>> and it's not even perfect because of vhost.
>>>>
>>>> Might there be another solution?
>>>>
>>>
>>> If there is one, I'm all ears. I don't see any solutions aside from
>>> adding limitations.
>>>
>>> For example, we could drop the single/multi file segments mode and just
>>> make single file segments a default and the only available mode, but
>>> this has certain risks because older kernels do not support fallocate()
>>> on hugetlbfs.
>>>
>>> We could further draw a line in the sand, and say that, for example,
>>> 19.11 (or 20.11) will not have legacy mem mode, and everyone should use
>>> VFIO by now and if you don't it's your own fault.
>>>
>>> We could also cut down on the number of fd's we use in single-file
>>> segments mode by not using locks and simply deleting pages in the
>>> primary, but yanking out hugepages from under secondaries' feet makes me
>>> feel uneasy, even if technically by the time that happens, they're not
>>> supposed to be used anyway. This could mean that the patch is no longer
>>> necessary because we don't use that many fd's any more.
>>
>> This last option is interesting. Is it realistic?
>>
>
> I can do it in current release cycle, but I'm not sure if it's too late
> to do such changes. I guess it's OK since the validation cycle is just
> starting? I'll throw something together and see if it crashes and burns.
>

Reducing the number of FDs is really important IMHO, as the application
using the DPDK library could also need several FDs for other purposes.
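
To put rough numbers on the select() problem described in the quoted
commit message, here is a minimal sketch in C. It assumes one descriptor
is kept per mapped hugepage when segment fd's are stored (an assumption
made for illustration; the exact count depends on the EAL mode in use).
The function names are hypothetical and are not part of the DPDK API;
only FD_SETSIZE from <sys/select.h> is a real interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/select.h>

/*
 * Assumption for illustration: one descriptor is held per mapped page.
 * Returns the number of descriptors needed to cover mem_bytes of memory
 * with pages of page_bytes each, rounding a partial page up.
 */
uint64_t fds_for_memory(uint64_t mem_bytes, uint64_t page_bytes)
{
	assert(page_bytes != 0);
	return mem_bytes / page_bytes + (mem_bytes % page_bytes != 0);
}

/*
 * An fd_set can only represent descriptors numbered 0..FD_SETSIZE-1,
 * so select() cannot watch anything numbered FD_SETSIZE or higher.
 */
bool fd_usable_with_select(int fd)
{
	return fd >= 0 && fd < FD_SETSIZE;
}

/*
 * Hypothetical check: the kernel hands out the lowest free descriptor
 * number, so if a library already holds nr_lib_fds descriptors, the next
 * one the application opens is numbered at least nr_lib_fds (assuming
 * nothing was closed in between). It is usable with select() only if
 * that number is still below FD_SETSIZE.
 */
bool app_fds_stay_selectable(uint64_t nr_lib_fds)
{
	return nr_lib_fds < (uint64_t)FD_SETSIZE;
}
```

Under that assumption, 4 GB backed by 2 MB pages gives
fds_for_memory(4ULL << 30, 2ULL << 20) == 2048 descriptors, while the
same memory backed by 1 GB pages needs only 4. Where FD_SETSIZE is 1024
(the usual glibc value), the 2 MB case pushes every later application
descriptor out of select()'s reach, which is why fewer FDs matter so
much here.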