From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <maxime.coquelin@redhat.com>
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28])
 by dpdk.org (Postfix) with ESMTP id 6B76A2BF4
 for <dev@dpdk.org>; Fri, 29 Mar 2019 14:36:12 +0100 (CET)
Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com
 [10.5.11.12])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.redhat.com (Postfix) with ESMTPS id 6489389C33;
 Fri, 29 Mar 2019 13:36:11 +0000 (UTC)
Received: from [10.36.112.59] (ovpn-112-59.ams2.redhat.com [10.36.112.59])
 by smtp.corp.redhat.com (Postfix) with ESMTPS id 15C37619F5;
 Fri, 29 Mar 2019 13:35:59 +0000 (UTC)
To: "Burakov, Anatoly" <anatoly.burakov@intel.com>,
 Thomas Monjalon <thomas@monjalon.net>
Cc: David Marchand <david.marchand@redhat.com>, dev <dev@dpdk.org>,
 John McNamara <john.mcnamara@intel.com>,
 Marko Kovacevic <marko.kovacevic@intel.com>, iain.barker@oracle.com,
 edwin.leung@oracle.com
References: <07f664c33ddedaa5dcfe82ecb97d931e68b7e33a.1550855529.git.anatoly.burakov@intel.com>
 <1682850.JO3elT0QtZ@xps> <b6ce21eb-dae1-7858-a03a-6a5c1b6a35eb@intel.com>
 <3255576.YcZt162MTL@xps> <af1c5ca2-b309-f17a-fda5-88942e4090ac@intel.com>
From: Maxime Coquelin <maxime.coquelin@redhat.com>
Message-ID: <c98e36ad-90dc-f8cf-01a2-01bbc7f4a86a@redhat.com>
Date: Fri, 29 Mar 2019 14:35:58 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
 Thunderbird/60.5.1
MIME-Version: 1.0
In-Reply-To: <af1c5ca2-b309-f17a-fda5-88942e4090ac@intel.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
 (mx1.redhat.com [10.5.110.27]); Fri, 29 Mar 2019 13:36:11 +0000 (UTC)
Subject: Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Fri, 29 Mar 2019 13:36:12 -0000



On 3/29/19 2:24 PM, Burakov, Anatoly wrote:
> On 29-Mar-19 12:40 PM, Thomas Monjalon wrote:
>> 29/03/2019 13:05, Burakov, Anatoly:
>>> On 29-Mar-19 11:34 AM, Thomas Monjalon wrote:
>>>> 29/03/2019 11:33, Burakov, Anatoly:
>>>>> On 29-Mar-19 9:50 AM, David Marchand wrote:
>>>>>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov
>>>>>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote:
>>>>>>
>>>>>>       Due to internal glibc limitations [1], DPDK may exhaust internal
>>>>>>       file descriptor limits when using smaller page sizes, which
>>>>>>       results in an inability to use system calls such as select() by
>>>>>>       user applications.
>>>>>>
>>>>>>       While the problem can be worked around using the
>>>>>>       --single-file-segments option, that does not work if --legacy-mem
>>>>>>       mode is also used. Add (yet another) EAL flag to disable storing
>>>>>>       fd's internally. This will sacrifice compatibility with virtio
>>>>>>       with a vhost-user backend, but at least select() and friends
>>>>>>       will work.
>>>>>>
>>>>>>       [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html
>>>>>>
>>>>>>
>>>>>> Sorry, I am a bit lost and I never took the time to look in the new
>>>>>> memory allocation system.
>>>>>> This gives the impression that we are accumulating workarounds,
>>>>>> between legacy-mem, single-file-segments, and now no-seg-fds.
>>>>>
>>>>> Yep. I don't like this any more than you do, but I think there are
>>>>> users of all of these, so we can't just drop them willy-nilly. My
>>>>> great hope was that by now everyone would have moved on to using VFIO,
>>>>> so legacy mem wouldn't be needed (the only reason it exists is to
>>>>> provide compatibility for use cases where lots of IOVA-contiguous
>>>>> memory is required and VFIO cannot be used), but apparently that is
>>>>> too much to ask :/
>>>>>
>>>>>>
>>>>>> IIUC, everything revolves around the need for per-page locks.
>>>>>> Can you summarize why we need them?
>>>>>
>>>>> The short answer is multiprocess. We have to be able to map and unmap
>>>>> pages individually, and for that we need to be sure that we can, in
>>>>> fact, remove a page because no one else uses it. We also need to store
>>>>> fd's because virtio with a vhost-user backend needs them to work,
>>>>> because it relies on sharing memory between processes using fd's.

I guess you mean virtio-user.
Have you looked at how QEMU shares the guest memory with external
processes like the vhost-user backend? It works quite well with 2MB
pages, even with large VMs.

>>>>
>>>> It's a pity to add an option to work around a limitation of a
>>>> corner case.
>>>> It adds complexity that we will have to support forever,
>>>> and it's still not perfect because of vhost.
>>>>
>>>> Might there be another solution?
>>>>
>>>
>>> If there is one, I'm all ears. I don't see any solutions aside from
>>> adding limitations.
>>>
>>> For example, we could drop the single/multi-file segments modes and
>>> just make single-file segments the default and only available mode,
>>> but this carries certain risks because older kernels do not support
>>> fallocate() on hugetlbfs.
>>>
>>> We could further draw a line in the sand and say that, for example,
>>> 19.11 (or 20.11) will not have legacy mem mode, and that everyone
>>> should be using VFIO by now, and if you aren't, it's your own fault.
>>>
>>> We could also cut down on the number of fd's we use in single-file
>>> segments mode by not using locks and simply deleting pages in the
>>> primary, but yanking out hugepages from under secondaries' feet makes me
>>> feel uneasy, even if technically by the time that happens, they're not
>>> supposed to be used anyway. This could mean that the patch is no longer
>>> necessary because we don't use that many fd's any more.
>>
>> This last option is interesting. Is it realistic?
>>
> 
> I can do it in the current release cycle, but I'm not sure if it's too
> late to do such changes. I guess it's OK since the validation cycle is
> just starting? I'll throw something together and see if it crashes and
> burns.
> 

Reducing the number of FDs is really important IMHO, as the application
using the DPDK library may also need several FDs for other purposes.
