* [dpdk-dev] Question about DPDK hugepage fd change
@ 2019-02-05 18:56 Iain Barker
2019-02-05 20:29 ` Wiles, Keith
0 siblings, 1 reply; 13+ messages in thread
From: Iain Barker @ 2019-02-05 18:56 UTC (permalink / raw)
To: dev; +Cc: edwin.leung
Hi everyone,
We just updated our application from DPDK 17.11.4 (LTS) to DPDK 18.11 (LTS) and we noticed a regression.
Our host platform is providing 2MB huge pages, so for 8GB reservation this means 4000 pages are allocated.
This worked fine in the prior LTS, but after upgrading DPDK what we are seeing is that select() on an fd is failing.
select() works fine when the process starts up, but does not work after DPDK has been initialized.
We did some investigation and found in the DPDK patches linked below, the hugepage tracking mechanism was changed from mmap to an array of file descriptors, and the rlimit for fd's is raised from the default to allow more fd's to be open.
https://mails.dpdk.org/archives/dev/2018-September/110890.html
https://mails.dpdk.org/archives/dev/2018-September/110889.html
The problem is that the GNU C library (glibc) has a limit on the maximum fd that can be passed to select(): it is hard-coded at 1024 in the POSIX header file and in libc (and, as a result, probably in many other OS libraries too).
Raising the rlimit so that fd numbers can exceed 1024 has undefined results for select(), per the man page:
http://man7.org/linux/man-pages/man2/select.2.html
An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET()
with a value of fd that is negative or is equal to or larger than
FD_SETSIZE will result in undefined behavior. Moreover, POSIX
requires fd to be a valid file descriptor.
The Linux kernel allows file descriptor sets of arbitrary size,
determining the length of the sets to be checked from the value of
nfds. However, in the glibc implementation, the fd_set type is fixed
in size.
Specifically, libc's header include/sys/select.h has an array of fd's which is FD_SETSIZE deep.
__fd_mask fds_bits[__FD_SETSIZE / __NFDBITS];
and /usr/include/linux/posix_types.h is hard-coded with
#define __FD_SETSIZE 1024
As this define and array are in libc, they are used in many libraries on a Linux system. So to use setsize >1024 means recompiling OS libraries and any other package that needs to use FDs, or ensuring that no library used by the application ever calls select() on an fd set. That seems an unreasonable burden...
Any thoughts?
thanks,
Iain
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [dpdk-dev] Question about DPDK hugepage fd change
2019-02-05 18:56 Iain Barker
@ 2019-02-05 20:29 ` Wiles, Keith
2019-02-05 21:27 ` Iain Barker
0 siblings, 1 reply; 13+ messages in thread
From: Wiles, Keith @ 2019-02-05 20:29 UTC (permalink / raw)
To: Iain Barker; +Cc: dev, edwin.leung
> On Feb 5, 2019, at 12:56 PM, Iain Barker <iain.barker@oracle.com> wrote:
>
> Hi everyone,
>
> We just updated our application from DPDK 17.11.4 (LTS) to DPDK 18.11 (LTS) and we noticed a regression.
>
> Our host platform is providing 2MB huge pages, so for 8GB reservation this means 4000 pages are allocated.
>
> This worked fine in the prior LTS, but after upgrading DPDK what we are seeing is that select() on an fd is failing.
>
> select() works fine when the process starts up, but does not work after DPDK has been initialized.
>
> We did some investigation and found in the DPDK patches linked below, the hugepage tracking mechanism was changed from mmap to an array of file descriptors, and the rlimit for fd's is raised from the default to allow more fd's to be open.
>
> https://mails.dpdk.org/archives/dev/2018-September/110890.html
> https://mails.dpdk.org/archives/dev/2018-September/110889.html
>
> The problem is that the GNU C library (glibc) has a limit for the maximum fd passed to select(), and is hard-coded in the POSIX header file and libc at 1024 (and probably many other OS libraries too as a result).
>
> Raising the rlimit for fd >1024 has undefined results, per the manpage:
>
> http://man7.org/linux/man-pages/man2/select.2.html
> An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET()
> with a value of fd that is negative or is equal to or larger than
> FD_SETSIZE will result in undefined behavior. Moreover, POSIX
> requires fd to be a valid file descriptor.
>
> The Linux kernel allows file descriptor sets of arbitrary size,
> determining the length of the sets to be checked from the value of
> nfds. However, in the glibc implementation, the fd_set type is fixed
> in size.
>
> Specifically, libc's header include/sys/select.h has an array of fd's which is FD_SETSIZE deep.
> __fd_mask fds_bits[__FD_SETSIZE / __NFDBITS];
>
> and usr/include/linux/posix_types.h is hard-coded with
> #define __FD_SETSIZE 1024
>
> As this define and array are in libc, they are used in many libraries on a Linux system. So to use setsize >1024 means recompiling OS libraries and any other package that needs to use FDs, or ensuring that no library used by the application ever calls select() on an fd set. That seems an unreasonable burden...
>
> Any thoughts?
Would poll work here instead?
>
> thanks,
> Iain
Regards,
Keith
* Re: [dpdk-dev] Question about DPDK hugepage fd change
2019-02-05 20:29 ` Wiles, Keith
@ 2019-02-05 21:27 ` Iain Barker
2019-02-05 21:36 ` Wiles, Keith
0 siblings, 1 reply; 13+ messages in thread
From: Iain Barker @ 2019-02-05 21:27 UTC (permalink / raw)
To: Wiles, Keith; +Cc: dev, edwin.leung
>
> Would poll work here instead?
Poll (or epoll) would definitely work - if we controlled the source and compilation of all the libraries that the application links against.
But an app doesn’t know how the libraries in the OS are implemented. We’d have no way to ensure select() isn’t called by a shared library - the first we would know is when the application randomly failed.
Seems pretty clear that the newer DPDK library breaks glibc's requirement that fds used with select() stay below 1024. The previous DPDK design was able to mmap the huge pages without requiring thousands of open file descriptors...
* Re: [dpdk-dev] Question about DPDK hugepage fd change
2019-02-05 21:27 ` Iain Barker
@ 2019-02-05 21:36 ` Wiles, Keith
2019-02-05 21:49 ` Iain Barker
0 siblings, 1 reply; 13+ messages in thread
From: Wiles, Keith @ 2019-02-05 21:36 UTC (permalink / raw)
To: Iain Barker; +Cc: dev, edwin.leung
> On Feb 5, 2019, at 3:27 PM, Iain Barker <iain.barker@oracle.com> wrote:
>
>>
>> Would poll work here instead?
>
> Poll (or epoll) would definitely work - if we controlled the source and compilation of all the libraries that the application links against.
>
> But an app doesn’t know how the libraries in the OS are implemented. We’d have no way to ensure select() isn’t called by a shared library - the first we would know is when the application randomly failed.
>
> Seems pretty clear that the newer DPDK library is breaking the requirements of GNU libc to use less than 1024 file descriptors. The previous DPDK design was able to mmap the huge pages without requiring thousands of open file descriptors…
Maybe I do not see the full problem here. If DPDK used poll instead of select, it would solve the 1024 problem, as poll has a much higher limit on the number of file descriptors; at least that was my assumption. I do not see us sending these file descriptors to other applications using select. If an fd is sent to another application, it must be passed via the kernel and be converted to that process's fd value. Did I miss a use case here, or am I just missing the point?
>
>
Regards,
Keith
* Re: [dpdk-dev] Question about DPDK hugepage fd change
2019-02-05 21:36 ` Wiles, Keith
@ 2019-02-05 21:49 ` Iain Barker
2019-02-05 22:02 ` Wiles, Keith
0 siblings, 1 reply; 13+ messages in thread
From: Iain Barker @ 2019-02-05 21:49 UTC (permalink / raw)
To: Wiles, Keith; +Cc: dev, edwin.leung
>
> Maybe I do not see the full problem here. If DPDK used poll instead of select it would solve the 1024 problem as poll has a high limit to the number of file descriptors at least that was my assumption.
>>
Thanks Keith.
The issue is not whether DPDK is using poll or select on the fd’s.
The issue is that DPDK is raising the per-process number of fd’s above the maximum that glibc supports for select().
Therefore no other code within that process can reliably use select() on an fd set, because any file that is opened may get an fd number > 1024.
* Re: [dpdk-dev] Question about DPDK hugepage fd change
2019-02-05 21:49 ` Iain Barker
@ 2019-02-05 22:02 ` Wiles, Keith
2019-02-06 13:57 ` Iain Barker
0 siblings, 1 reply; 13+ messages in thread
From: Wiles, Keith @ 2019-02-05 22:02 UTC (permalink / raw)
To: Iain Barker; +Cc: dev, edwin.leung, Burakov, Anatoly
> On Feb 5, 2019, at 3:49 PM, Iain Barker <iain.barker@oracle.com> wrote:
>
>
>>
>> Maybe I do not see the full problem here. If DPDK used poll instead of select it would solve the 1024 problem as poll has a high limit to the number of file descriptors at least that was my assumption.
>>>
>
> Thanks Keith.
>
> The issue is not whether DPDK is using poll or select on the fd’s.
>
> The issue is that DPDK is raising the per-process number of fd’s above the maximum that glibc supports for select().
>
> Therefore no other code within that process can reliably use select() on an fd set, because any file that is opened may get an fd number > 1024.
I see now. Then it means the application must also use poll, or we have to release these fd's after use to free up some fd's below 1024 (which I assume is not possible), or something else needs to happen. These are questions for Anatoly, I guess.
Can you use 1G hugepages instead of 2M pages, or a combo of the two? I am not sure how DPDK handles having both in the system.
>
Regards,
Keith
* Re: [dpdk-dev] Question about DPDK hugepage fd change
2019-02-05 22:02 ` Wiles, Keith
@ 2019-02-06 13:57 ` Iain Barker
2019-02-07 11:15 ` Burakov, Anatoly
2019-02-22 17:08 ` Burakov, Anatoly
0 siblings, 2 replies; 13+ messages in thread
From: Iain Barker @ 2019-02-06 13:57 UTC (permalink / raw)
To: Wiles, Keith; +Cc: dev, Edwin Leung, Burakov, Anatoly
> Can you use 1G hugepages instead of 2M pages or a combo of the two, not sure how dpdk handles having both in the system?
Unfortunately, no. Some of our customer deployments are tenancies on KVM hosts and low-end appliances, which are not configurable by the end user to enable 1G huge pages.
I think we are going to have to revert this patch set from our build, as I don't see any other alternative for using DPDK 18 while remaining compliant with the POSIX/glibc requirements.
* Re: [dpdk-dev] Question about DPDK hugepage fd change
2019-02-06 13:57 ` Iain Barker
@ 2019-02-07 11:15 ` Burakov, Anatoly
2019-02-22 17:08 ` Burakov, Anatoly
1 sibling, 0 replies; 13+ messages in thread
From: Burakov, Anatoly @ 2019-02-07 11:15 UTC (permalink / raw)
To: Iain Barker, Wiles, Keith; +Cc: dev, Edwin Leung
On 06-Feb-19 1:57 PM, Iain Barker wrote:
>> Can you use 1G hugepages instead of 2M pages or a combo of the two, not sure how dpdk handles having both in the system?
>
> Unfortunately, no. Some of our customer deployments are tenancies on KVM hosts and low-end appliances, which are not configurable by the end user to enable 1G huge pages.
>
> I think we are going to have to revert this patch set from our build, as I don't see any other alternative for using DPDK 18 whilst remaining compliant to the POSIX/glibc requirements.
>
Yep, apologies for that. I think a new command-line flag to disable this
functionality should solve the issue, but we already have enough of those...
--
Thanks,
Anatoly
* Re: [dpdk-dev] Question about DPDK hugepage fd change
2019-02-06 13:57 ` Iain Barker
2019-02-07 11:15 ` Burakov, Anatoly
@ 2019-02-22 17:08 ` Burakov, Anatoly
2019-02-27 13:57 ` Iain Barker
1 sibling, 1 reply; 13+ messages in thread
From: Burakov, Anatoly @ 2019-02-22 17:08 UTC (permalink / raw)
To: Iain Barker, Wiles, Keith; +Cc: dev, Edwin Leung
On 06-Feb-19 1:57 PM, Iain Barker wrote:
>> Can you use 1G hugepages instead of 2M pages or a combo of the two, not sure how dpdk handles having both in the system?
>
> Unfortunately, no. Some of our customer deployments are tenancies on KVM hosts and low-end appliances, which are not configurable by the end user to enable 1G huge pages.
>
> I think we are going to have to revert this patch set from our build, as I don't see any other alternative for using DPDK 18 whilst remaining compliant to the POSIX/glibc requirements.
>
I just realized that, unless you're using the --legacy-mem switch, one
other way to alleviate the issue would be the --single-file-segments
option. This will still store the fd's, but only one per memseg list
rather than one per page. So, instead of thousands of fd's with 2MB
pages, you'd end up with under 10. Hope this helps!
--
Thanks,
Anatoly
* Re: [dpdk-dev] Question about DPDK hugepage fd change
2019-02-22 17:08 ` Burakov, Anatoly
@ 2019-02-27 13:57 ` Iain Barker
2019-02-27 18:02 ` Edwin Leung
0 siblings, 1 reply; 13+ messages in thread
From: Iain Barker @ 2019-02-27 13:57 UTC (permalink / raw)
To: Burakov, Anatoly, Wiles, Keith; +Cc: dev, Edwin Leung
Original Message from: Burakov, Anatoly [mailto:anatoly.burakov@intel.com]
>I just realized that, unless you're using --legacy-mem switch, one other
>way to alleviate the issue would be to use --single-file-segments
>option. This will still store the fd's, however it will only do so per
>memseg list, not per page. So, instead of 1000's of fd's with 2MB pages,
>you'd end up with under 10. Hope this helps!
Hi Anatoly,
Thanks for the update and suggestion. We did try using --single-file-segments previously. Although it lowers the amount of fd's allocated for tracking the segments as you noted, there is still a problem.
It seems that a .lock file is created for each huge page, not for each segment. So with 2MB pages the glibc limit of 1024 fd's is still exhausted quickly if there is ~2GB of 2MB huge pages.
Edwin can provide more details from his testing. In our case it happens much sooner: as we already use >500 fd's for the application, just 1GB of 2MB huge pages is enough to hit the fd limit due to the .lock files.
Thanks.
* Re: [dpdk-dev] Question about DPDK hugepage fd change
2019-02-27 13:57 ` Iain Barker
@ 2019-02-27 18:02 ` Edwin Leung
2019-02-28 10:36 ` Burakov, Anatoly
0 siblings, 1 reply; 13+ messages in thread
From: Edwin Leung @ 2019-02-27 18:02 UTC (permalink / raw)
To: Iain Barker, Burakov, Anatoly, Wiles, Keith; +Cc: dev
Hi Anatoly,
In my test for DPDK 18.11, I notice the following:
1. Using --legacy-mem switch, DPDK still opens 1 fd/huge page. In essence, it is the same with or without this switch.
2. Using --single-file-segments does reduce the open fd to 1. However, for each huge page that is in-use, a .lock file is opened. As a result, it still uses up a large number of fd's.
Thanks.
-- edwin
-----Original Message-----
From: Iain Barker
Sent: Wednesday, February 27, 2019 8:57 AM
To: Burakov, Anatoly <anatoly.burakov@intel.com>; Wiles, Keith <keith.wiles@intel.com>
Cc: dev@dpdk.org; Edwin Leung <edwin.leung@oracle.com>
Subject: RE: [dpdk-dev] Question about DPDK hugepage fd change
Original Message from: Burakov, Anatoly [mailto:anatoly.burakov@intel.com]
>I just realized that, unless you're using --legacy-mem switch, one
>other way to alleviate the issue would be to use --single-file-segments
>option. This will still store the fd's, however it will only do so per
>memseg list, not per page. So, instead of 1000's of fd's with 2MB
>pages, you'd end up with under 10. Hope this helps!
Hi Anatoly,
Thanks for the update and suggestion. We did try using --single-file-segments previously. Although it lowers the amount of fd's allocated for tracking the segments as you noted, there is still a problem.
It seems that a .lock file is created for each huge page, not for each segment. So with 2MB pages the glibc limit of 1024 fd's is still exhausted quickly if there is ~2GB of 2MB huge pages.
Edwin can provide more details from his testing. In our case much sooner, as we already use >500 fd's for the application, just 1GB of 2MB huge pages is enough to hit the fd limit due to the .lock files.
Thanks.
* Re: [dpdk-dev] Question about DPDK hugepage fd change
2019-02-27 18:02 ` Edwin Leung
@ 2019-02-28 10:36 ` Burakov, Anatoly
0 siblings, 0 replies; 13+ messages in thread
From: Burakov, Anatoly @ 2019-02-28 10:36 UTC (permalink / raw)
To: Edwin Leung, Iain Barker, Wiles, Keith; +Cc: dev
On 27-Feb-19 6:02 PM, Edwin Leung wrote:
> Hi Anatoly,
>
> In my test for DPDK 18.11, I notice the following:
>
> 1. Using --legacy-mem switch, DPDK still opens 1 fd/huge page. In essence, it is the same with or without this switch.
>
> 2. Using --single-file-segments does reduce the open fd to 1. However, for each huge page that is in-use, a .lock file is opened. As a result, it still uses up a large number of fd's.
>
> Thanks.
> -- edwin
>
> -----Original Message-----
> From: Iain Barker
> Sent: Wednesday, February 27, 2019 8:57 AM
> To: Burakov, Anatoly <anatoly.burakov@intel.com>; Wiles, Keith <keith.wiles@intel.com>
> Cc: dev@dpdk.org; Edwin Leung <edwin.leung@oracle.com>
> Subject: RE: [dpdk-dev] Question about DPDK hugepage fd change
>
> Original Message from: Burakov, Anatoly [mailto:anatoly.burakov@intel.com]
>
>> I just realized that, unless you're using --legacy-mem switch, one
>> other way to alleviate the issue would be to use --single-file-segments
>> option. This will still store the fd's, however it will only do so per
>> memseg list, not per page. So, instead of 1000's of fd's with 2MB
>> pages, you'd end up with under 10. Hope this helps!
>
> Hi Anatoly,
>
> Thanks for the update and suggestion. We did try using --single-file-segments previously. Although it lowers the amount of fd's allocated for tracking the segments as you noted, there is still a problem.
>
> It seems that a .lock file is created for each huge page, not for each segment. So with 2MB pages the glibc limit of 1024 fd's is still exhausted quickly if there is ~2GB of 2MB huge pages.
>
> Edwin can provide more details from his testing. In our case much sooner, as we already use >500 fd's for the application, just 1GB of 2MB huge pages is enough to hit the fd limit due to the .lock files.
>
> Thanks.
>
Right, I forgot about that. Thanks for noticing! :)
By the way, I've proposed a patch for 19.05 to address this issue. The
downside is that you'd lose virtio with vhost-user backend support:
http://patches.dpdk.org/patch/50469/
It would be good if you tested it and reported back. Thanks!
(I should fix the wording of the documentation to avoid mentioning
--single-file-segments as a solution - I completely forgot that it
creates lock files...)
--
Thanks,
Anatoly