* [dpdk-dev] [PATCH] eal: add option to not store segment fd's @ 2019-02-22 17:12 Anatoly Burakov 2019-03-29 9:50 ` David Marchand ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Anatoly Burakov @ 2019-02-22 17:12 UTC (permalink / raw) To: dev; +Cc: John McNamara, Marko Kovacevic, iain.barker, edwin.leung Due to internal glibc limitations [1], DPDK may exhaust internal file descriptor limits when using smaller page sizes, which results in inability to use system calls such as select() by user applications. While the problem can be worked around using --single-file-segments option, it does not work if --legacy-mem mode is also used. Add a (yet another) EAL flag to disable storing fd's internally. This will sacrifice compability with Virtio with vhost-backend, but at least select() and friends will work. [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- doc/guides/linux_gsg/linux_eal_parameters.rst | 4 ++++ .../prog_guide/env_abstraction_layer.rst | 19 +++++++++++++++++++ lib/librte_eal/common/eal_internal_cfg.h | 4 ++++ lib/librte_eal/common/eal_options.h | 2 ++ lib/librte_eal/linuxapp/eal/eal.c | 4 ++++ lib/librte_eal/linuxapp/eal/eal_memalloc.c | 19 ++++++++++++++++++- 6 files changed, 51 insertions(+), 1 deletion(-) diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst index c63f0f49a..d50a7067e 100644 --- a/doc/guides/linux_gsg/linux_eal_parameters.rst +++ b/doc/guides/linux_gsg/linux_eal_parameters.rst @@ -94,6 +94,10 @@ Memory-related options Free hugepages back to system exactly as they were originally allocated. +* ``--no-seg-fds`` + + Do not store segment file descriptors in EAL. + Other options ~~~~~~~~~~~~~ diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst index 929d76dba..ad540f158 100644 --- a/doc/guides/prog_guide/env_abstraction_layer.rst +++ b/doc/guides/prog_guide/env_abstraction_layer.rst @@ -214,6 +214,25 @@ Normally, these options do not need to be changed. can later be mapped into that preallocated VA space (if dynamic memory mode is enabled), and can optionally be mapped into it at startup. ++ Segment file descriptors + +On Linux, in most cases, EAL will store segment file descriptors in EAL. This +can become a problem when using smaller page sizes due to underlying limitations +of ``glibc`` library. For example, Linux API calls such as ``select()`` may not +work correctly because ``glibc`` does not support more than certain number of +file descriptors. + +There are several possible workarounds for this issue. One is to use +``--single-file-segments`` mode, as that mode will not use a file descriptor per +each page. This is the recommended way of solving this issue, as it keeps +compatibility with Virtio with vhost-user backend. This option is not available +when using ``--legacy-mem`` mode. + +The other option is to use ``--no-seg-fds`` command-line parameter, +to prevent EAL from storing any page file descriptors. This will break +compatibility with Virtio with vhost-user backend, but this option will work +with ``--legacy-mem`` mode. + Support for Externally Allocated Memory ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h index 60eaead8f..96596c6b6 100644 --- a/lib/librte_eal/common/eal_internal_cfg.h +++ b/lib/librte_eal/common/eal_internal_cfg.h @@ -63,6 +63,10 @@ struct internal_config { /**< true if storing all pages within single files (per-page-size, * per-node) non-legacy mode only. */ + volatile unsigned no_seg_fds; + /**< true if no segment file descriptors are to be stored internally + * by EAL. + */ volatile int syslog_facility; /**< facility passed to openlog() */ /** default interrupt mode for VFIO */ volatile enum rte_intr_mode vfio_intr_mode; diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h index 58ee9ae33..94e39aed8 100644 --- a/lib/librte_eal/common/eal_options.h +++ b/lib/librte_eal/common/eal_options.h @@ -67,6 +67,8 @@ enum { OPT_IOVA_MODE_NUM, #define OPT_MATCH_ALLOCATIONS "match-allocations" OPT_MATCH_ALLOCATIONS_NUM, +#define OPT_NO_SEG_FDS "no-seg-fds" + OPT_NO_SEG_FDS_NUM, OPT_LONG_MAX_NUM }; diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c index 13f401684..e8a98c505 100644 --- a/lib/librte_eal/linuxapp/eal/eal.c +++ b/lib/librte_eal/linuxapp/eal/eal.c @@ -519,6 +519,7 @@ eal_usage(const char *prgname) " --"OPT_LEGACY_MEM" Legacy memory mode (no dynamic allocation, contiguous segments)\n" " --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n" " --"OPT_MATCH_ALLOCATIONS" Free hugepages exactly as allocated\n" + " --"OPT_NO_SEG_FDS" Do not store segment file descriptors in EAL\n" "\n"); /* Allow the application to print its usage message too if hook is set */ if ( rte_application_usage_hook ) { @@ -815,6 +816,9 @@ eal_parse_args(int argc, char **argv) case OPT_MATCH_ALLOCATIONS_NUM: internal_config.match_allocations = 1; break; + case OPT_NO_SEG_FDS_NUM: + internal_config.no_seg_fds = 1; + break; default: if (opt < OPT_LONG_MIN_NUM && isprint(opt)) { diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index b6fb183db..420f82a54 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -1518,6 +1518,10 @@ eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd) if (internal_config.single_file_segments) return -ENOTSUP; + /* no seg fds mode doesn't support segment fd's */ + if (internal_config.no_seg_fds) + return -ENOTSUP; + /* if list is not allocated, allocate it */ if (fd_list[list_idx].len == 0) { int len = mcfg->memsegs[list_idx].memseg_arr.len; @@ -1539,6 +1543,10 @@ eal_memalloc_set_seg_list_fd(int list_idx, int fd) if (!internal_config.single_file_segments) return -ENOTSUP; + /* no seg fds mode doesn't support segment fd's */ + if (internal_config.no_seg_fds) + return -ENOTSUP; + /* if list is not allocated, allocate it */ if (fd_list[list_idx].len == 0) { int len = mcfg->memsegs[list_idx].memseg_arr.len; @@ -1557,6 +1565,10 @@ eal_memalloc_get_seg_fd(int list_idx, int seg_idx) { int fd; + /* no seg fds mode doesn't support segment fd's */ + if (internal_config.no_seg_fds) + return -ENOTSUP; + if (internal_config.in_memory || internal_config.no_hugetlbfs) { #ifndef MEMFD_SUPPORTED /* in in-memory or no-huge mode, we rely on memfd support */ @@ -1614,6 +1626,10 @@ eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset) { struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + /* no seg fds mode doesn't support segment fd's */ + if (internal_config.no_seg_fds) + return -ENOTSUP; + if (internal_config.in_memory || internal_config.no_hugetlbfs) { #ifndef MEMFD_SUPPORTED /* in in-memory or no-huge mode, we rely on memfd support */ @@ -1679,7 +1695,8 @@ eal_memalloc_init(void) } /* initialize all of the fd lists */ - if (rte_memseg_list_walk(fd_list_create_walk, NULL)) + if (!internal_config.no_seg_fds && + rte_memseg_list_walk(fd_list_create_walk, NULL)) return -1; return 0; } -- 2.17.1 ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-02-22 17:12 [dpdk-dev] [PATCH] eal: add option to not store segment fd's Anatoly Burakov @ 2019-03-29 9:50 ` David Marchand 2019-03-29 9:50 ` David Marchand 2019-03-29 10:33 ` Burakov, Anatoly 2019-03-29 17:55 ` [dpdk-dev] [PATCH v2 1/2] memalloc: refactor segment resizing code Anatoly Burakov 2019-03-29 17:55 ` [dpdk-dev] [PATCH v2 2/2] memalloc: do not use lockfiles for single file segments mode Anatoly Burakov 2 siblings, 2 replies; 25+ messages in thread From: David Marchand @ 2019-03-29 9:50 UTC (permalink / raw) To: Anatoly Burakov Cc: dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, Thomas Monjalon On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov <anatoly.burakov@intel.com> wrote: > Due to internal glibc limitations [1], DPDK may exhaust internal > file descriptor limits when using smaller page sizes, which results > in inability to use system calls such as select() by user > applications. > > While the problem can be worked around using --single-file-segments > option, it does not work if --legacy-mem mode is also used. Add a > (yet another) EAL flag to disable storing fd's internally. This > will sacrifice compability with Virtio with vhost-backend, but > at least select() and friends will work. > > [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html Sorry, I am a bit lost and I never took the time to look in the new memory allocation system. This gives the impression that we are accumulating workarounds, between legacy-mem, single-file-segments, now no-seg-fds. Iiuc, everything revolves around the need for per page locks. Can you summarize why we need them? Thanks. -- David Marchand ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 9:50 ` David Marchand @ 2019-03-29 9:50 ` David Marchand 2019-03-29 10:33 ` Burakov, Anatoly 1 sibling, 0 replies; 25+ messages in thread From: David Marchand @ 2019-03-29 9:50 UTC (permalink / raw) To: Anatoly Burakov Cc: dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, Thomas Monjalon On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov <anatoly.burakov@intel.com> wrote: > Due to internal glibc limitations [1], DPDK may exhaust internal > file descriptor limits when using smaller page sizes, which results > in inability to use system calls such as select() by user > applications. > > While the problem can be worked around using --single-file-segments > option, it does not work if --legacy-mem mode is also used. Add a > (yet another) EAL flag to disable storing fd's internally. This > will sacrifice compability with Virtio with vhost-backend, but > at least select() and friends will work. > > [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html Sorry, I am a bit lost and I never took the time to look in the new memory allocation system. This gives the impression that we are accumulating workarounds, between legacy-mem, single-file-segments, now no-seg-fds. Iiuc, everything revolves around the need for per page locks. Can you summarize why we need them? Thanks. -- David Marchand ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 9:50 ` David Marchand 2019-03-29 9:50 ` David Marchand @ 2019-03-29 10:33 ` Burakov, Anatoly 2019-03-29 10:33 ` Burakov, Anatoly 2019-03-29 11:34 ` Thomas Monjalon 1 sibling, 2 replies; 25+ messages in thread From: Burakov, Anatoly @ 2019-03-29 10:33 UTC (permalink / raw) To: David Marchand Cc: dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, Thomas Monjalon On 29-Mar-19 9:50 AM, David Marchand wrote: > On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov > <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: > > Due to internal glibc limitations [1], DPDK may exhaust internal > file descriptor limits when using smaller page sizes, which results > in inability to use system calls such as select() by user > applications. > > While the problem can be worked around using --single-file-segments > option, it does not work if --legacy-mem mode is also used. Add a > (yet another) EAL flag to disable storing fd's internally. This > will sacrifice compability with Virtio with vhost-backend, but > at least select() and friends will work. > > [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html > > > Sorry, I am a bit lost and I never took the time to look in the new > memory allocation system. > This gives the impression that we are accumulating workarounds, between > legacy-mem, single-file-segments, now no-seg-fds. Yep. I don't like this any more than you do, but i think there are users of all of these, so we can't just drop them willy-nilly. My great hope was that by now everyone would move on to use VFIO so legacy mem wouldn't be needed (the only reason it exists is to provide compatibility for use cases where lots of IOVA-contiguous memory is required, and VFIO cannot be used), but apparently that is too much to ask :/ > > Iiuc, everything revolves around the need for per page locks. > Can you summarize why we need them? The short answer is multiprocess. We have to be able to map and unmap pages individually, and for that we need to be sure that we can, in fact, remove a page because no one else uses it. We also need to store fd's because virtio with vhost-user backend needs them to work, because it relies on sharing memory between processes using fd's. > > Thanks. > > -- > David Marchand -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 10:33 ` Burakov, Anatoly @ 2019-03-29 10:33 ` Burakov, Anatoly 2019-03-29 11:34 ` Thomas Monjalon 1 sibling, 0 replies; 25+ messages in thread From: Burakov, Anatoly @ 2019-03-29 10:33 UTC (permalink / raw) To: David Marchand Cc: dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, Thomas Monjalon On 29-Mar-19 9:50 AM, David Marchand wrote: > On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov > <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: > > Due to internal glibc limitations [1], DPDK may exhaust internal > file descriptor limits when using smaller page sizes, which results > in inability to use system calls such as select() by user > applications. > > While the problem can be worked around using --single-file-segments > option, it does not work if --legacy-mem mode is also used. Add a > (yet another) EAL flag to disable storing fd's internally. This > will sacrifice compability with Virtio with vhost-backend, but > at least select() and friends will work. > > [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html > > > Sorry, I am a bit lost and I never took the time to look in the new > memory allocation system. > This gives the impression that we are accumulating workarounds, between > legacy-mem, single-file-segments, now no-seg-fds. Yep. I don't like this any more than you do, but i think there are users of all of these, so we can't just drop them willy-nilly. My great hope was that by now everyone would move on to use VFIO so legacy mem wouldn't be needed (the only reason it exists is to provide compatibility for use cases where lots of IOVA-contiguous memory is required, and VFIO cannot be used), but apparently that is too much to ask :/ > > Iiuc, everything revolves around the need for per page locks. > Can you summarize why we need them? The short answer is multiprocess. We have to be able to map and unmap pages individually, and for that we need to be sure that we can, in fact, remove a page because no one else uses it. We also need to store fd's because virtio with vhost-user backend needs them to work, because it relies on sharing memory between processes using fd's. > > Thanks. > > -- > David Marchand -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 10:33 ` Burakov, Anatoly 2019-03-29 10:33 ` Burakov, Anatoly @ 2019-03-29 11:34 ` Thomas Monjalon 2019-03-29 11:34 ` Thomas Monjalon 2019-03-29 12:05 ` Burakov, Anatoly 1 sibling, 2 replies; 25+ messages in thread From: Thomas Monjalon @ 2019-03-29 11:34 UTC (permalink / raw) To: Burakov, Anatoly Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin 29/03/2019 11:33, Burakov, Anatoly: > On 29-Mar-19 9:50 AM, David Marchand wrote: > > On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov > > <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: > > > > Due to internal glibc limitations [1], DPDK may exhaust internal > > file descriptor limits when using smaller page sizes, which results > > in inability to use system calls such as select() by user > > applications. > > > > While the problem can be worked around using --single-file-segments > > option, it does not work if --legacy-mem mode is also used. Add a > > (yet another) EAL flag to disable storing fd's internally. This > > will sacrifice compability with Virtio with vhost-backend, but > > at least select() and friends will work. > > > > [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html > > > > > > Sorry, I am a bit lost and I never took the time to look in the new > > memory allocation system. > > This gives the impression that we are accumulating workarounds, between > > legacy-mem, single-file-segments, now no-seg-fds. > > Yep. I don't like this any more than you do, but i think there are users > of all of these, so we can't just drop them willy-nilly. My great hope > was that by now everyone would move on to use VFIO so legacy mem > wouldn't be needed (the only reason it exists is to provide > compatibility for use cases where lots of IOVA-contiguous memory is > required, and VFIO cannot be used), but apparently that is too much to > ask :/ > > > > > Iiuc, everything revolves around the need for per page locks. > > Can you summarize why we need them? > > The short answer is multiprocess. We have to be able to map and unmap > pages individually, and for that we need to be sure that we can, in > fact, remove a page because no one else uses it. We also need to store > fd's because virtio with vhost-user backend needs them to work, because > it relies on sharing memory between processes using fd's. It's a pity adding an option to workaround a limitation of a corner case. It adds complexity that we will have to support forever, and it's even not perfect because of vhost. Might there be another solution? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 11:34 ` Thomas Monjalon @ 2019-03-29 11:34 ` Thomas Monjalon 2019-03-29 12:05 ` Burakov, Anatoly 1 sibling, 0 replies; 25+ messages in thread From: Thomas Monjalon @ 2019-03-29 11:34 UTC (permalink / raw) To: Burakov, Anatoly Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin 29/03/2019 11:33, Burakov, Anatoly: > On 29-Mar-19 9:50 AM, David Marchand wrote: > > On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov > > <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: > > > > Due to internal glibc limitations [1], DPDK may exhaust internal > > file descriptor limits when using smaller page sizes, which results > > in inability to use system calls such as select() by user > > applications. > > > > While the problem can be worked around using --single-file-segments > > option, it does not work if --legacy-mem mode is also used. Add a > > (yet another) EAL flag to disable storing fd's internally. This > > will sacrifice compability with Virtio with vhost-backend, but > > at least select() and friends will work. > > > > [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html > > > > > > Sorry, I am a bit lost and I never took the time to look in the new > > memory allocation system. > > This gives the impression that we are accumulating workarounds, between > > legacy-mem, single-file-segments, now no-seg-fds. > > Yep. I don't like this any more than you do, but i think there are users > of all of these, so we can't just drop them willy-nilly. My great hope > was that by now everyone would move on to use VFIO so legacy mem > wouldn't be needed (the only reason it exists is to provide > compatibility for use cases where lots of IOVA-contiguous memory is > required, and VFIO cannot be used), but apparently that is too much to > ask :/ > > > > > Iiuc, everything revolves around the need for per page locks. > > Can you summarize why we need them? > > The short answer is multiprocess. We have to be able to map and unmap > pages individually, and for that we need to be sure that we can, in > fact, remove a page because no one else uses it. We also need to store > fd's because virtio with vhost-user backend needs them to work, because > it relies on sharing memory between processes using fd's. It's a pity adding an option to workaround a limitation of a corner case. It adds complexity that we will have to support forever, and it's even not perfect because of vhost. Might there be another solution? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 11:34 ` Thomas Monjalon 2019-03-29 11:34 ` Thomas Monjalon @ 2019-03-29 12:05 ` Burakov, Anatoly 2019-03-29 12:05 ` Burakov, Anatoly 2019-03-29 12:40 ` Thomas Monjalon 1 sibling, 2 replies; 25+ messages in thread From: Burakov, Anatoly @ 2019-03-29 12:05 UTC (permalink / raw) To: Thomas Monjalon Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: > 29/03/2019 11:33, Burakov, Anatoly: >> On 29-Mar-19 9:50 AM, David Marchand wrote: >>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov >>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: >>> >>> Due to internal glibc limitations [1], DPDK may exhaust internal >>> file descriptor limits when using smaller page sizes, which results >>> in inability to use system calls such as select() by user >>> applications. >>> >>> While the problem can be worked around using --single-file-segments >>> option, it does not work if --legacy-mem mode is also used. Add a >>> (yet another) EAL flag to disable storing fd's internally. This >>> will sacrifice compability with Virtio with vhost-backend, but >>> at least select() and friends will work. >>> >>> [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html >>> >>> >>> Sorry, I am a bit lost and I never took the time to look in the new >>> memory allocation system. >>> This gives the impression that we are accumulating workarounds, between >>> legacy-mem, single-file-segments, now no-seg-fds. >> >> Yep. I don't like this any more than you do, but i think there are users >> of all of these, so we can't just drop them willy-nilly. My great hope >> was that by now everyone would move on to use VFIO so legacy mem >> wouldn't be needed (the only reason it exists is to provide >> compatibility for use cases where lots of IOVA-contiguous memory is >> required, and VFIO cannot be used), but apparently that is too much to >> ask :/ >> >>> >>> Iiuc, everything revolves around the need for per page locks. >>> Can you summarize why we need them? >> >> The short answer is multiprocess. We have to be able to map and unmap >> pages individually, and for that we need to be sure that we can, in >> fact, remove a page because no one else uses it. We also need to store >> fd's because virtio with vhost-user backend needs them to work, because >> it relies on sharing memory between processes using fd's. > > It's a pity adding an option to workaround a limitation of a corner case. > It adds complexity that we will have to support forever, > and it's even not perfect because of vhost. > > Might there be another solution? > If there is one, i'm all ears. I don't see any solutions aside from adding limitations. For example, we could drop the single/multi file segments mode and just make single file segments a default and the only available mode, but this has certain risks because older kernels do not support fallocate() on hugetlbfs. We could further draw a line in the sand, and say that, for example, 19.11 (or 20.11) will not have legacy mem mode, and everyone should use VFIO by now and if you don't it's your own fault. We could also cut down on the number of fd's we use in single-file segments mode by not using locks and simply deleting pages in the primary, but yanking out hugepages from under secondaries' feet makes me feel uneasy, even if technically by the time that happens, they're not supposed to be used anyway. This could mean that the patch is no longer necessary because we don't use that many fd's any more. However, if we are to support all that we support now, the only option here is to pile on more workarounds. -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 12:05 ` Burakov, Anatoly @ 2019-03-29 12:05 ` Burakov, Anatoly 2019-03-29 12:40 ` Thomas Monjalon 1 sibling, 0 replies; 25+ messages in thread From: Burakov, Anatoly @ 2019-03-29 12:05 UTC (permalink / raw) To: Thomas Monjalon Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: > 29/03/2019 11:33, Burakov, Anatoly: >> On 29-Mar-19 9:50 AM, David Marchand wrote: >>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov >>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: >>> >>> Due to internal glibc limitations [1], DPDK may exhaust internal >>> file descriptor limits when using smaller page sizes, which results >>> in inability to use system calls such as select() by user >>> applications. >>> >>> While the problem can be worked around using --single-file-segments >>> option, it does not work if --legacy-mem mode is also used. Add a >>> (yet another) EAL flag to disable storing fd's internally. This >>> will sacrifice compability with Virtio with vhost-backend, but >>> at least select() and friends will work. >>> >>> [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html >>> >>> >>> Sorry, I am a bit lost and I never took the time to look in the new >>> memory allocation system. >>> This gives the impression that we are accumulating workarounds, between >>> legacy-mem, single-file-segments, now no-seg-fds. >> >> Yep. I don't like this any more than you do, but i think there are users >> of all of these, so we can't just drop them willy-nilly. My great hope >> was that by now everyone would move on to use VFIO so legacy mem >> wouldn't be needed (the only reason it exists is to provide >> compatibility for use cases where lots of IOVA-contiguous memory is >> required, and VFIO cannot be used), but apparently that is too much to >> ask :/ >> >>> >>> Iiuc, everything revolves around the need for per page locks. >>> Can you summarize why we need them? >> >> The short answer is multiprocess. We have to be able to map and unmap >> pages individually, and for that we need to be sure that we can, in >> fact, remove a page because no one else uses it. We also need to store >> fd's because virtio with vhost-user backend needs them to work, because >> it relies on sharing memory between processes using fd's. > > It's a pity adding an option to workaround a limitation of a corner case. > It adds complexity that we will have to support forever, > and it's even not perfect because of vhost. > > Might there be another solution? > If there is one, i'm all ears. I don't see any solutions aside from adding limitations. For example, we could drop the single/multi file segments mode and just make single file segments a default and the only available mode, but this has certain risks because older kernels do not support fallocate() on hugetlbfs. We could further draw a line in the sand, and say that, for example, 19.11 (or 20.11) will not have legacy mem mode, and everyone should use VFIO by now and if you don't it's your own fault. We could also cut down on the number of fd's we use in single-file segments mode by not using locks and simply deleting pages in the primary, but yanking out hugepages from under secondaries' feet makes me feel uneasy, even if technically by the time that happens, they're not supposed to be used anyway. This could mean that the patch is no longer necessary because we don't use that many fd's any more. However, if we are to support all that we support now, the only option here is to pile on more workarounds. -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 12:05 ` Burakov, Anatoly 2019-03-29 12:05 ` Burakov, Anatoly @ 2019-03-29 12:40 ` Thomas Monjalon 2019-03-29 12:40 ` Thomas Monjalon 2019-03-29 13:24 ` Burakov, Anatoly 1 sibling, 2 replies; 25+ messages in thread From: Thomas Monjalon @ 2019-03-29 12:40 UTC (permalink / raw) To: Burakov, Anatoly Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin 29/03/2019 13:05, Burakov, Anatoly: > On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: > > 29/03/2019 11:33, Burakov, Anatoly: > >> On 29-Mar-19 9:50 AM, David Marchand wrote: > >>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov > >>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: > >>> > >>> Due to internal glibc limitations [1], DPDK may exhaust internal > >>> file descriptor limits when using smaller page sizes, which results > >>> in inability to use system calls such as select() by user > >>> applications. > >>> > >>> While the problem can be worked around using --single-file-segments > >>> option, it does not work if --legacy-mem mode is also used. Add a > >>> (yet another) EAL flag to disable storing fd's internally. This > >>> will sacrifice compability with Virtio with vhost-backend, but > >>> at least select() and friends will work. > >>> > >>> [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html > >>> > >>> > >>> Sorry, I am a bit lost and I never took the time to look in the new > >>> memory allocation system. > >>> This gives the impression that we are accumulating workarounds, between > >>> legacy-mem, single-file-segments, now no-seg-fds. > >> > >> Yep. I don't like this any more than you do, but i think there are users > >> of all of these, so we can't just drop them willy-nilly. My great hope > >> was that by now everyone would move on to use VFIO so legacy mem > >> wouldn't be needed (the only reason it exists is to provide > >> compatibility for use cases where lots of IOVA-contiguous memory is > >> required, and VFIO cannot be used), but apparently that is too much to > >> ask :/ > >> > >>> > >>> Iiuc, everything revolves around the need for per page locks. > >>> Can you summarize why we need them? > >> > >> The short answer is multiprocess. We have to be able to map and unmap > >> pages individually, and for that we need to be sure that we can, in > >> fact, remove a page because no one else uses it. We also need to store > >> fd's because virtio with vhost-user backend needs them to work, because > >> it relies on sharing memory between processes using fd's. > > > > It's a pity adding an option to workaround a limitation of a corner case. > > It adds complexity that we will have to support forever, > > and it's even not perfect because of vhost. > > > > Might there be another solution? > > > > If there is one, i'm all ears. I don't see any solutions aside from > adding limitations. > > For example, we could drop the single/multi file segments mode and just > make single file segments a default and the only available mode, but > this has certain risks because older kernels do not support fallocate() > on hugetlbfs. > > We could further draw a line in the sand, and say that, for example, > 19.11 (or 20.11) will not have legacy mem mode, and everyone should use > VFIO by now and if you don't it's your own fault. > > We could also cut down on the number of fd's we use in single-file > segments mode by not using locks and simply deleting pages in the > primary, but yanking out hugepages from under secondaries' feet makes me > feel uneasy, even if technically by the time that happens, they're not > supposed to be used anyway. This could mean that the patch is no longer > necessary because we don't use that many fd's any more. This last option is interesting. Is it realistic? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 12:40 ` Thomas Monjalon @ 2019-03-29 12:40 ` Thomas Monjalon 2019-03-29 13:24 ` Burakov, Anatoly 1 sibling, 0 replies; 25+ messages in thread From: Thomas Monjalon @ 2019-03-29 12:40 UTC (permalink / raw) To: Burakov, Anatoly Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin 29/03/2019 13:05, Burakov, Anatoly: > On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: > > 29/03/2019 11:33, Burakov, Anatoly: > >> On 29-Mar-19 9:50 AM, David Marchand wrote: > >>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov > >>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: > >>> > >>> Due to internal glibc limitations [1], DPDK may exhaust internal > >>> file descriptor limits when using smaller page sizes, which results > >>> in inability to use system calls such as select() by user > >>> applications. > >>> > >>> While the problem can be worked around using --single-file-segments > >>> option, it does not work if --legacy-mem mode is also used. Add a > >>> (yet another) EAL flag to disable storing fd's internally. This > >>> will sacrifice compability with Virtio with vhost-backend, but > >>> at least select() and friends will work. > >>> > >>> [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html > >>> > >>> > >>> Sorry, I am a bit lost and I never took the time to look in the new > >>> memory allocation system. > >>> This gives the impression that we are accumulating workarounds, between > >>> legacy-mem, single-file-segments, now no-seg-fds. > >> > >> Yep. I don't like this any more than you do, but i think there are users > >> of all of these, so we can't just drop them willy-nilly. My great hope > >> was that by now everyone would move on to use VFIO so legacy mem > >> wouldn't be needed (the only reason it exists is to provide > >> compatibility for use cases where lots of IOVA-contiguous memory is > >> required, and VFIO cannot be used), but apparently that is too much to > >> ask :/ > >> > >>> > >>> Iiuc, everything revolves around the need for per page locks. > >>> Can you summarize why we need them? > >> > >> The short answer is multiprocess. We have to be able to map and unmap > >> pages individually, and for that we need to be sure that we can, in > >> fact, remove a page because no one else uses it. We also need to store > >> fd's because virtio with vhost-user backend needs them to work, because > >> it relies on sharing memory between processes using fd's. > > > > It's a pity adding an option to workaround a limitation of a corner case. > > It adds complexity that we will have to support forever, > > and it's even not perfect because of vhost. > > > > Might there be another solution? > > > > If there is one, i'm all ears. I don't see any solutions aside from > adding limitations. > > For example, we could drop the single/multi file segments mode and just > make single file segments a default and the only available mode, but > this has certain risks because older kernels do not support fallocate() > on hugetlbfs. > > We could further draw a line in the sand, and say that, for example, > 19.11 (or 20.11) will not have legacy mem mode, and everyone should use > VFIO by now and if you don't it's your own fault. > > We could also cut down on the number of fd's we use in single-file > segments mode by not using locks and simply deleting pages in the > primary, but yanking out hugepages from under secondaries' feet makes me > feel uneasy, even if technically by the time that happens, they're not > supposed to be used anyway. This could mean that the patch is no longer > necessary because we don't use that many fd's any more. This last option is interesting. Is it realistic? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 12:40 ` Thomas Monjalon 2019-03-29 12:40 ` Thomas Monjalon @ 2019-03-29 13:24 ` Burakov, Anatoly 2019-03-29 13:24 ` Burakov, Anatoly ` (2 more replies) 1 sibling, 3 replies; 25+ messages in thread From: Burakov, Anatoly @ 2019-03-29 13:24 UTC (permalink / raw) To: Thomas Monjalon Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin On 29-Mar-19 12:40 PM, Thomas Monjalon wrote: > 29/03/2019 13:05, Burakov, Anatoly: >> On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: >>> 29/03/2019 11:33, Burakov, Anatoly: >>>> On 29-Mar-19 9:50 AM, David Marchand wrote: >>>>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov >>>>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: >>>>> >>>>> Due to internal glibc limitations [1], DPDK may exhaust internal >>>>> file descriptor limits when using smaller page sizes, which results >>>>> in inability to use system calls such as select() by user >>>>> applications. >>>>> >>>>> While the problem can be worked around using --single-file-segments >>>>> option, it does not work if --legacy-mem mode is also used. Add a >>>>> (yet another) EAL flag to disable storing fd's internally. This >>>>> will sacrifice compability with Virtio with vhost-backend, but >>>>> at least select() and friends will work. >>>>> >>>>> [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html >>>>> >>>>> >>>>> Sorry, I am a bit lost and I never took the time to look in the new >>>>> memory allocation system. >>>>> This gives the impression that we are accumulating workarounds, between >>>>> legacy-mem, single-file-segments, now no-seg-fds. >>>> >>>> Yep. I don't like this any more than you do, but i think there are users >>>> of all of these, so we can't just drop them willy-nilly. My great hope >>>> was that by now everyone would move on to use VFIO so legacy mem >>>> wouldn't be needed (the only reason it exists is to provide >>>> compatibility for use cases where lots of IOVA-contiguous memory is >>>> required, and VFIO cannot be used), but apparently that is too much to >>>> ask :/ >>>> >>>>> >>>>> Iiuc, everything revolves around the need for per page locks. >>>>> Can you summarize why we need them? >>>> >>>> The short answer is multiprocess. We have to be able to map and unmap >>>> pages individually, and for that we need to be sure that we can, in >>>> fact, remove a page because no one else uses it. We also need to store >>>> fd's because virtio with vhost-user backend needs them to work, because >>>> it relies on sharing memory between processes using fd's. >>> >>> It's a pity adding an option to workaround a limitation of a corner case. >>> It adds complexity that we will have to support forever, >>> and it's even not perfect because of vhost. >>> >>> Might there be another solution? >>> >> >> If there is one, i'm all ears. I don't see any solutions aside from >> adding limitations. >> >> For example, we could drop the single/multi file segments mode and just >> make single file segments a default and the only available mode, but >> this has certain risks because older kernels do not support fallocate() >> on hugetlbfs. >> >> We could further draw a line in the sand, and say that, for example, >> 19.11 (or 20.11) will not have legacy mem mode, and everyone should use >> VFIO by now and if you don't it's your own fault. >> >> We could also cut down on the number of fd's we use in single-file >> segments mode by not using locks and simply deleting pages in the >> primary, but yanking out hugepages from under secondaries' feet makes me >> feel uneasy, even if technically by the time that happens, they're not >> supposed to be used anyway. This could mean that the patch is no longer >> necessary because we don't use that many fd's any more. > > This last option is interesting. Is it realistic? > I can do it in current release cycle, but i'm not sure if it's too late to do such changes. I guess it's OK since the validation cycle is just starting? I'll throw something together and see if it crashes and burns. -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 13:24 ` Burakov, Anatoly @ 2019-03-29 13:24 ` Burakov, Anatoly 2019-03-29 13:34 ` Thomas Monjalon 2019-03-29 13:35 ` Maxime Coquelin 2 siblings, 0 replies; 25+ messages in thread From: Burakov, Anatoly @ 2019-03-29 13:24 UTC (permalink / raw) To: Thomas Monjalon Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin On 29-Mar-19 12:40 PM, Thomas Monjalon wrote: > 29/03/2019 13:05, Burakov, Anatoly: >> On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: >>> 29/03/2019 11:33, Burakov, Anatoly: >>>> On 29-Mar-19 9:50 AM, David Marchand wrote: >>>>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov >>>>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: >>>>> >>>>> Due to internal glibc limitations [1], DPDK may exhaust internal >>>>> file descriptor limits when using smaller page sizes, which results >>>>> in inability to use system calls such as select() by user >>>>> applications. >>>>> >>>>> While the problem can be worked around using --single-file-segments >>>>> option, it does not work if --legacy-mem mode is also used. Add a >>>>> (yet another) EAL flag to disable storing fd's internally. This >>>>> will sacrifice compability with Virtio with vhost-backend, but >>>>> at least select() and friends will work. >>>>> >>>>> [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html >>>>> >>>>> >>>>> Sorry, I am a bit lost and I never took the time to look in the new >>>>> memory allocation system. >>>>> This gives the impression that we are accumulating workarounds, between >>>>> legacy-mem, single-file-segments, now no-seg-fds. >>>> >>>> Yep. I don't like this any more than you do, but i think there are users >>>> of all of these, so we can't just drop them willy-nilly. My great hope >>>> was that by now everyone would move on to use VFIO so legacy mem >>>> wouldn't be needed (the only reason it exists is to provide >>>> compatibility for use cases where lots of IOVA-contiguous memory is >>>> required, and VFIO cannot be used), but apparently that is too much to >>>> ask :/ >>>> >>>>> >>>>> Iiuc, everything revolves around the need for per page locks. >>>>> Can you summarize why we need them? >>>> >>>> The short answer is multiprocess. We have to be able to map and unmap >>>> pages individually, and for that we need to be sure that we can, in >>>> fact, remove a page because no one else uses it. We also need to store >>>> fd's because virtio with vhost-user backend needs them to work, because >>>> it relies on sharing memory between processes using fd's. >>> >>> It's a pity adding an option to workaround a limitation of a corner case. >>> It adds complexity that we will have to support forever, >>> and it's even not perfect because of vhost. >>> >>> Might there be another solution? >>> >> >> If there is one, i'm all ears. I don't see any solutions aside from >> adding limitations. >> >> For example, we could drop the single/multi file segments mode and just >> make single file segments a default and the only available mode, but >> this has certain risks because older kernels do not support fallocate() >> on hugetlbfs. >> >> We could further draw a line in the sand, and say that, for example, >> 19.11 (or 20.11) will not have legacy mem mode, and everyone should use >> VFIO by now and if you don't it's your own fault. >> >> We could also cut down on the number of fd's we use in single-file >> segments mode by not using locks and simply deleting pages in the >> primary, but yanking out hugepages from under secondaries' feet makes me >> feel uneasy, even if technically by the time that happens, they're not >> supposed to be used anyway. This could mean that the patch is no longer >> necessary because we don't use that many fd's any more. > > This last option is interesting. Is it realistic? > I can do it in current release cycle, but i'm not sure if it's too late to do such changes. I guess it's OK since the validation cycle is just starting? I'll throw something together and see if it crashes and burns. -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 13:24 ` Burakov, Anatoly 2019-03-29 13:24 ` Burakov, Anatoly @ 2019-03-29 13:34 ` Thomas Monjalon 2019-03-29 13:34 ` Thomas Monjalon 2019-03-29 14:21 ` Burakov, Anatoly 2019-03-29 13:35 ` Maxime Coquelin 2 siblings, 2 replies; 25+ messages in thread From: Thomas Monjalon @ 2019-03-29 13:34 UTC (permalink / raw) To: Burakov, Anatoly Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin 29/03/2019 14:24, Burakov, Anatoly: > On 29-Mar-19 12:40 PM, Thomas Monjalon wrote: > > 29/03/2019 13:05, Burakov, Anatoly: > >> On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: > >>> 29/03/2019 11:33, Burakov, Anatoly: > >>>> On 29-Mar-19 9:50 AM, David Marchand wrote: > >>>>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov > >>>>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: > >>>>> > >>>>> Due to internal glibc limitations [1], DPDK may exhaust internal > >>>>> file descriptor limits when using smaller page sizes, which results > >>>>> in inability to use system calls such as select() by user > >>>>> applications. > >>>>> > >>>>> While the problem can be worked around using --single-file-segments > >>>>> option, it does not work if --legacy-mem mode is also used. Add a > >>>>> (yet another) EAL flag to disable storing fd's internally. This > >>>>> will sacrifice compability with Virtio with vhost-backend, but > >>>>> at least select() and friends will work. > >>>>> > >>>>> [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html > >>>>> > >>>>> > >>>>> Sorry, I am a bit lost and I never took the time to look in the new > >>>>> memory allocation system. > >>>>> This gives the impression that we are accumulating workarounds, between > >>>>> legacy-mem, single-file-segments, now no-seg-fds. > >>>> > >>>> Yep. I don't like this any more than you do, but i think there are users > >>>> of all of these, so we can't just drop them willy-nilly. My great hope > >>>> was that by now everyone would move on to use VFIO so legacy mem > >>>> wouldn't be needed (the only reason it exists is to provide > >>>> compatibility for use cases where lots of IOVA-contiguous memory is > >>>> required, and VFIO cannot be used), but apparently that is too much to > >>>> ask :/ > >>>> > >>>>> > >>>>> Iiuc, everything revolves around the need for per page locks. > >>>>> Can you summarize why we need them? > >>>> > >>>> The short answer is multiprocess. We have to be able to map and unmap > >>>> pages individually, and for that we need to be sure that we can, in > >>>> fact, remove a page because no one else uses it. We also need to store > >>>> fd's because virtio with vhost-user backend needs them to work, because > >>>> it relies on sharing memory between processes using fd's. > >>> > >>> It's a pity adding an option to workaround a limitation of a corner case. > >>> It adds complexity that we will have to support forever, > >>> and it's even not perfect because of vhost. > >>> > >>> Might there be another solution? > >>> > >> > >> If there is one, i'm all ears. I don't see any solutions aside from > >> adding limitations. > >> > >> For example, we could drop the single/multi file segments mode and just > >> make single file segments a default and the only available mode, but > >> this has certain risks because older kernels do not support fallocate() > >> on hugetlbfs. > >> > >> We could further draw a line in the sand, and say that, for example, > >> 19.11 (or 20.11) will not have legacy mem mode, and everyone should use > >> VFIO by now and if you don't it's your own fault. > >> > >> We could also cut down on the number of fd's we use in single-file > >> segments mode by not using locks and simply deleting pages in the > >> primary, but yanking out hugepages from under secondaries' feet makes me > >> feel uneasy, even if technically by the time that happens, they're not > >> supposed to be used anyway. This could mean that the patch is no longer > >> necessary because we don't use that many fd's any more. > > > > This last option is interesting. Is it realistic? > > > > I can do it in current release cycle, but i'm not sure if it's too late > to do such changes. I guess it's OK since the validation cycle is just > starting? I'll throw something together and see if it crashes and burns. OK let's try that. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 13:34 ` Thomas Monjalon @ 2019-03-29 13:34 ` Thomas Monjalon 2019-03-29 14:21 ` Burakov, Anatoly 1 sibling, 0 replies; 25+ messages in thread From: Thomas Monjalon @ 2019-03-29 13:34 UTC (permalink / raw) To: Burakov, Anatoly Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin 29/03/2019 14:24, Burakov, Anatoly: > On 29-Mar-19 12:40 PM, Thomas Monjalon wrote: > > 29/03/2019 13:05, Burakov, Anatoly: > >> On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: > >>> 29/03/2019 11:33, Burakov, Anatoly: > >>>> On 29-Mar-19 9:50 AM, David Marchand wrote: > >>>>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov > >>>>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: > >>>>> > >>>>> Due to internal glibc limitations [1], DPDK may exhaust internal > >>>>> file descriptor limits when using smaller page sizes, which results > >>>>> in inability to use system calls such as select() by user > >>>>> applications. > >>>>> > >>>>> While the problem can be worked around using --single-file-segments > >>>>> option, it does not work if --legacy-mem mode is also used. Add a > >>>>> (yet another) EAL flag to disable storing fd's internally. This > >>>>> will sacrifice compability with Virtio with vhost-backend, but > >>>>> at least select() and friends will work. > >>>>> > >>>>> [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html > >>>>> > >>>>> > >>>>> Sorry, I am a bit lost and I never took the time to look in the new > >>>>> memory allocation system. > >>>>> This gives the impression that we are accumulating workarounds, between > >>>>> legacy-mem, single-file-segments, now no-seg-fds. > >>>> > >>>> Yep. I don't like this any more than you do, but i think there are users > >>>> of all of these, so we can't just drop them willy-nilly. My great hope > >>>> was that by now everyone would move on to use VFIO so legacy mem > >>>> wouldn't be needed (the only reason it exists is to provide > >>>> compatibility for use cases where lots of IOVA-contiguous memory is > >>>> required, and VFIO cannot be used), but apparently that is too much to > >>>> ask :/ > >>>> > >>>>> > >>>>> Iiuc, everything revolves around the need for per page locks. > >>>>> Can you summarize why we need them? > >>>> > >>>> The short answer is multiprocess. We have to be able to map and unmap > >>>> pages individually, and for that we need to be sure that we can, in > >>>> fact, remove a page because no one else uses it. We also need to store > >>>> fd's because virtio with vhost-user backend needs them to work, because > >>>> it relies on sharing memory between processes using fd's. > >>> > >>> It's a pity adding an option to workaround a limitation of a corner case. > >>> It adds complexity that we will have to support forever, > >>> and it's even not perfect because of vhost. > >>> > >>> Might there be another solution? > >>> > >> > >> If there is one, i'm all ears. I don't see any solutions aside from > >> adding limitations. > >> > >> For example, we could drop the single/multi file segments mode and just > >> make single file segments a default and the only available mode, but > >> this has certain risks because older kernels do not support fallocate() > >> on hugetlbfs. > >> > >> We could further draw a line in the sand, and say that, for example, > >> 19.11 (or 20.11) will not have legacy mem mode, and everyone should use > >> VFIO by now and if you don't it's your own fault. > >> > >> We could also cut down on the number of fd's we use in single-file > >> segments mode by not using locks and simply deleting pages in the > >> primary, but yanking out hugepages from under secondaries' feet makes me > >> feel uneasy, even if technically by the time that happens, they're not > >> supposed to be used anyway. This could mean that the patch is no longer > >> necessary because we don't use that many fd's any more. > > > > This last option is interesting. Is it realistic? > > > > I can do it in current release cycle, but i'm not sure if it's too late > to do such changes. I guess it's OK since the validation cycle is just > starting? I'll throw something together and see if it crashes and burns. OK let's try that. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 13:34 ` Thomas Monjalon 2019-03-29 13:34 ` Thomas Monjalon @ 2019-03-29 14:21 ` Burakov, Anatoly 2019-03-29 14:21 ` Burakov, Anatoly 1 sibling, 1 reply; 25+ messages in thread From: Burakov, Anatoly @ 2019-03-29 14:21 UTC (permalink / raw) To: Thomas Monjalon Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin On 29-Mar-19 1:34 PM, Thomas Monjalon wrote: > 29/03/2019 14:24, Burakov, Anatoly: >> On 29-Mar-19 12:40 PM, Thomas Monjalon wrote: >>> 29/03/2019 13:05, Burakov, Anatoly: >>>> On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: >>>>> 29/03/2019 11:33, Burakov, Anatoly: >>>>>> On 29-Mar-19 9:50 AM, David Marchand wrote: >>>>>>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov >>>>>>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: >>>>>>> >>>>>>> Due to internal glibc limitations [1], DPDK may exhaust internal >>>>>>> file descriptor limits when using smaller page sizes, which results >>>>>>> in inability to use system calls such as select() by user >>>>>>> applications. >>>>>>> >>>>>>> While the problem can be worked around using --single-file-segments >>>>>>> option, it does not work if --legacy-mem mode is also used. Add a >>>>>>> (yet another) EAL flag to disable storing fd's internally. This >>>>>>> will sacrifice compability with Virtio with vhost-backend, but >>>>>>> at least select() and friends will work. >>>>>>> >>>>>>> [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html >>>>>>> >>>>>>> >>>>>>> Sorry, I am a bit lost and I never took the time to look in the new >>>>>>> memory allocation system. >>>>>>> This gives the impression that we are accumulating workarounds, between >>>>>>> legacy-mem, single-file-segments, now no-seg-fds. >>>>>> >>>>>> Yep. I don't like this any more than you do, but i think there are users >>>>>> of all of these, so we can't just drop them willy-nilly. My great hope >>>>>> was that by now everyone would move on to use VFIO so legacy mem >>>>>> wouldn't be needed (the only reason it exists is to provide >>>>>> compatibility for use cases where lots of IOVA-contiguous memory is >>>>>> required, and VFIO cannot be used), but apparently that is too much to >>>>>> ask :/ >>>>>> >>>>>>> >>>>>>> Iiuc, everything revolves around the need for per page locks. >>>>>>> Can you summarize why we need them? >>>>>> >>>>>> The short answer is multiprocess. We have to be able to map and unmap >>>>>> pages individually, and for that we need to be sure that we can, in >>>>>> fact, remove a page because no one else uses it. We also need to store >>>>>> fd's because virtio with vhost-user backend needs them to work, because >>>>>> it relies on sharing memory between processes using fd's. >>>>> >>>>> It's a pity adding an option to workaround a limitation of a corner case. >>>>> It adds complexity that we will have to support forever, >>>>> and it's even not perfect because of vhost. >>>>> >>>>> Might there be another solution? >>>>> >>>> >>>> If there is one, i'm all ears. I don't see any solutions aside from >>>> adding limitations. >>>> >>>> For example, we could drop the single/multi file segments mode and just >>>> make single file segments a default and the only available mode, but >>>> this has certain risks because older kernels do not support fallocate() >>>> on hugetlbfs. >>>> >>>> We could further draw a line in the sand, and say that, for example, >>>> 19.11 (or 20.11) will not have legacy mem mode, and everyone should use >>>> VFIO by now and if you don't it's your own fault. >>>> >>>> We could also cut down on the number of fd's we use in single-file >>>> segments mode by not using locks and simply deleting pages in the >>>> primary, but yanking out hugepages from under secondaries' feet makes me >>>> feel uneasy, even if technically by the time that happens, they're not >>>> supposed to be used anyway. This could mean that the patch is no longer >>>> necessary because we don't use that many fd's any more. >>> >>> This last option is interesting. Is it realistic? >>> >> >> I can do it in current release cycle, but i'm not sure if it's too late >> to do such changes. I guess it's OK since the validation cycle is just >> starting? I'll throw something together and see if it crashes and burns. > > OK let's try that. > Bear in mind though that this will not work for legacy mem mode, because it cannot use single file segments mode without significant rework of page allocation code. So, legacy mem mode will still have this issue, unless we make it non-compatible with virtio with vhost-user backend. -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 14:21 ` Burakov, Anatoly @ 2019-03-29 14:21 ` Burakov, Anatoly 0 siblings, 0 replies; 25+ messages in thread From: Burakov, Anatoly @ 2019-03-29 14:21 UTC (permalink / raw) To: Thomas Monjalon Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung, maxime.coquelin On 29-Mar-19 1:34 PM, Thomas Monjalon wrote: > 29/03/2019 14:24, Burakov, Anatoly: >> On 29-Mar-19 12:40 PM, Thomas Monjalon wrote: >>> 29/03/2019 13:05, Burakov, Anatoly: >>>> On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: >>>>> 29/03/2019 11:33, Burakov, Anatoly: >>>>>> On 29-Mar-19 9:50 AM, David Marchand wrote: >>>>>>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov >>>>>>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: >>>>>>> >>>>>>> Due to internal glibc limitations [1], DPDK may exhaust internal >>>>>>> file descriptor limits when using smaller page sizes, which results >>>>>>> in inability to use system calls such as select() by user >>>>>>> applications. >>>>>>> >>>>>>> While the problem can be worked around using --single-file-segments >>>>>>> option, it does not work if --legacy-mem mode is also used. Add a >>>>>>> (yet another) EAL flag to disable storing fd's internally. This >>>>>>> will sacrifice compability with Virtio with vhost-backend, but >>>>>>> at least select() and friends will work. >>>>>>> >>>>>>> [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html >>>>>>> >>>>>>> >>>>>>> Sorry, I am a bit lost and I never took the time to look in the new >>>>>>> memory allocation system. >>>>>>> This gives the impression that we are accumulating workarounds, between >>>>>>> legacy-mem, single-file-segments, now no-seg-fds. >>>>>> >>>>>> Yep. I don't like this any more than you do, but i think there are users >>>>>> of all of these, so we can't just drop them willy-nilly. My great hope >>>>>> was that by now everyone would move on to use VFIO so legacy mem >>>>>> wouldn't be needed (the only reason it exists is to provide >>>>>> compatibility for use cases where lots of IOVA-contiguous memory is >>>>>> required, and VFIO cannot be used), but apparently that is too much to >>>>>> ask :/ >>>>>> >>>>>>> >>>>>>> Iiuc, everything revolves around the need for per page locks. >>>>>>> Can you summarize why we need them? >>>>>> >>>>>> The short answer is multiprocess. We have to be able to map and unmap >>>>>> pages individually, and for that we need to be sure that we can, in >>>>>> fact, remove a page because no one else uses it. We also need to store >>>>>> fd's because virtio with vhost-user backend needs them to work, because >>>>>> it relies on sharing memory between processes using fd's. >>>>> >>>>> It's a pity adding an option to workaround a limitation of a corner case. >>>>> It adds complexity that we will have to support forever, >>>>> and it's even not perfect because of vhost. >>>>> >>>>> Might there be another solution? >>>>> >>>> >>>> If there is one, i'm all ears. I don't see any solutions aside from >>>> adding limitations. >>>> >>>> For example, we could drop the single/multi file segments mode and just >>>> make single file segments a default and the only available mode, but >>>> this has certain risks because older kernels do not support fallocate() >>>> on hugetlbfs. >>>> >>>> We could further draw a line in the sand, and say that, for example, >>>> 19.11 (or 20.11) will not have legacy mem mode, and everyone should use >>>> VFIO by now and if you don't it's your own fault. >>>> >>>> We could also cut down on the number of fd's we use in single-file >>>> segments mode by not using locks and simply deleting pages in the >>>> primary, but yanking out hugepages from under secondaries' feet makes me >>>> feel uneasy, even if technically by the time that happens, they're not >>>> supposed to be used anyway. This could mean that the patch is no longer >>>> necessary because we don't use that many fd's any more. >>> >>> This last option is interesting. Is it realistic? >>> >> >> I can do it in current release cycle, but i'm not sure if it's too late >> to do such changes. I guess it's OK since the validation cycle is just >> starting? I'll throw something together and see if it crashes and burns. > > OK let's try that. > Bear in mind though that this will not work for legacy mem mode, because it cannot use single file segments mode without significant rework of page allocation code. So, legacy mem mode will still have this issue, unless we make it non-compatible with virtio with vhost-user backend. -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 13:24 ` Burakov, Anatoly 2019-03-29 13:24 ` Burakov, Anatoly 2019-03-29 13:34 ` Thomas Monjalon @ 2019-03-29 13:35 ` Maxime Coquelin 2019-03-29 13:35 ` Maxime Coquelin 2 siblings, 1 reply; 25+ messages in thread From: Maxime Coquelin @ 2019-03-29 13:35 UTC (permalink / raw) To: Burakov, Anatoly, Thomas Monjalon Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung On 3/29/19 2:24 PM, Burakov, Anatoly wrote: > On 29-Mar-19 12:40 PM, Thomas Monjalon wrote: >> 29/03/2019 13:05, Burakov, Anatoly: >>> On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: >>>> 29/03/2019 11:33, Burakov, Anatoly: >>>>> On 29-Mar-19 9:50 AM, David Marchand wrote: >>>>>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov >>>>>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: >>>>>> >>>>>> Due to internal glibc limitations [1], DPDK may exhaust >>>>>> internal >>>>>> file descriptor limits when using smaller page sizes, which >>>>>> results >>>>>> in inability to use system calls such as select() by user >>>>>> applications. >>>>>> >>>>>> While the problem can be worked around using >>>>>> --single-file-segments >>>>>> option, it does not work if --legacy-mem mode is also used. >>>>>> Add a >>>>>> (yet another) EAL flag to disable storing fd's internally. This >>>>>> will sacrifice compability with Virtio with vhost-backend, but >>>>>> at least select() and friends will work. >>>>>> >>>>>> [1] >>>>>> https://mails.dpdk.org/archives/dev/2019-February/124386.html >>>>>> >>>>>> >>>>>> Sorry, I am a bit lost and I never took the time to look in the new >>>>>> memory allocation system. >>>>>> This gives the impression that we are accumulating workarounds, >>>>>> between >>>>>> legacy-mem, single-file-segments, now no-seg-fds. >>>>> >>>>> Yep. I don't like this any more than you do, but i think there are >>>>> users >>>>> of all of these, so we can't just drop them willy-nilly. My great hope >>>>> was that by now everyone would move on to use VFIO so legacy mem >>>>> wouldn't be needed (the only reason it exists is to provide >>>>> compatibility for use cases where lots of IOVA-contiguous memory is >>>>> required, and VFIO cannot be used), but apparently that is too much to >>>>> ask :/ >>>>> >>>>>> >>>>>> Iiuc, everything revolves around the need for per page locks. >>>>>> Can you summarize why we need them? >>>>> >>>>> The short answer is multiprocess. We have to be able to map and unmap >>>>> pages individually, and for that we need to be sure that we can, in >>>>> fact, remove a page because no one else uses it. We also need to store >>>>> fd's because virtio with vhost-user backend needs them to work, >>>>> because >>>>> it relies on sharing memory between processes using fd's. I guess you mean virtio-user. Have you looked how Qemu does to share the guest memory with external process like vhost-user backend? It works quite well with 2MB pages, even with large VMs. >>>> >>>> It's a pity adding an option to workaround a limitation of a corner >>>> case. >>>> It adds complexity that we will have to support forever, >>>> and it's even not perfect because of vhost. >>>> >>>> Might there be another solution? >>>> >>> >>> If there is one, i'm all ears. I don't see any solutions aside from >>> adding limitations. >>> >>> For example, we could drop the single/multi file segments mode and just >>> make single file segments a default and the only available mode, but >>> this has certain risks because older kernels do not support fallocate() >>> on hugetlbfs. >>> >>> We could further draw a line in the sand, and say that, for example, >>> 19.11 (or 20.11) will not have legacy mem mode, and everyone should use >>> VFIO by now and if you don't it's your own fault. >>> >>> We could also cut down on the number of fd's we use in single-file >>> segments mode by not using locks and simply deleting pages in the >>> primary, but yanking out hugepages from under secondaries' feet makes me >>> feel uneasy, even if technically by the time that happens, they're not >>> supposed to be used anyway. This could mean that the patch is no longer >>> necessary because we don't use that many fd's any more. >> >> This last option is interesting. Is it realistic? >> > > I can do it in current release cycle, but i'm not sure if it's too late > to do such changes. I guess it's OK since the validation cycle is just > starting? I'll throw something together and see if it crashes and burns. > Reducing the number of FDs is really important IMHO, as the application using the DPDK library could also need several FDs for other purpose. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH] eal: add option to not store segment fd's 2019-03-29 13:35 ` Maxime Coquelin @ 2019-03-29 13:35 ` Maxime Coquelin 0 siblings, 0 replies; 25+ messages in thread From: Maxime Coquelin @ 2019-03-29 13:35 UTC (permalink / raw) To: Burakov, Anatoly, Thomas Monjalon Cc: David Marchand, dev, John McNamara, Marko Kovacevic, iain.barker, edwin.leung On 3/29/19 2:24 PM, Burakov, Anatoly wrote: > On 29-Mar-19 12:40 PM, Thomas Monjalon wrote: >> 29/03/2019 13:05, Burakov, Anatoly: >>> On 29-Mar-19 11:34 AM, Thomas Monjalon wrote: >>>> 29/03/2019 11:33, Burakov, Anatoly: >>>>> On 29-Mar-19 9:50 AM, David Marchand wrote: >>>>>> On Fri, Feb 22, 2019 at 6:12 PM Anatoly Burakov >>>>>> <anatoly.burakov@intel.com <mailto:anatoly.burakov@intel.com>> wrote: >>>>>> >>>>>> Due to internal glibc limitations [1], DPDK may exhaust >>>>>> internal >>>>>> file descriptor limits when using smaller page sizes, which >>>>>> results >>>>>> in inability to use system calls such as select() by user >>>>>> applications. >>>>>> >>>>>> While the problem can be worked around using >>>>>> --single-file-segments >>>>>> option, it does not work if --legacy-mem mode is also used. >>>>>> Add a >>>>>> (yet another) EAL flag to disable storing fd's internally. This >>>>>> will sacrifice compability with Virtio with vhost-backend, but >>>>>> at least select() and friends will work. >>>>>> >>>>>> [1] >>>>>> https://mails.dpdk.org/archives/dev/2019-February/124386.html >>>>>> >>>>>> >>>>>> Sorry, I am a bit lost and I never took the time to look in the new >>>>>> memory allocation system. >>>>>> This gives the impression that we are accumulating workarounds, >>>>>> between >>>>>> legacy-mem, single-file-segments, now no-seg-fds. >>>>> >>>>> Yep. I don't like this any more than you do, but i think there are >>>>> users >>>>> of all of these, so we can't just drop them willy-nilly. My great hope >>>>> was that by now everyone would move on to use VFIO so legacy mem >>>>> wouldn't be needed (the only reason it exists is to provide >>>>> compatibility for use cases where lots of IOVA-contiguous memory is >>>>> required, and VFIO cannot be used), but apparently that is too much to >>>>> ask :/ >>>>> >>>>>> >>>>>> Iiuc, everything revolves around the need for per page locks. >>>>>> Can you summarize why we need them? >>>>> >>>>> The short answer is multiprocess. We have to be able to map and unmap >>>>> pages individually, and for that we need to be sure that we can, in >>>>> fact, remove a page because no one else uses it. We also need to store >>>>> fd's because virtio with vhost-user backend needs them to work, >>>>> because >>>>> it relies on sharing memory between processes using fd's. I guess you mean virtio-user. Have you looked how Qemu does to share the guest memory with external process like vhost-user backend? It works quite well with 2MB pages, even with large VMs. >>>> >>>> It's a pity adding an option to workaround a limitation of a corner >>>> case. >>>> It adds complexity that we will have to support forever, >>>> and it's even not perfect because of vhost. >>>> >>>> Might there be another solution? >>>> >>> >>> If there is one, i'm all ears. I don't see any solutions aside from >>> adding limitations. >>> >>> For example, we could drop the single/multi file segments mode and just >>> make single file segments a default and the only available mode, but >>> this has certain risks because older kernels do not support fallocate() >>> on hugetlbfs. >>> >>> We could further draw a line in the sand, and say that, for example, >>> 19.11 (or 20.11) will not have legacy mem mode, and everyone should use >>> VFIO by now and if you don't it's your own fault. >>> >>> We could also cut down on the number of fd's we use in single-file >>> segments mode by not using locks and simply deleting pages in the >>> primary, but yanking out hugepages from under secondaries' feet makes me >>> feel uneasy, even if technically by the time that happens, they're not >>> supposed to be used anyway. This could mean that the patch is no longer >>> necessary because we don't use that many fd's any more. >> >> This last option is interesting. Is it realistic? >> > > I can do it in current release cycle, but i'm not sure if it's too late > to do such changes. I guess it's OK since the validation cycle is just > starting? I'll throw something together and see if it crashes and burns. > Reducing the number of FDs is really important IMHO, as the application using the DPDK library could also need several FDs for other purpose. ^ permalink raw reply [flat|nested] 25+ messages in thread
* [dpdk-dev] [PATCH v2 1/2] memalloc: refactor segment resizing code 2019-02-22 17:12 [dpdk-dev] [PATCH] eal: add option to not store segment fd's Anatoly Burakov 2019-03-29 9:50 ` David Marchand @ 2019-03-29 17:55 ` Anatoly Burakov 2019-03-29 17:55 ` Anatoly Burakov 2019-03-29 17:55 ` [dpdk-dev] [PATCH v2 2/2] memalloc: do not use lockfiles for single file segments mode Anatoly Burakov 2 siblings, 1 reply; 25+ messages in thread From: Anatoly Burakov @ 2019-03-29 17:55 UTC (permalink / raw) To: dev; +Cc: david.marchand, thomas, maxime.coquelin Currently, segment resizing code sits in one giant function which handles both in-memory and regular modes. Split them up into individual functions. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- lib/librte_eal/linux/eal/eal_memalloc.c | 72 +++++++++++++++---------- 1 file changed, 45 insertions(+), 27 deletions(-) diff --git a/lib/librte_eal/linux/eal/eal_memalloc.c b/lib/librte_eal/linux/eal/eal_memalloc.c index b6fb183db..e90fe6c95 100644 --- a/lib/librte_eal/linux/eal/eal_memalloc.c +++ b/lib/librte_eal/linux/eal/eal_memalloc.c @@ -425,37 +425,37 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi, } static int -resize_hugefile(int fd, char *path, int list_idx, int seg_idx, +resize_hugefile_in_memory(int fd, int list_idx, uint64_t fa_offset, + uint64_t page_sz, bool grow) +{ + int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE | + FALLOC_FL_KEEP_SIZE; + int ret; + + /* grow or shrink the file */ + ret = fallocate(fd, flags, fa_offset, page_sz); + + if (ret < 0) { + RTE_LOG(DEBUG, EAL, "%s(): fallocate() failed: %s\n", + __func__, + strerror(errno)); + return -1; + } + /* increase/decrease total segment count */ + fd_list[list_idx].count += (grow ? 1 : -1); + if (!grow && fd_list[list_idx].count == 0) { + close(fd_list[list_idx].memseg_list_fd); + fd_list[list_idx].memseg_list_fd = -1; + } + return 0; +} + +static int +resize_hugefile_in_filesystem(int fd, char *path, int list_idx, int seg_idx, uint64_t fa_offset, uint64_t page_sz, bool grow) { bool again = false; - /* in-memory mode is a special case, because we don't need to perform - * any locking, and we can be sure that fallocate() is supported. - */ - if (internal_config.in_memory) { - int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE | - FALLOC_FL_KEEP_SIZE; - int ret; - - /* grow or shrink the file */ - ret = fallocate(fd, flags, fa_offset, page_sz); - - if (ret < 0) { - RTE_LOG(DEBUG, EAL, "%s(): fallocate() failed: %s\n", - __func__, - strerror(errno)); - return -1; - } - /* increase/decrease total segment count */ - fd_list[list_idx].count += (grow ? 1 : -1); - if (!grow && fd_list[list_idx].count == 0) { - close(fd_list[list_idx].memseg_list_fd); - fd_list[list_idx].memseg_list_fd = -1; - } - return 0; - } - do { if (fallocate_supported == 0) { /* we cannot deallocate memory if fallocate() is not @@ -583,9 +583,27 @@ resize_hugefile(int fd, char *path, int list_idx, int seg_idx, } } } while (again); + return 0; } + +static int +resize_hugefile(int fd, char *path, int list_idx, int seg_idx, + uint64_t fa_offset, uint64_t page_sz, bool grow) +{ + + /* in-memory mode is a special case, because we don't need to perform + * any locking, and we can be sure that fallocate() is supported. + */ + if (internal_config.in_memory) + return resize_hugefile_in_memory(fd, list_idx, fa_offset, + page_sz, grow); + + return resize_hugefile_in_filesystem(fd, path, list_idx, seg_idx, + fa_offset, page_sz, grow); +} + static int alloc_seg(struct rte_memseg *ms, void *addr, int socket_id, struct hugepage_info *hi, unsigned int list_idx, -- 2.17.1 ^ permalink raw reply [flat|nested] 25+ messages in thread
* [dpdk-dev] [PATCH v2 1/2] memalloc: refactor segment resizing code 2019-03-29 17:55 ` [dpdk-dev] [PATCH v2 1/2] memalloc: refactor segment resizing code Anatoly Burakov @ 2019-03-29 17:55 ` Anatoly Burakov 0 siblings, 0 replies; 25+ messages in thread From: Anatoly Burakov @ 2019-03-29 17:55 UTC (permalink / raw) To: dev; +Cc: david.marchand, thomas, maxime.coquelin Currently, segment resizing code sits in one giant function which handles both in-memory and regular modes. Split them up into individual functions. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- lib/librte_eal/linux/eal/eal_memalloc.c | 72 +++++++++++++++---------- 1 file changed, 45 insertions(+), 27 deletions(-) diff --git a/lib/librte_eal/linux/eal/eal_memalloc.c b/lib/librte_eal/linux/eal/eal_memalloc.c index b6fb183db..e90fe6c95 100644 --- a/lib/librte_eal/linux/eal/eal_memalloc.c +++ b/lib/librte_eal/linux/eal/eal_memalloc.c @@ -425,37 +425,37 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi, } static int -resize_hugefile(int fd, char *path, int list_idx, int seg_idx, +resize_hugefile_in_memory(int fd, int list_idx, uint64_t fa_offset, + uint64_t page_sz, bool grow) +{ + int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE | + FALLOC_FL_KEEP_SIZE; + int ret; + + /* grow or shrink the file */ + ret = fallocate(fd, flags, fa_offset, page_sz); + + if (ret < 0) { + RTE_LOG(DEBUG, EAL, "%s(): fallocate() failed: %s\n", + __func__, + strerror(errno)); + return -1; + } + /* increase/decrease total segment count */ + fd_list[list_idx].count += (grow ? 1 : -1); + if (!grow && fd_list[list_idx].count == 0) { + close(fd_list[list_idx].memseg_list_fd); + fd_list[list_idx].memseg_list_fd = -1; + } + return 0; +} + +static int +resize_hugefile_in_filesystem(int fd, char *path, int list_idx, int seg_idx, uint64_t fa_offset, uint64_t page_sz, bool grow) { bool again = false; - /* in-memory mode is a special case, because we don't need to perform - * any locking, and we can be sure that fallocate() is supported. - */ - if (internal_config.in_memory) { - int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE | - FALLOC_FL_KEEP_SIZE; - int ret; - - /* grow or shrink the file */ - ret = fallocate(fd, flags, fa_offset, page_sz); - - if (ret < 0) { - RTE_LOG(DEBUG, EAL, "%s(): fallocate() failed: %s\n", - __func__, - strerror(errno)); - return -1; - } - /* increase/decrease total segment count */ - fd_list[list_idx].count += (grow ? 1 : -1); - if (!grow && fd_list[list_idx].count == 0) { - close(fd_list[list_idx].memseg_list_fd); - fd_list[list_idx].memseg_list_fd = -1; - } - return 0; - } - do { if (fallocate_supported == 0) { /* we cannot deallocate memory if fallocate() is not @@ -583,9 +583,27 @@ resize_hugefile(int fd, char *path, int list_idx, int seg_idx, } } } while (again); + return 0; } + +static int +resize_hugefile(int fd, char *path, int list_idx, int seg_idx, + uint64_t fa_offset, uint64_t page_sz, bool grow) +{ + + /* in-memory mode is a special case, because we don't need to perform + * any locking, and we can be sure that fallocate() is supported. + */ + if (internal_config.in_memory) + return resize_hugefile_in_memory(fd, list_idx, fa_offset, + page_sz, grow); + + return resize_hugefile_in_filesystem(fd, path, list_idx, seg_idx, + fa_offset, page_sz, grow); +} + static int alloc_seg(struct rte_memseg *ms, void *addr, int socket_id, struct hugepage_info *hi, unsigned int list_idx, -- 2.17.1 ^ permalink raw reply [flat|nested] 25+ messages in thread
* [dpdk-dev] [PATCH v2 2/2] memalloc: do not use lockfiles for single file segments mode 2019-02-22 17:12 [dpdk-dev] [PATCH] eal: add option to not store segment fd's Anatoly Burakov 2019-03-29 9:50 ` David Marchand 2019-03-29 17:55 ` [dpdk-dev] [PATCH v2 1/2] memalloc: refactor segment resizing code Anatoly Burakov @ 2019-03-29 17:55 ` Anatoly Burakov 2019-03-29 17:55 ` Anatoly Burakov 2019-04-02 14:08 ` Thomas Monjalon 2 siblings, 2 replies; 25+ messages in thread From: Anatoly Burakov @ 2019-03-29 17:55 UTC (permalink / raw) To: dev Cc: John McNamara, Marko Kovacevic, david.marchand, thomas, maxime.coquelin Due to internal glibc limitations [1], DPDK may exhaust internal file descriptor limits when using smaller page sizes, which results in inability to use system calls such as select() by user applications. Single file segments option stores lock files per page to ensure that pages are deleted when there are no more users, however this is not necessary because the processes will be holding onto the pages anyway because of mmap(). Thus, removing pages from the filesystem is safe even though they may be used by some other secondary process. As a result, single file segments mode no longer stores inordinate amounts of segment fd's, and the above issue with fd limits is solved. However, this will not work for legacy mem mode. For that, simply document that using bigger page sizes is the only option. [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- .../prog_guide/env_abstraction_layer.rst | 18 ++ lib/librte_eal/common/eal_filesystem.h | 11 - lib/librte_eal/linux/eal/eal_memalloc.c | 285 +++++------------- 3 files changed, 96 insertions(+), 218 deletions(-) diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst index 2361c3b8f..faa5f6a0d 100644 --- a/doc/guides/prog_guide/env_abstraction_layer.rst +++ b/doc/guides/prog_guide/env_abstraction_layer.rst @@ -214,6 +214,24 @@ Normally, these options do not need to be changed. can later be mapped into that preallocated VA space (if dynamic memory mode is enabled), and can optionally be mapped into it at startup. ++ Segment file descriptors + +On Linux, in most cases, EAL will store segment file descriptors in EAL. This +can become a problem when using smaller page sizes due to underlying limitations +of ``glibc`` library. For example, Linux API calls such as ``select()`` may not +work correctly because ``glibc`` does not support more than certain number of +file descriptors. + +There are two possible solutions for this problem. The recommended solution is +to use ``--single-file-segments`` mode, as that mode will not use a file +descriptor per each page, and it will keep compatibility with Virtio with +vhost-user backend. This option is not available when using ``--legacy-mem`` +mode. + +Another option is to use bigger page sizes. Since fewer pages are required to +cover the same memory area, fewer file descriptors will be stored internally +by EAL. + Support for Externally Allocated Memory ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/lib/librte_eal/common/eal_filesystem.h b/lib/librte_eal/common/eal_filesystem.h index 89a3added..f2f83e712 100644 --- a/lib/librte_eal/common/eal_filesystem.h +++ b/lib/librte_eal/common/eal_filesystem.h @@ -98,17 +98,6 @@ eal_get_hugefile_path(char *buffer, size_t buflen, const char *hugedir, int f_id return buffer; } -/** String format for hugepage map lock files. */ -#define HUGEFILE_LOCK_FMT "%s/map_%d.lock" -static inline const char * -eal_get_hugefile_lock_path(char *buffer, size_t buflen, int f_id) -{ - snprintf(buffer, buflen, HUGEFILE_LOCK_FMT, rte_eal_get_runtime_dir(), - f_id); - buffer[buflen - 1] = '\0'; - return buffer; -} - /** define the default filename prefix for the %s values above */ #define HUGEFILE_PREFIX_DEFAULT "rte" diff --git a/lib/librte_eal/linux/eal/eal_memalloc.c b/lib/librte_eal/linux/eal/eal_memalloc.c index e90fe6c95..4c51fb8ca 100644 --- a/lib/librte_eal/linux/eal/eal_memalloc.c +++ b/lib/librte_eal/linux/eal/eal_memalloc.c @@ -83,13 +83,8 @@ static int fallocate_supported = -1; /* unknown */ /* * we have two modes - single file segments, and file-per-page mode. * - * for single-file segments, we need some kind of mechanism to keep track of - * which hugepages can be freed back to the system, and which cannot. we cannot - * use flock() because they don't allow locking parts of a file, and we cannot - * use fcntl() due to issues with their semantics, so we will have to rely on a - * bunch of lockfiles for each page. so, we will use 'fds' array to keep track - * of per-page lockfiles. we will store the actual segment list fd in the - * 'memseg_list_fd' field. + * for single-file segments, we use memseg_list_fd to store the segment fd, + * while the fds[] will not be allocated, and len will be set to 0. * * for file-per-page mode, each page will have its own fd, so 'memseg_list_fd' * will be invalid (set to -1), and we'll use 'fds' to keep track of page fd's. @@ -245,80 +240,6 @@ static int lock(int fd, int type) return 1; } -static int get_segment_lock_fd(int list_idx, int seg_idx) -{ - char path[PATH_MAX] = {0}; - int fd; - - if (list_idx < 0 || list_idx >= (int)RTE_DIM(fd_list)) - return -1; - if (seg_idx < 0 || seg_idx >= fd_list[list_idx].len) - return -1; - - fd = fd_list[list_idx].fds[seg_idx]; - /* does this lock already exist? */ - if (fd >= 0) - return fd; - - eal_get_hugefile_lock_path(path, sizeof(path), - list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx); - - fd = open(path, O_CREAT | O_RDWR, 0660); - if (fd < 0) { - RTE_LOG(ERR, EAL, "%s(): error creating lockfile '%s': %s\n", - __func__, path, strerror(errno)); - return -1; - } - /* take out a read lock */ - if (lock(fd, LOCK_SH) != 1) { - RTE_LOG(ERR, EAL, "%s(): failed to take out a readlock on '%s': %s\n", - __func__, path, strerror(errno)); - close(fd); - return -1; - } - /* store it for future reference */ - fd_list[list_idx].fds[seg_idx] = fd; - fd_list[list_idx].count++; - return fd; -} - -static int unlock_segment(int list_idx, int seg_idx) -{ - int fd, ret; - - if (list_idx < 0 || list_idx >= (int)RTE_DIM(fd_list)) - return -1; - if (seg_idx < 0 || seg_idx >= fd_list[list_idx].len) - return -1; - - fd = fd_list[list_idx].fds[seg_idx]; - - /* upgrade lock to exclusive to see if we can remove the lockfile */ - ret = lock(fd, LOCK_EX); - if (ret == 1) { - /* we've succeeded in taking exclusive lock, this lockfile may - * be removed. - */ - char path[PATH_MAX] = {0}; - eal_get_hugefile_lock_path(path, sizeof(path), - list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx); - if (unlink(path)) { - RTE_LOG(ERR, EAL, "%s(): error removing lockfile '%s': %s\n", - __func__, path, strerror(errno)); - } - } - /* we don't want to leak the fd, so even if we fail to lock, close fd - * and remove it from list anyway. - */ - close(fd); - fd_list[list_idx].fds[seg_idx] = -1; - fd_list[list_idx].count--; - - if (ret < 0) - return -1; - return 0; -} - static int get_seg_memfd(struct hugepage_info *hi __rte_unused, unsigned int list_idx __rte_unused, @@ -425,7 +346,7 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi, } static int -resize_hugefile_in_memory(int fd, int list_idx, uint64_t fa_offset, +resize_hugefile_in_memory(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow) { int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE | @@ -441,18 +362,12 @@ resize_hugefile_in_memory(int fd, int list_idx, uint64_t fa_offset, strerror(errno)); return -1; } - /* increase/decrease total segment count */ - fd_list[list_idx].count += (grow ? 1 : -1); - if (!grow && fd_list[list_idx].count == 0) { - close(fd_list[list_idx].memseg_list_fd); - fd_list[list_idx].memseg_list_fd = -1; - } return 0; } static int -resize_hugefile_in_filesystem(int fd, char *path, int list_idx, int seg_idx, - uint64_t fa_offset, uint64_t page_sz, bool grow) +resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz, + bool grow) { bool again = false; @@ -481,72 +396,22 @@ resize_hugefile_in_filesystem(int fd, char *path, int list_idx, int seg_idx, } else { int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE; - int ret, lock_fd; + int ret; - /* if fallocate() is supported, we need to take out a - * read lock on allocate (to prevent other processes - * from deallocating this page), and take out a write - * lock on deallocate (to ensure nobody else is using - * this page). - * - * read locks on page itself are already taken out at - * file creation, in get_seg_fd(). - * - * we cannot rely on simple use of flock() call, because - * we need to be able to lock a section of the file, - * and we cannot use fcntl() locks, because of numerous - * problems with their semantics, so we will use - * deterministically named lock files for each section - * of the file. - * - * if we're shrinking the file, we want to upgrade our - * lock from shared to exclusive. - * - * lock_fd is an fd for a lockfile, not for the segment - * list. + /* + * technically, it is perfectly safe for both primary + * and secondary to grow and shrink the page files: + * growing the file repeatedly has no effect because + * a page can only be allocated once, while mmap ensures + * that secondaries hold on to the page even after the + * page itself is removed from the filesystem. + * + * however, leaving growing/shrinking to the primary + * tends to expose bugs in fdlist page count handling, + * so leave this here just in case. */ - lock_fd = get_segment_lock_fd(list_idx, seg_idx); - - if (!grow) { - /* we are using this lockfile to determine - * whether this particular page is locked, as we - * are in single file segments mode and thus - * cannot use regular flock() to get this info. - * - * we want to try and take out an exclusive lock - * on the lock file to determine if we're the - * last ones using this page, and if not, we - * won't be shrinking it, and will instead exit - * prematurely. - */ - ret = lock(lock_fd, LOCK_EX); - - /* drop the lock on the lockfile, so that even - * if we couldn't shrink the file ourselves, we - * are signalling to other processes that we're - * no longer using this page. - */ - if (unlock_segment(list_idx, seg_idx)) - RTE_LOG(ERR, EAL, "Could not unlock segment\n"); - - /* additionally, if this was the last lock on - * this segment list, we can safely close the - * page file fd, so that one of the processes - * could then delete the file after shrinking. - */ - if (ret < 1 && fd_list[list_idx].count == 0) { - close(fd); - fd_list[list_idx].memseg_list_fd = -1; - } - - if (ret < 0) { - RTE_LOG(ERR, EAL, "Could not lock segment\n"); - return -1; - } - if (ret == 0) - /* failed to lock, not an error. */ - return 0; - } + if (rte_eal_process_type() != RTE_PROC_PRIMARY) + return 0; /* grow or shrink the file */ ret = fallocate(fd, flags, fa_offset, page_sz); @@ -564,44 +429,43 @@ resize_hugefile_in_filesystem(int fd, char *path, int list_idx, int seg_idx, strerror(errno)); return -1; } - } else { + } else fallocate_supported = 1; - - /* we've grew/shrunk the file, and we hold an - * exclusive lock now. check if there are no - * more segments active in this segment list, - * and remove the file if there aren't. - */ - if (fd_list[list_idx].count == 0) { - if (unlink(path)) - RTE_LOG(ERR, EAL, "%s(): unlinking '%s' failed: %s\n", - __func__, path, - strerror(errno)); - close(fd); - fd_list[list_idx].memseg_list_fd = -1; - } - } } } while (again); return 0; } +static void +close_hugefile(int fd, char *path, int list_idx) +{ + /* + * primary process must unlink the file, but only when not in in-memory + * mode (as in that case there is no file to unlink). + */ + if (!internal_config.in_memory && + rte_eal_process_type() == RTE_PROC_PRIMARY && + unlink(path)) + RTE_LOG(ERR, EAL, "%s(): unlinking '%s' failed: %s\n", + __func__, path, strerror(errno)); + + close(fd); + fd_list[list_idx].memseg_list_fd = -1; +} static int -resize_hugefile(int fd, char *path, int list_idx, int seg_idx, - uint64_t fa_offset, uint64_t page_sz, bool grow) +resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow) { - - /* in-memory mode is a special case, because we don't need to perform - * any locking, and we can be sure that fallocate() is supported. + /* in-memory mode is a special case, because we can be sure that + * fallocate() is supported. */ if (internal_config.in_memory) - return resize_hugefile_in_memory(fd, list_idx, fa_offset, + return resize_hugefile_in_memory(fd, fa_offset, page_sz, grow); - return resize_hugefile_in_filesystem(fd, path, list_idx, seg_idx, - fa_offset, page_sz, grow); + return resize_hugefile_in_filesystem(fd, fa_offset, page_sz, + grow); } static int @@ -663,10 +527,11 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id, if (internal_config.single_file_segments) { map_offset = seg_idx * alloc_sz; - ret = resize_hugefile(fd, path, list_idx, seg_idx, - map_offset, alloc_sz, true); + ret = resize_hugefile(fd, map_offset, alloc_sz, true); if (ret < 0) goto resized; + + fd_list[list_idx].count++; } else { map_offset = 0; if (ftruncate(fd, alloc_sz) < 0) { @@ -769,15 +634,21 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id, */ RTE_LOG(CRIT, EAL, "Can't mmap holes in our virtual address space\n"); } + /* roll back the ref count */ + if (internal_config.single_file_segments) + fd_list[list_idx].count--; resized: /* some codepaths will return negative fd, so exit early */ if (fd < 0) return -1; if (internal_config.single_file_segments) { - resize_hugefile(fd, path, list_idx, seg_idx, map_offset, - alloc_sz, false); + resize_hugefile(fd, map_offset, alloc_sz, false); /* ignore failure, can't make it any worse */ + + /* if refcount is at zero, close the file */ + if (fd_list[list_idx].count == 0) + close_hugefile(fd, path, list_idx); } else { /* only remove file if we can take out a write lock */ if (internal_config.hugepage_unlink == 0 && @@ -834,9 +705,12 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi, if (internal_config.single_file_segments) { map_offset = seg_idx * ms->len; - if (resize_hugefile(fd, path, list_idx, seg_idx, map_offset, - ms->len, false)) + if (resize_hugefile(fd, map_offset, ms->len, false)) return -1; + + if (--(fd_list[list_idx].count) == 0) + close_hugefile(fd, path, list_idx); + ret = 0; } else { /* if we're able to take out a write lock, we're the last one @@ -1492,18 +1366,24 @@ alloc_list(int list_idx, int len) int *data; int i; - /* ensure we have space to store fd per each possible segment */ - data = malloc(sizeof(int) * len); - if (data == NULL) { - RTE_LOG(ERR, EAL, "Unable to allocate space for file descriptors\n"); - return -1; + /* single-file segments mode does not need fd list */ + if (!internal_config.single_file_segments) { + /* ensure we have space to store fd per each possible segment */ + data = malloc(sizeof(int) * len); + if (data == NULL) { + RTE_LOG(ERR, EAL, "Unable to allocate space for file descriptors\n"); + return -1; + } + /* set all fd's as invalid */ + for (i = 0; i < len; i++) + data[i] = -1; + fd_list[list_idx].fds = data; + fd_list[list_idx].len = len; + } else { + fd_list[list_idx].fds = NULL; + fd_list[list_idx].len = 0; } - /* set all fd's as invalid */ - for (i = 0; i < len; i++) - data[i] = -1; - fd_list[list_idx].fds = data; - fd_list[list_idx].len = len; fd_list[list_idx].count = 0; fd_list[list_idx].memseg_list_fd = -1; @@ -1551,20 +1431,10 @@ eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd) int eal_memalloc_set_seg_list_fd(int list_idx, int fd) { - struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; - /* non-single file segment mode doesn't support segment list fd's */ if (!internal_config.single_file_segments) return -ENOTSUP; - /* if list is not allocated, allocate it */ - if (fd_list[list_idx].len == 0) { - int len = mcfg->memsegs[list_idx].memseg_arr.len; - - if (alloc_list(list_idx, len) < 0) - return -ENOMEM; - } - fd_list[list_idx].memseg_list_fd = fd; return 0; @@ -1642,9 +1512,6 @@ eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset) return -ENOTSUP; } - /* fd_list not initialized? */ - if (fd_list[list_idx].len == 0) - return -ENODEV; if (internal_config.single_file_segments) { size_t pgsz = mcfg->memsegs[list_idx].page_sz; @@ -1653,6 +1520,10 @@ eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset) return -ENOENT; *offset = pgsz * seg_idx; } else { + /* fd_list not initialized? */ + if (fd_list[list_idx].len == 0) + return -ENODEV; + /* segment not active? */ if (fd_list[list_idx].fds[seg_idx] < 0) return -ENOENT; -- 2.17.1 ^ permalink raw reply [flat|nested] 25+ messages in thread
* [dpdk-dev] [PATCH v2 2/2] memalloc: do not use lockfiles for single file segments mode 2019-03-29 17:55 ` [dpdk-dev] [PATCH v2 2/2] memalloc: do not use lockfiles for single file segments mode Anatoly Burakov @ 2019-03-29 17:55 ` Anatoly Burakov 2019-04-02 14:08 ` Thomas Monjalon 1 sibling, 0 replies; 25+ messages in thread From: Anatoly Burakov @ 2019-03-29 17:55 UTC (permalink / raw) To: dev Cc: John McNamara, Marko Kovacevic, david.marchand, thomas, maxime.coquelin Due to internal glibc limitations [1], DPDK may exhaust internal file descriptor limits when using smaller page sizes, which results in inability to use system calls such as select() by user applications. Single file segments option stores lock files per page to ensure that pages are deleted when there are no more users, however this is not necessary because the processes will be holding onto the pages anyway because of mmap(). Thus, removing pages from the filesystem is safe even though they may be used by some other secondary process. As a result, single file segments mode no longer stores inordinate amounts of segment fd's, and the above issue with fd limits is solved. However, this will not work for legacy mem mode. For that, simply document that using bigger page sizes is the only option. [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- .../prog_guide/env_abstraction_layer.rst | 18 ++ lib/librte_eal/common/eal_filesystem.h | 11 - lib/librte_eal/linux/eal/eal_memalloc.c | 285 +++++------------- 3 files changed, 96 insertions(+), 218 deletions(-) diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst index 2361c3b8f..faa5f6a0d 100644 --- a/doc/guides/prog_guide/env_abstraction_layer.rst +++ b/doc/guides/prog_guide/env_abstraction_layer.rst @@ -214,6 +214,24 @@ Normally, these options do not need to be changed. can later be mapped into that preallocated VA space (if dynamic memory mode is enabled), and can optionally be mapped into it at startup. ++ Segment file descriptors + +On Linux, in most cases, EAL will store segment file descriptors in EAL. This +can become a problem when using smaller page sizes due to underlying limitations +of ``glibc`` library. For example, Linux API calls such as ``select()`` may not +work correctly because ``glibc`` does not support more than certain number of +file descriptors. + +There are two possible solutions for this problem. The recommended solution is +to use ``--single-file-segments`` mode, as that mode will not use a file +descriptor per each page, and it will keep compatibility with Virtio with +vhost-user backend. This option is not available when using ``--legacy-mem`` +mode. + +Another option is to use bigger page sizes. Since fewer pages are required to +cover the same memory area, fewer file descriptors will be stored internally +by EAL. + Support for Externally Allocated Memory ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/lib/librte_eal/common/eal_filesystem.h b/lib/librte_eal/common/eal_filesystem.h index 89a3added..f2f83e712 100644 --- a/lib/librte_eal/common/eal_filesystem.h +++ b/lib/librte_eal/common/eal_filesystem.h @@ -98,17 +98,6 @@ eal_get_hugefile_path(char *buffer, size_t buflen, const char *hugedir, int f_id return buffer; } -/** String format for hugepage map lock files. */ -#define HUGEFILE_LOCK_FMT "%s/map_%d.lock" -static inline const char * -eal_get_hugefile_lock_path(char *buffer, size_t buflen, int f_id) -{ - snprintf(buffer, buflen, HUGEFILE_LOCK_FMT, rte_eal_get_runtime_dir(), - f_id); - buffer[buflen - 1] = '\0'; - return buffer; -} - /** define the default filename prefix for the %s values above */ #define HUGEFILE_PREFIX_DEFAULT "rte" diff --git a/lib/librte_eal/linux/eal/eal_memalloc.c b/lib/librte_eal/linux/eal/eal_memalloc.c index e90fe6c95..4c51fb8ca 100644 --- a/lib/librte_eal/linux/eal/eal_memalloc.c +++ b/lib/librte_eal/linux/eal/eal_memalloc.c @@ -83,13 +83,8 @@ static int fallocate_supported = -1; /* unknown */ /* * we have two modes - single file segments, and file-per-page mode. * - * for single-file segments, we need some kind of mechanism to keep track of - * which hugepages can be freed back to the system, and which cannot. we cannot - * use flock() because they don't allow locking parts of a file, and we cannot - * use fcntl() due to issues with their semantics, so we will have to rely on a - * bunch of lockfiles for each page. so, we will use 'fds' array to keep track - * of per-page lockfiles. we will store the actual segment list fd in the - * 'memseg_list_fd' field. + * for single-file segments, we use memseg_list_fd to store the segment fd, + * while the fds[] will not be allocated, and len will be set to 0. * * for file-per-page mode, each page will have its own fd, so 'memseg_list_fd' * will be invalid (set to -1), and we'll use 'fds' to keep track of page fd's. @@ -245,80 +240,6 @@ static int lock(int fd, int type) return 1; } -static int get_segment_lock_fd(int list_idx, int seg_idx) -{ - char path[PATH_MAX] = {0}; - int fd; - - if (list_idx < 0 || list_idx >= (int)RTE_DIM(fd_list)) - return -1; - if (seg_idx < 0 || seg_idx >= fd_list[list_idx].len) - return -1; - - fd = fd_list[list_idx].fds[seg_idx]; - /* does this lock already exist? */ - if (fd >= 0) - return fd; - - eal_get_hugefile_lock_path(path, sizeof(path), - list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx); - - fd = open(path, O_CREAT | O_RDWR, 0660); - if (fd < 0) { - RTE_LOG(ERR, EAL, "%s(): error creating lockfile '%s': %s\n", - __func__, path, strerror(errno)); - return -1; - } - /* take out a read lock */ - if (lock(fd, LOCK_SH) != 1) { - RTE_LOG(ERR, EAL, "%s(): failed to take out a readlock on '%s': %s\n", - __func__, path, strerror(errno)); - close(fd); - return -1; - } - /* store it for future reference */ - fd_list[list_idx].fds[seg_idx] = fd; - fd_list[list_idx].count++; - return fd; -} - -static int unlock_segment(int list_idx, int seg_idx) -{ - int fd, ret; - - if (list_idx < 0 || list_idx >= (int)RTE_DIM(fd_list)) - return -1; - if (seg_idx < 0 || seg_idx >= fd_list[list_idx].len) - return -1; - - fd = fd_list[list_idx].fds[seg_idx]; - - /* upgrade lock to exclusive to see if we can remove the lockfile */ - ret = lock(fd, LOCK_EX); - if (ret == 1) { - /* we've succeeded in taking exclusive lock, this lockfile may - * be removed. - */ - char path[PATH_MAX] = {0}; - eal_get_hugefile_lock_path(path, sizeof(path), - list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx); - if (unlink(path)) { - RTE_LOG(ERR, EAL, "%s(): error removing lockfile '%s': %s\n", - __func__, path, strerror(errno)); - } - } - /* we don't want to leak the fd, so even if we fail to lock, close fd - * and remove it from list anyway. - */ - close(fd); - fd_list[list_idx].fds[seg_idx] = -1; - fd_list[list_idx].count--; - - if (ret < 0) - return -1; - return 0; -} - static int get_seg_memfd(struct hugepage_info *hi __rte_unused, unsigned int list_idx __rte_unused, @@ -425,7 +346,7 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi, } static int -resize_hugefile_in_memory(int fd, int list_idx, uint64_t fa_offset, +resize_hugefile_in_memory(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow) { int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE | @@ -441,18 +362,12 @@ resize_hugefile_in_memory(int fd, int list_idx, uint64_t fa_offset, strerror(errno)); return -1; } - /* increase/decrease total segment count */ - fd_list[list_idx].count += (grow ? 1 : -1); - if (!grow && fd_list[list_idx].count == 0) { - close(fd_list[list_idx].memseg_list_fd); - fd_list[list_idx].memseg_list_fd = -1; - } return 0; } static int -resize_hugefile_in_filesystem(int fd, char *path, int list_idx, int seg_idx, - uint64_t fa_offset, uint64_t page_sz, bool grow) +resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz, + bool grow) { bool again = false; @@ -481,72 +396,22 @@ resize_hugefile_in_filesystem(int fd, char *path, int list_idx, int seg_idx, } else { int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE; - int ret, lock_fd; + int ret; - /* if fallocate() is supported, we need to take out a - * read lock on allocate (to prevent other processes - * from deallocating this page), and take out a write - * lock on deallocate (to ensure nobody else is using - * this page). - * - * read locks on page itself are already taken out at - * file creation, in get_seg_fd(). - * - * we cannot rely on simple use of flock() call, because - * we need to be able to lock a section of the file, - * and we cannot use fcntl() locks, because of numerous - * problems with their semantics, so we will use - * deterministically named lock files for each section - * of the file. - * - * if we're shrinking the file, we want to upgrade our - * lock from shared to exclusive. - * - * lock_fd is an fd for a lockfile, not for the segment - * list. + /* + * technically, it is perfectly safe for both primary + * and secondary to grow and shrink the page files: + * growing the file repeatedly has no effect because + * a page can only be allocated once, while mmap ensures + * that secondaries hold on to the page even after the + * page itself is removed from the filesystem. + * + * however, leaving growing/shrinking to the primary + * tends to expose bugs in fdlist page count handling, + * so leave this here just in case. */ - lock_fd = get_segment_lock_fd(list_idx, seg_idx); - - if (!grow) { - /* we are using this lockfile to determine - * whether this particular page is locked, as we - * are in single file segments mode and thus - * cannot use regular flock() to get this info. - * - * we want to try and take out an exclusive lock - * on the lock file to determine if we're the - * last ones using this page, and if not, we - * won't be shrinking it, and will instead exit - * prematurely. - */ - ret = lock(lock_fd, LOCK_EX); - - /* drop the lock on the lockfile, so that even - * if we couldn't shrink the file ourselves, we - * are signalling to other processes that we're - * no longer using this page. - */ - if (unlock_segment(list_idx, seg_idx)) - RTE_LOG(ERR, EAL, "Could not unlock segment\n"); - - /* additionally, if this was the last lock on - * this segment list, we can safely close the - * page file fd, so that one of the processes - * could then delete the file after shrinking. - */ - if (ret < 1 && fd_list[list_idx].count == 0) { - close(fd); - fd_list[list_idx].memseg_list_fd = -1; - } - - if (ret < 0) { - RTE_LOG(ERR, EAL, "Could not lock segment\n"); - return -1; - } - if (ret == 0) - /* failed to lock, not an error. */ - return 0; - } + if (rte_eal_process_type() != RTE_PROC_PRIMARY) + return 0; /* grow or shrink the file */ ret = fallocate(fd, flags, fa_offset, page_sz); @@ -564,44 +429,43 @@ resize_hugefile_in_filesystem(int fd, char *path, int list_idx, int seg_idx, strerror(errno)); return -1; } - } else { + } else fallocate_supported = 1; - - /* we've grew/shrunk the file, and we hold an - * exclusive lock now. check if there are no - * more segments active in this segment list, - * and remove the file if there aren't. - */ - if (fd_list[list_idx].count == 0) { - if (unlink(path)) - RTE_LOG(ERR, EAL, "%s(): unlinking '%s' failed: %s\n", - __func__, path, - strerror(errno)); - close(fd); - fd_list[list_idx].memseg_list_fd = -1; - } - } } } while (again); return 0; } +static void +close_hugefile(int fd, char *path, int list_idx) +{ + /* + * primary process must unlink the file, but only when not in in-memory + * mode (as in that case there is no file to unlink). + */ + if (!internal_config.in_memory && + rte_eal_process_type() == RTE_PROC_PRIMARY && + unlink(path)) + RTE_LOG(ERR, EAL, "%s(): unlinking '%s' failed: %s\n", + __func__, path, strerror(errno)); + + close(fd); + fd_list[list_idx].memseg_list_fd = -1; +} static int -resize_hugefile(int fd, char *path, int list_idx, int seg_idx, - uint64_t fa_offset, uint64_t page_sz, bool grow) +resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow) { - - /* in-memory mode is a special case, because we don't need to perform - * any locking, and we can be sure that fallocate() is supported. + /* in-memory mode is a special case, because we can be sure that + * fallocate() is supported. */ if (internal_config.in_memory) - return resize_hugefile_in_memory(fd, list_idx, fa_offset, + return resize_hugefile_in_memory(fd, fa_offset, page_sz, grow); - return resize_hugefile_in_filesystem(fd, path, list_idx, seg_idx, - fa_offset, page_sz, grow); + return resize_hugefile_in_filesystem(fd, fa_offset, page_sz, + grow); } static int @@ -663,10 +527,11 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id, if (internal_config.single_file_segments) { map_offset = seg_idx * alloc_sz; - ret = resize_hugefile(fd, path, list_idx, seg_idx, - map_offset, alloc_sz, true); + ret = resize_hugefile(fd, map_offset, alloc_sz, true); if (ret < 0) goto resized; + + fd_list[list_idx].count++; } else { map_offset = 0; if (ftruncate(fd, alloc_sz) < 0) { @@ -769,15 +634,21 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id, */ RTE_LOG(CRIT, EAL, "Can't mmap holes in our virtual address space\n"); } + /* roll back the ref count */ + if (internal_config.single_file_segments) + fd_list[list_idx].count--; resized: /* some codepaths will return negative fd, so exit early */ if (fd < 0) return -1; if (internal_config.single_file_segments) { - resize_hugefile(fd, path, list_idx, seg_idx, map_offset, - alloc_sz, false); + resize_hugefile(fd, map_offset, alloc_sz, false); /* ignore failure, can't make it any worse */ + + /* if refcount is at zero, close the file */ + if (fd_list[list_idx].count == 0) + close_hugefile(fd, path, list_idx); } else { /* only remove file if we can take out a write lock */ if (internal_config.hugepage_unlink == 0 && @@ -834,9 +705,12 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi, if (internal_config.single_file_segments) { map_offset = seg_idx * ms->len; - if (resize_hugefile(fd, path, list_idx, seg_idx, map_offset, - ms->len, false)) + if (resize_hugefile(fd, map_offset, ms->len, false)) return -1; + + if (--(fd_list[list_idx].count) == 0) + close_hugefile(fd, path, list_idx); + ret = 0; } else { /* if we're able to take out a write lock, we're the last one @@ -1492,18 +1366,24 @@ alloc_list(int list_idx, int len) int *data; int i; - /* ensure we have space to store fd per each possible segment */ - data = malloc(sizeof(int) * len); - if (data == NULL) { - RTE_LOG(ERR, EAL, "Unable to allocate space for file descriptors\n"); - return -1; + /* single-file segments mode does not need fd list */ + if (!internal_config.single_file_segments) { + /* ensure we have space to store fd per each possible segment */ + data = malloc(sizeof(int) * len); + if (data == NULL) { + RTE_LOG(ERR, EAL, "Unable to allocate space for file descriptors\n"); + return -1; + } + /* set all fd's as invalid */ + for (i = 0; i < len; i++) + data[i] = -1; + fd_list[list_idx].fds = data; + fd_list[list_idx].len = len; + } else { + fd_list[list_idx].fds = NULL; + fd_list[list_idx].len = 0; } - /* set all fd's as invalid */ - for (i = 0; i < len; i++) - data[i] = -1; - fd_list[list_idx].fds = data; - fd_list[list_idx].len = len; fd_list[list_idx].count = 0; fd_list[list_idx].memseg_list_fd = -1; @@ -1551,20 +1431,10 @@ eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd) int eal_memalloc_set_seg_list_fd(int list_idx, int fd) { - struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; - /* non-single file segment mode doesn't support segment list fd's */ if (!internal_config.single_file_segments) return -ENOTSUP; - /* if list is not allocated, allocate it */ - if (fd_list[list_idx].len == 0) { - int len = mcfg->memsegs[list_idx].memseg_arr.len; - - if (alloc_list(list_idx, len) < 0) - return -ENOMEM; - } - fd_list[list_idx].memseg_list_fd = fd; return 0; @@ -1642,9 +1512,6 @@ eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset) return -ENOTSUP; } - /* fd_list not initialized? */ - if (fd_list[list_idx].len == 0) - return -ENODEV; if (internal_config.single_file_segments) { size_t pgsz = mcfg->memsegs[list_idx].page_sz; @@ -1653,6 +1520,10 @@ eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset) return -ENOENT; *offset = pgsz * seg_idx; } else { + /* fd_list not initialized? */ + if (fd_list[list_idx].len == 0) + return -ENODEV; + /* segment not active? */ if (fd_list[list_idx].fds[seg_idx] < 0) return -ENOENT; -- 2.17.1 ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH v2 2/2] memalloc: do not use lockfiles for single file segments mode 2019-03-29 17:55 ` [dpdk-dev] [PATCH v2 2/2] memalloc: do not use lockfiles for single file segments mode Anatoly Burakov 2019-03-29 17:55 ` Anatoly Burakov @ 2019-04-02 14:08 ` Thomas Monjalon 2019-04-02 14:08 ` Thomas Monjalon 1 sibling, 1 reply; 25+ messages in thread From: Thomas Monjalon @ 2019-04-02 14:08 UTC (permalink / raw) To: Anatoly Burakov Cc: dev, John McNamara, Marko Kovacevic, david.marchand, maxime.coquelin 29/03/2019 18:55, Anatoly Burakov: > Due to internal glibc limitations [1], DPDK may exhaust internal > file descriptor limits when using smaller page sizes, which results > in inability to use system calls such as select() by user > applications. > > Single file segments option stores lock files per page to ensure > that pages are deleted when there are no more users, however this > is not necessary because the processes will be holding onto the > pages anyway because of mmap(). Thus, removing pages from the > filesystem is safe even though they may be used by some other > secondary process. As a result, single file segments mode no > longer stores inordinate amounts of segment fd's, and the above > issue with fd limits is solved. > > However, this will not work for legacy mem mode. For that, simply > document that using bigger page sizes is the only option. > > [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html > > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Applied, thanks ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [dpdk-dev] [PATCH v2 2/2] memalloc: do not use lockfiles for single file segments mode 2019-04-02 14:08 ` Thomas Monjalon @ 2019-04-02 14:08 ` Thomas Monjalon 0 siblings, 0 replies; 25+ messages in thread From: Thomas Monjalon @ 2019-04-02 14:08 UTC (permalink / raw) To: Anatoly Burakov Cc: dev, John McNamara, Marko Kovacevic, david.marchand, maxime.coquelin 29/03/2019 18:55, Anatoly Burakov: > Due to internal glibc limitations [1], DPDK may exhaust internal > file descriptor limits when using smaller page sizes, which results > in inability to use system calls such as select() by user > applications. > > Single file segments option stores lock files per page to ensure > that pages are deleted when there are no more users, however this > is not necessary because the processes will be holding onto the > pages anyway because of mmap(). Thus, removing pages from the > filesystem is safe even though they may be used by some other > secondary process. As a result, single file segments mode no > longer stores inordinate amounts of segment fd's, and the above > issue with fd limits is solved. > > However, this will not work for legacy mem mode. For that, simply > document that using bigger page sizes is the only option. > > [1] https://mails.dpdk.org/archives/dev/2019-February/124386.html > > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Applied, thanks ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2019-04-02 14:08 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-02-22 17:12 [dpdk-dev] [PATCH] eal: add option to not store segment fd's Anatoly Burakov 2019-03-29 9:50 ` David Marchand 2019-03-29 9:50 ` David Marchand 2019-03-29 10:33 ` Burakov, Anatoly 2019-03-29 10:33 ` Burakov, Anatoly 2019-03-29 11:34 ` Thomas Monjalon 2019-03-29 11:34 ` Thomas Monjalon 2019-03-29 12:05 ` Burakov, Anatoly 2019-03-29 12:05 ` Burakov, Anatoly 2019-03-29 12:40 ` Thomas Monjalon 2019-03-29 12:40 ` Thomas Monjalon 2019-03-29 13:24 ` Burakov, Anatoly 2019-03-29 13:24 ` Burakov, Anatoly 2019-03-29 13:34 ` Thomas Monjalon 2019-03-29 13:34 ` Thomas Monjalon 2019-03-29 14:21 ` Burakov, Anatoly 2019-03-29 14:21 ` Burakov, Anatoly 2019-03-29 13:35 ` Maxime Coquelin 2019-03-29 13:35 ` Maxime Coquelin 2019-03-29 17:55 ` [dpdk-dev] [PATCH v2 1/2] memalloc: refactor segment resizing code Anatoly Burakov 2019-03-29 17:55 ` Anatoly Burakov 2019-03-29 17:55 ` [dpdk-dev] [PATCH v2 2/2] memalloc: do not use lockfiles for single file segments mode Anatoly Burakov 2019-03-29 17:55 ` Anatoly Burakov 2019-04-02 14:08 ` Thomas Monjalon 2019-04-02 14:08 ` Thomas Monjalon
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).