From: Kamraan Nasim
To: dev@dpdk.org, newman555p@gmail.com, liran@weka.io
Date: Wed, 15 Apr 2015 17:43:02 -0700
Subject: Re: [dpdk-dev] dev Digest, Vol 22, Issue 37

This had me stumped for a while as well. In my case, PostgreSQL 9.4 was
also running on my system; it also used huge pages, and it came up before
my DPDK application, causing rte_mempool_create() to fail with ENOMEM.

Check which other applications are using huge pages:

    lsof | grep huge

and see whether you can disable huge pages for them, or increase the
total number of pages you're reserving in the kernel.
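For example, you can also sanity-check what the kernel thinks is left
before creating the pool. A minimal sketch (plain /proc/meminfo parsing,
which covers the default huge page size only; this is not a DPDK API):

#include <stdio.h>

/* Return the number of free huge pages reported by the kernel,
 * or -1 if /proc/meminfo could not be read. */
static long free_hugepages(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256];
    long n = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (sscanf(line, "HugePages_Free: %ld", &n) == 1)
            break;
    }
    fclose(f);
    return n;
}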
--Kam

> Date: Sat, 10 Jan 2015 21:26:03 +0200
> From: Liran Zvibel
> To: Newman Poborsky, "dev@dpdk.org"
> Subject: Re: [dpdk-dev] rte_mempool_create fails with ENOMEM
>
> Hi Newman,
>
> There are two options: either one of your pools is very large and simply
> does not fit in half of the memory, so if the physical memory must be
> split it can never work; or what you're seeing is localized to your
> environment, and when allocating from both NUMA nodes the huge pages
> just happen to be too scattered for your pools to be allocated.
>
> In any case, we also have to deal with large pools that don't always fit
> into consecutive huge pages as allocated by the kernel. I have created a
> small patch to DPDK itself, plus some more code that can live as part of
> the DPDK application, that together do the scattered allocation.
>
> I'm going to send both parts here (the change to DPDK and the user
> part). I don't know what the rules for pushing to the repository are,
> so I won't try to do so.
>
> First, the DPDK patch, which just makes sure that the huge pages are
> mapped into contiguous virtual memory, and that the memory segments are
> then allocated contiguously in virtual memory. I'm attaching the full
> mbox content to make it easier for you to use if you'd like. I created
> it against 1.7.1, since that is the version we're using. If you'd like,
> I can also create it against 1.8.0.
>
> ====================================================
>
> From 10ebc74eda2c3fe9e5a34815e0f7ee1f44d99aa3 Mon Sep 17 00:00:00 2001
> From: Liran Zvibel
> Date: Sat, 10 Jan 2015 12:46:54 +0200
> Subject: [PATCH] Add an option to allocate huge pages at contiguous
>  virtual addresses
> To: dev@dpdk.org
>
> Add a configuration option, CONFIG_RTE_EAL_HUGEPAGES_SINGLE_CONT_VADDR,
> that advises the memory segment allocation code to allocate as many
> huge pages as possible contiguously in virtual addresses.
>
> This way, a mempool may be created out of dispersed memzones allocated
> from these new contiguous memory segments.
> ---
>  lib/librte_eal/linuxapp/eal/eal_memory.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
> index f2454f4..b8d68b0 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> @@ -329,6 +329,7 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
>
>  #ifndef RTE_EAL_SINGLE_FILE_SEGMENTS
>  	else if (vma_len == 0) {
> +#ifndef RTE_EAL_HUGEPAGES_SINGLE_CONT_VADDR
>  		unsigned j, num_pages;
>
>  		/* reserve a virtual area for next contiguous
> @@ -340,6 +341,14 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
>  			break;
>  		}
>  		num_pages = j - i;
> +#else /* huge pages will be allocated contiguously in virtual addresses */
> +		unsigned num_pages;
> +
> +		/* We will reserve a virtual area large enough to fit ALL
> +		 * physical blocks. This way we can have bigger mempools
> +		 * even if there is no contiguous physical region. */
> +		num_pages = hpi->num_pages[0] - i;
> +#endif
>  		vma_len = num_pages * hugepage_sz;
>
>  		/* get the biggest virtual memory area up to
> @@ -1268,6 +1277,16 @@ rte_eal_hugepage_init(void)
>  			new_memseg = 1;
>
>  		if (new_memseg) {
> +#ifdef RTE_EAL_HUGEPAGES_SINGLE_CONT_VADDR
> +			if (0 <= j) {
> +				RTE_LOG(DEBUG, EAL, "Closing memory segment #%d(%p) vaddr is %p phys is 0x%lx size is 0x%lx "
> +					"which is #%ld pages; next vaddr will be at 0x%lx\n",
> +					j, &mcfg->memseg[j],
> +					mcfg->memseg[j].addr, mcfg->memseg[j].phys_addr, mcfg->memseg[j].len,
> +					mcfg->memseg[j].len / mcfg->memseg[j].hugepage_sz,
> +					mcfg->memseg[j].addr_64 + mcfg->memseg[j].len);
> +			}
> +#endif
>  			j += 1;
>  			if (j == RTE_MAX_MEMSEG)
>  				break;
> --
> 1.9.3 (Apple Git-50)
>
> ================================================================
>
> Then there is the dpdk-application library part, which implements this
> interface:
>
> struct rte_mempool *scattered_mempool_create(uint32_t elt_size,
>                         uint32_t elt_num, int32_t socket_id,
>                         rte_mempool_ctor_t *mp_init, void *mp_init_arg,
>                         rte_mempool_obj_ctor_t *obj_init, void *obj_init_arg);
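> As a usage sketch (not part of the code above; the element size, the
> count, and the pktmbuf constructors are just an example), an application
> could create a packet pool over scattered segments like this:
>
> #include <rte_mbuf.h>
> #include <rte_mempool.h>
>
> /* Illustrative: a 16K x 2KB packet pool on socket 0, built through the
>  * scattered allocator with the standard pktmbuf constructors. */
> static struct rte_mempool *make_pkt_pool(void)
> {
>     return scattered_mempool_create(2048, 16384, 0,
>                                     rte_pktmbuf_pool_init, NULL,
>                                     rte_pktmbuf_init, NULL);
> }
>
> The private-data size is already handled inside (it passes
> sizeof(struct rte_pktmbuf_pool_private) to rte_mempool_xmem_create()),
> so the usual pktmbuf constructors fit naturally.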
> If you would like, I can easily break the different functions into
> their right places in the rte_memseg and rte_mempool DPDK modules, and
> have this included as another interface of the DPDK library (as
> suggested by Konstantin below).
>
> =====================================================
>
> static inline int is_memseg_valid(struct rte_memseg *free_memseg,
>                                   size_t requested_page_size,
>                                   int socket_id)
> {
>     if (free_memseg->len == 0)
>         return 0;
>
>     if (socket_id != SOCKET_ID_ANY &&
>         free_memseg->socket_id != SOCKET_ID_ANY &&
>         free_memseg->socket_id != socket_id) {
>         RTE_LOG(DEBUG, USER1, "memseg does not qualify for socket_id, requested %d got %d",
>             socket_id, free_memseg->socket_id);
>         return 0;
>     }
>
>     if (free_memseg->len < requested_page_size) {
>         RTE_LOG(DEBUG, USER1, "memseg too small: len %lu < requested_page_size %lu",
>             free_memseg->len, requested_page_size);
>         return 0;
>     }
>
>     if (free_memseg->hugepage_sz != requested_page_size) {
>         RTE_LOG(DEBUG, USER1, "memseg hugepage size != requested page size: %lu != %lu",
>             free_memseg->hugepage_sz, requested_page_size);
>         return 0;
>     }
>
>     return 1;
> }
>
> static int try_allocating_memseg_range(struct rte_memseg *free_memseg, int start,
>                                        size_t requested_page_size, size_t len,
>                                        int socket_id)
> {
>     int i;
>
>     for (i = start; i < RTE_MAX_MEMSEG; i++) {
>         if (free_memseg[i].addr == NULL)
>             return -1;
>
>         if (!is_memseg_valid(free_memseg + i, requested_page_size, socket_id))
>             return -1;
>
>         /* Segments must be adjacent in virtual memory. */
>         if ((start != i) &&
>             ((char *)free_memseg[i].addr !=
>              (char *)free_memseg[i-1].addr + free_memseg[i-1].len)) {
>             RTE_LOG(DEBUG, USER1, "Looking for cont memseg range. "
>                 "[%d].vaddr %p != [%d].vaddr %p + [i-1].len %lu == %p",
>                 i, free_memseg[i].addr, i-1, free_memseg[i-1].addr,
>                 free_memseg[i-1].len,
>                 (char *)(free_memseg[i-1].addr) + free_memseg[i-1].len);
>             return -1;
>         }
>
>         if ((free_memseg[i].len < len) &&
>             ((free_memseg[i].len % requested_page_size) != 0)) {
>             RTE_LOG(DEBUG, USER1, "#%d memseg length not a multiple of page size, or last. "
>                 "len %lu len %% requested_pg_size %lu, requested_pg_sz %lu",
>                 i, free_memseg[i].len,
>                 free_memseg[i].len % requested_page_size, requested_page_size);
>             return -1;
>         }
>
>         if (len <= free_memseg[i].len) {
>             RTE_LOG(DEBUG, USER1, "Successfully finished looking for memsegs. remaining req. "
>                 "len %lu seg_len %lu, start %d i %d",
>                 len, free_memseg[i].len, start, i);
>             return i - start + 1;
>         }
>
>         if (i == start) {
>             /* We may not start at the beginning of the segment; move to
>              * the next page-size alignment first. */
>             char *aligned_vaddr = RTE_PTR_ALIGN_CEIL(free_memseg[i].addr,
>                                       requested_page_size);
>             size_t diff = (size_t)(aligned_vaddr - (char *)free_memseg[i].addr);
>
>             if ((free_memseg[i].len - diff) % requested_page_size != 0) {
>                 RTE_LOG(ERR, USER1, "BUG! First segment is not page aligned! vaddr %p aligned "
>                     "vaddr %p diff %lu len %lu, len - diff %lu, "
>                     "(len%%diff)/%lu == %lu",
>                     free_memseg[i].addr, aligned_vaddr, diff,
>                     free_memseg[i].len, free_memseg[i].len - diff,
>                     requested_page_size,
>                     (free_memseg[i].len - diff) % requested_page_size);
>                 return -1;
>             } else if (0 == free_memseg[i].len - diff) {
>                 RTE_LOG(DEBUG, USER1, "After alignment, first memseg is empty!");
>                 return -1;
>             }
>
>             RTE_LOG(DEBUG, USER1, "First memseg gives (after alignment) len %lu out of potential %lu",
>                 (free_memseg[i].len - diff), free_memseg[i].len);
>             /* Only the part past the alignment point counts. */
>             len -= (free_memseg[i].len - diff);
>         } else {
>             len -= free_memseg[i].len;
>         }
>     }
>
>     return -1;
> }
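> (Concretely, with invented addresses: if free_memseg[4].addr is
> 0x7f0140000000 and its len is 0x400000, the scan can only continue into
> free_memseg[5] if free_memseg[5].addr is exactly 0x7f0140400000; the
> candidate segments must tile a single virtual address range with no
> holes, which is what the patch above arranges.)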
> /**
>  * Registers several memory zones at contiguous virtual addresses, of
>  * large total size. All memzones but the last use full huge pages; only
>  * the last memzone may take less than a full huge page.
>  *
>  * It walks all the free memory segments; once it finds a memory segment
>  * with full huge pages, it checks whether it can start allocating from
>  * that memory segment onwards.
>  */
> static const struct rte_memzone *
> memzone_reserve_multiple_cont_mz(const char *basename, size_t *zones_len,
>                                  size_t len, int socket_id,
>                                  unsigned flags, unsigned align)
> {
>     struct rte_mem_config *mcfg;
>     const struct rte_memzone *ret = NULL;
>     size_t requested_page_size;
>     int i;
>     struct rte_memseg *free_memseg = NULL;
>     int first_memseg = -1;
>     int memseg_count = -1;
>
>     mcfg = rte_eal_get_configuration()->mem_config;
>     free_memseg = mcfg->free_memseg;
>
>     RTE_LOG(DEBUG, USER1, "mcfg is at %p free_memseg at %p memseg at %p",
>         mcfg, mcfg->free_memseg, mcfg->memseg);
>
>     for (i = 0; i < 10 && (free_memseg[i].addr != NULL); i++) {
>         RTE_LOG(DEBUG, USER1, "free_memseg[%d]: vaddr %p phys_addr %p len %lu pages: %lu [0x%lx]",
>             i, free_memseg[i].addr, (void *)free_memseg[i].phys_addr,
>             free_memseg[i].len,
>             free_memseg[i].len / free_memseg[i].hugepage_sz,
>             free_memseg[i].hugepage_sz);
>     }
>
>     for (i = 0; i < 10 && (mcfg->memseg[i].addr != NULL); i++) {
>         RTE_LOG(DEBUG, USER1, "memseg[%d]: vaddr %p phys_addr %p len %lu pages: %lu [0x%lx]",
>             i, mcfg->memseg[i].addr, (void *)mcfg->memseg[i].phys_addr,
>             mcfg->memseg[i].len,
>             mcfg->memseg[i].len / mcfg->memseg[i].hugepage_sz,
>             mcfg->memseg[i].hugepage_sz);
>     }
>
>     *zones_len = 0;
>
>     if (mcfg->memzone_idx >= RTE_MAX_MEMZONE) {
>         RTE_LOG(DEBUG, USER1, "No more room for new memzones");
>         return NULL;
>     }
>
>     if ((flags & (RTE_MEMZONE_2MB | RTE_MEMZONE_1GB)) == 0) {
>         RTE_LOG(DEBUG, USER1, "Must request either 2MB or 1GB pages");
>         return NULL;
>     }
>
>     if ((flags & RTE_MEMZONE_1GB) && (flags & RTE_MEMZONE_2MB)) {
>         RTE_LOG(DEBUG, USER1, "Cannot request both 1GB and 2MB pages");
>         return NULL;
>     }
>
>     if (flags & RTE_MEMZONE_2MB)
>         requested_page_size = RTE_PGSIZE_2M;
>     else
>         requested_page_size = RTE_PGSIZE_1G;
>
>     if (len < requested_page_size) {
>         RTE_LOG(DEBUG, USER1, "Requested length %lu is smaller than requested page size %lu",
>             len, requested_page_size);
>         return NULL;
>     }
>
>     /* First see whether a single physically contiguous zone still works. */
>     ret = rte_memzone_reserve_aligned(basename, len, socket_id, flags, align);
>     if (ret != NULL) {
>         RTE_LOG(DEBUG, USER1, "Normal rte_memzone_reserve_aligned worked!");
>         *zones_len = 1;
>         return ret;
>     }
>
>     RTE_LOG(DEBUG, USER1, "rte_memzone_reserve_aligned failed. Will have to allocate on our own");
>     rte_rwlock_write_lock(&mcfg->mlock);
>
>     for (i = 0; i < RTE_MAX_MEMSEG; i++) {
>         if (free_memseg[i].addr == NULL)
>             break;
>
>         if (!is_memseg_valid(free_memseg + i, requested_page_size, socket_id))
>             continue;
>
>         memseg_count = try_allocating_memseg_range(free_memseg, i,
>                            requested_page_size, len, socket_id);
>         if (0 < memseg_count) {
>             RTE_LOG(DEBUG, USER1, "Was able to find memsegments for zone! "
>                 "first segment: %d segment_count %d len %lu",
>                 i, memseg_count, len);
>             first_memseg = i;
>
>             /* Fix the first memseg: make sure it is page aligned. */
>             char *aligned_vaddr = RTE_PTR_ALIGN_CEIL(free_memseg[i].addr,
>                                       requested_page_size);
>             size_t diff = (size_t)(aligned_vaddr - (char *)free_memseg[i].addr);
>
>             RTE_LOG(DEBUG, USER1, "Decreasing first segment by %lu", diff);
>             free_memseg[i].addr = aligned_vaddr;
>             free_memseg[i].phys_addr += diff;
>             free_memseg[i].len -= diff;
>             if ((free_memseg[i].phys_addr % requested_page_size) != 0) {
>                 RTE_LOG(ERR, USER1, "After aligning first free memseg, "
>                     "physical address NOT page aligned! %p",
>                     (void *)free_memseg[i].phys_addr);
>                 abort();
>             }
>
>             break;
>         }
>     }
>
>     if (first_memseg < 0) {
>         RTE_LOG(DEBUG, USER1, "Could not find memsegs to allocate enough memory");
>         goto out;
>     }
>
>     /* Now perform the actual allocation. */
>     if (mcfg->memzone_idx + memseg_count >= RTE_MAX_MEMZONE) {
>         RTE_LOG(DEBUG, USER1, "There are not enough memzones to allocate. "
>             "memzone_idx %d memseg_count %d max %s=%d",
>             mcfg->memzone_idx, memseg_count,
>             RTE_STR(RTE_MAX_MEMZONE), RTE_MAX_MEMZONE);
>         goto out;
>     }
>
>     ret = &mcfg->memzone[mcfg->memzone_idx];
>     *zones_len = memseg_count;
>     for (i = first_memseg; i < first_memseg + memseg_count; i++) {
>         size_t allocated_length;
>
>         if (free_memseg[i].len <= len)
>             allocated_length = free_memseg[i].len;
>         else
>             allocated_length = len;
>
>         struct rte_memzone *mz = &mcfg->memzone[mcfg->memzone_idx++];
>         snprintf(mz->name, sizeof(mz->name), "%s%d", basename, i - first_memseg);
>         mz->phys_addr = free_memseg[i].phys_addr;
>         mz->addr = free_memseg[i].addr;
>         mz->len = allocated_length;
>         mz->hugepage_sz = free_memseg[i].hugepage_sz;
>         mz->socket_id = free_memseg[i].socket_id;
>         mz->flags = 0;
>         mz->memseg_id = i;
>
>         free_memseg[i].len -= allocated_length;
>         free_memseg[i].phys_addr += allocated_length;
>         free_memseg[i].addr_64 += allocated_length;
>         len -= allocated_length;
>     }
>
>     if (len != 0) {
>         RTE_LOG(DEBUG, USER1, "After registering all the memzones, requested len %lu is still left",
>             len);
>         ret = NULL;
>         goto out;
>     }
>
> out:
>     rte_rwlock_write_unlock(&mcfg->mlock);
>     return ret;
> }
> static inline void build_physical_pages(phys_addr_t *phys_pages, int num_phys_pages,
>                                         size_t sz, const struct rte_memzone *mz,
>                                         int num_zones)
> {
>     size_t accounted_for_size = 0;
>     int curr_page = 0;
>     int i;
>     unsigned j;
>
>     RTE_LOG(DEBUG, USER1, "Phys pages are at %p 2M is %d mz pagesize is %lu trailing zeros: %d",
>         phys_pages, RTE_PGSIZE_2M, mz->hugepage_sz,
>         __builtin_ctz(mz->hugepage_sz));
>
>     for (i = 0; i < num_zones; i++) {
>         size_t mz_remaining_len = mz[i].len;
>         for (j = 0; (j <= mz[i].len / RTE_PGSIZE_2M) && (0 < mz_remaining_len); j++) {
>             phys_pages[curr_page++] = mz[i].phys_addr + j * RTE_PGSIZE_2M;
>
>             size_t added_len = RTE_MIN((size_t)RTE_PGSIZE_2M, mz_remaining_len);
>             accounted_for_size += added_len;
>             mz_remaining_len -= added_len;
>
>             if (sz <= accounted_for_size) {
>                 RTE_LOG(DEBUG, USER1, "Filled in %d pages of the physical pages array",
>                     curr_page);
>                 return;
>             }
>             if (num_phys_pages < curr_page) {
>                 RTE_LOG(ERR, USER1, "When building physical pages array, "
>                     "used pages (%d) is more than allocated pages %d. "
>                     "accounted size %lu size %lu",
>                     curr_page, num_phys_pages, accounted_for_size, sz);
>                 abort();
>             }
>         }
>     }
>
>     if (accounted_for_size < sz) {
>         RTE_LOG(ERR, USER1, "Finished going over %d memory zones, and still accounted size is %lu "
>             "and requested size is %lu",
>             num_zones, accounted_for_size, sz);
>         abort();
>     }
> }
> struct rte_mempool *scattered_mempool_create(uint32_t elt_size,
>                         uint32_t elt_num, int32_t socket_id,
>                         rte_mempool_ctor_t *mp_init, void *mp_init_arg,
>                         rte_mempool_obj_ctor_t *obj_init, void *obj_init_arg)
> {
>     struct rte_mempool *mp;
>     const struct rte_memzone *mz;
>     size_t num_zones;
>     struct rte_mempool_objsz obj_sz;
>     uint32_t flags, total_size;
>     size_t sz;
>
>     flags = (MEMPOOL_F_NO_SPREAD | MEMPOOL_F_SC_GET | MEMPOOL_F_SP_PUT);
>
>     total_size = rte_mempool_calc_obj_size(elt_size, flags, &obj_sz);
>
>     sz = elt_num * total_size;
>     /* We now have to account for the "gap" at the end of each page:
>      * worst case is that we get all distinct pages, so we add the gap
>      * once for each possible page. */
>     int pages_num = (sz + RTE_PGSIZE_2M - 1) / RTE_PGSIZE_2M;
>     int page_gap = RTE_PGSIZE_2M % total_size;
>     sz += (size_t)pages_num * page_gap;
>     /* The gaps themselves may spill over into extra pages. */
>     pages_num = (sz + RTE_PGSIZE_2M - 1) / RTE_PGSIZE_2M;
>
>     RTE_LOG(DEBUG, USER1, "Will have to allocate %d 2M pages for the page table.",
>         pages_num);
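> (To make the sizing concrete, with invented numbers: say elt_size is
> 2048 and rte_mempool_calc_obj_size() returns total_size = 2112, i.e.
> the object plus its header and trailer. Each 2MB page then holds
> 2097152 / 2112 = 992 whole objects, leaving a trailing gap of
> 2097152 % 2112 = 2048 bytes. For elt_num = 16384, sz starts at
> 16384 * 2112 = 34603008 bytes, which spans 17 pages, so the worst-case
> correction adds 17 * 2048 = 34816 bytes.)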
>     if ((mz = memzone_reserve_multiple_cont_mz("data_obj", &num_zones, sz,
>              socket_id, RTE_MEMZONE_2MB, RTE_PGSIZE_2M)) == NULL) {
>         RTE_LOG(WARNING, USER1, "memzone reserve multi mz returned NULL for socket id %d, will try ANY",
>             socket_id);
>         if ((mz = memzone_reserve_multiple_cont_mz("data_obj", &num_zones, sz,
>                  SOCKET_ID_ANY, RTE_MEMZONE_2MB, RTE_PGSIZE_2M)) == NULL) {
>             RTE_LOG(ERR, USER1, "memzone reserve multi mz returned NULL even for any socket");
>             return NULL;
>         } else {
>             RTE_LOG(DEBUG, USER1, "memzone reserve multi mz returned %p with %lu zones for SOCKET_ID_ANY",
>                 mz, num_zones);
>         }
>     } else {
>         RTE_LOG(DEBUG, USER1, "memzone reserve multi mz returned %p with %lu zones for size %lu socket %d",
>             mz, num_zones, sz, socket_id);
>     }
>
>     /* Now "break" the zones into 2M physical pages. */
>     phys_addr_t *phys_pages = malloc(sizeof(phys_addr_t) * pages_num);
>     if (phys_pages == NULL) {
>         RTE_LOG(DEBUG, USER1, "phys_pages is null. aborting");
>         abort();
>     }
>
>     build_physical_pages(phys_pages, pages_num, sz, mz, num_zones);
>     RTE_LOG(DEBUG, USER1, "Beginning of vaddr is %p beginning of physical addr is 0x%lx",
>         mz->addr, mz->phys_addr);
>     mp = rte_mempool_xmem_create("data_pool", elt_num, elt_size,
>              257, sizeof(struct rte_pktmbuf_pool_private),
>              mp_init, mp_init_arg, obj_init, obj_init_arg,
>              socket_id, flags, (char *)mz[0].addr,
>              phys_pages, pages_num, rte_bsf32(RTE_PGSIZE_2M));
>
>     RTE_LOG(DEBUG, USER1, "rte_mempool_xmem_create returned %p", mp);
>     return mp;
> }
>
> =================================================================
>
> Please let me know if you have any questions/comments about this code.
>
> Best Regards,
>
> Liran.
>
> On Jan 8, 2015, at 10:19, Newman Poborsky wrote:
>
> I finally found the time to try this, and I noticed that on a server
> with 1 NUMA node this works, but if the server has 2 NUMA nodes then,
> by the default memory policy, the reserved hugepages are divided
> between the nodes, and the DPDK test app again fails for the reason
> already mentioned. I found out that a 'solution' for this is to
> deallocate the hugepages on node1 (after boot) and leave them only on
> node0:
>
> echo 0 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
>
> Could someone please explain what changes when there are hugepages on
> both nodes? Does this cause some memory fragmentation so that there
> aren't enough contiguous segments? If so, how?
>
> Thanks!
>
> Newman
>
> On Mon, Dec 22, 2014 at 11:48 AM, Newman Poborsky wrote:
>
> On Sat, Dec 20, 2014 at 2:34 AM, Stephen Hemminger wrote:
>
> You can reserve hugepages on the kernel cmdline (GRUB).
>
> Great, thanks, I'll try that!
>
> Newman
>
> On Fri, Dec 19, 2014 at 12:13 PM, Newman Poborsky wrote:
>
> On Thu, Dec 18, 2014 at 9:03 PM, Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
>
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin
> Sent: Thursday, December 18, 2014 5:43 PM
> To: Newman Poborsky; dev@dpdk.org
> Subject: Re: [dpdk-dev] rte_mempool_create fails with ENOMEM
>
> Hi
>
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Newman Poborsky
> Sent: Thursday, December 18, 2014 1:26 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] rte_mempool_create fails with ENOMEM
>
> Hi,
>
> could someone please provide any explanation why mempool creation
> sometimes fails with ENOMEM?
>
> I run my test app several times without any problems, and then I start
> getting an ENOMEM error when creating the mempools that are used for
> packets. I try to delete everything from /mnt/huge, I increase the
> number of huge pages and remount /mnt/huge, but nothing helps.
>
> There is more than enough memory on the server. I tried to debug the
> rte_mempool_create() call, and it seems that after the server is
> restarted the free mem segments are bigger than 2MB, but after running
> the test app several times all free mem segments have a size of 2MB,
> and since I am requesting 8MB for my packet mempool, this fails. I'm
> not really sure that this conclusion is correct.
>
> Yes, rte_mempool_create uses rte_memzone_reserve() to allocate a
> single physically contiguous chunk of memory. If no such chunk exists,
> it fails.
>
> Why physically contiguous? The main reason: to make things easier for
> us, as in that case we don't have to worry about the situation where
> an mbuf crosses a page boundary.
>
> So you can overcome that problem like this:
> Allocate the max amount of memory you would need to hold all mbufs in
> the worst case (all pages physically disjoint) using rte_malloc().
>
> Actually, my mistake: rte_malloc() wouldn't help you here. You probably
> need to allocate some external (not managed by EAL) memory in that
> case, maybe mmap() with MAP_HUGETLB or something similar.
>
> Figure out its physical mappings.
> Call rte_mempool_xmem_create().
> You can look at app/test-pmd/mempool_anon.c as a reference. It uses the
> same approach to create a mempool over 4K pages.
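> As a rough sketch of those two steps (illustrative only, not the
> mempool_anon.c code; error handling is minimal, and the pagemap format
> is the one described in the kernel's Documentation/vm/pagemap.txt):
>
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <stdint.h>
> #include <sys/mman.h>
> #include <unistd.h>
>
> /* Map one huge page outside of EAL and resolve its physical address
>  * through /proc/self/pagemap (needs root). Returns 0 on success. */
> static int map_and_resolve(size_t hugepage_sz, void **va, uint64_t *pa)
> {
>     uint64_t entry;
>     int fd;
>
>     *va = mmap(NULL, hugepage_sz, PROT_READ | PROT_WRITE,
>                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
>     if (*va == MAP_FAILED)
>         return -1;
>     *(volatile char *)*va = 0; /* touch the page so it is backed */
>
>     fd = open("/proc/self/pagemap", O_RDONLY);
>     if (fd < 0)
>         return -1;
>     /* One 64-bit entry per 4K virtual page; bits 0-54 hold the PFN. */
>     if (pread(fd, &entry, sizeof(entry),
>               ((uintptr_t)*va / 4096) * sizeof(entry)) != sizeof(entry)) {
>         close(fd);
>         return -1;
>     }
>     close(fd);
>     *pa = (entry & ((1ULL << 55) - 1)) * 4096;
>     return 0;
> }
>
> The resulting physical addresses then go into the page array passed to
> rte_mempool_xmem_create(), just like phys_pages in Liran's code above.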
> We could probably add a similar function to the mempool API
> (create_scatter_mempool or something), or just add a new flag
> (USE_SCATTER_MEM) to rte_mempool_create(), though right now it is not
> there.
>
> Another quick alternative: use 1G pages.
>
> Konstantin
>
> Ok, thanks for the explanation. I understand that this is probably an
> OS question more than a DPDK one, but is there a way to again allocate
> contiguous memory for the n-th run of my test app? It seems that the
> hugepages get divided into individual 2MB hugepages. Shouldn't the OS's
> memory management try to group those hugepages back into one contiguous
> chunk once my app/process is done? Again, I know very little about
> Linux memory management and hugepages, so forgive me if this is a
> stupid question. Is rebooting the OS the only way to deal with this
> problem? Or should I just try to use 1GB hugepages?
>
> p.s. Konstantin, sorry for the double reply, I accidentally forgot to
> include the dev list in my first reply :)
>
> Newman
>
> Does anybody have any idea what to check, and how running my test app
> several times affects hugepages? To me this doesn't make any sense,
> because after the test app exits, its resources should be freed, right?
>
> This has been driving me crazy for days now. I tried reading a bit more
> theory about hugepages, but didn't find anything that could help me.
> Maybe it's something else and completely trivial, but I can't figure it
> out, so any help is appreciated.
>
> Thank you!
>
> BR,
> Newman P.
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> dev mailing list
> dev@dpdk.org
> http://dpdk.org/ml/listinfo/dev
>
> ------------------------------
>
> End of dev Digest, Vol 22, Issue 37
> ***********************************