From: Fengnan Chang
Date: Mon, 22 May 2023 20:09:03 +0800
Subject: Re: [External] Re: [PATCH] eal: fix eal init may failed when too much continuous memsegs under legacy mode
To: "Burakov, Anatoly"
Cc: dev@dpdk.org, Lin Li
References: <20230516122108.38617-1-changfengnan@bytedance.com>

Burakov, Anatoly wrote on Saturday, 20 May 2023 at 23:03:
>
> Hi,
>
> On 5/16/2023 1:21 PM, Fengnan Chang wrote:
> > Under legacy mode, if the number of contiguous memsegs is greater
> > than RTE_MAX_MEMSEG_PER_LIST, EAL init will fail even though
> > another memseg list is empty, because only one memseg list is
> > checked in remap_needed_hugepages.
> >
> > For example:
> > hugepage configure:
> > 20480
> > 13370
> > 7110
> >
> > startup log:
> > EAL: Detected memory type: socket_id:0 hugepage_sz:2097152
> > EAL: Detected memory type: socket_id:1 hugepage_sz:2097152
> > EAL: Creating 4 segment lists: n_segs:8192 socket_id:0 hugepage_sz:2097152
> > EAL: Creating 4 segment lists: n_segs:8192 socket_id:1 hugepage_sz:2097152
> > EAL: Requesting 13370 pages of size 2MB from socket 0
> > EAL: Requesting 7110 pages of size 2MB from socket 1
> > EAL: Attempting to map 14220M on socket 1
> > EAL: Allocated 14220M on socket 1
> > EAL: Attempting to map 26740M on socket 0
> > EAL: Could not find space for memseg. Please increase 32768 and/or 65536 in
> > configuration.
>
> Unrelated, but this is probably a wrong message: it should have called
> out the config options to change, not their values. Sounds like a log
> message needs fixing somewhere...

In the older version, the log was:
EAL: Could not find space for memseg. Please increase
CONFIG_RTE_MAX_MEMSEG_PER_TYPE and/or CONFIG_RTE_MAX_MEM_PER_TYPE in
configuration.
Maybe that is better?

> > EAL: Couldn't remap hugepage files into memseg lists
> > EAL: FATAL: Cannot init memory
> > EAL: Cannot init memory
> >
> > Signed-off-by: Fengnan Chang
> > Signed-off-by: Lin Li
> > ---
> >   lib/eal/linux/eal_memory.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/lib/eal/linux/eal_memory.c b/lib/eal/linux/eal_memory.c
> > index 60fc8cc6ca..36b9e78f5f 100644
> > --- a/lib/eal/linux/eal_memory.c
> > +++ b/lib/eal/linux/eal_memory.c
> > @@ -1001,6 +1001,8 @@ remap_needed_hugepages(struct hugepage_file *hugepages, int n_pages)
> >               if (cur->size == 0)
> >                       break;
> >
> > +             if (cur_page - seg_start_page >= RTE_MAX_MEMSEG_PER_LIST)
> > +                     new_memseg = 1;
>
> I don't think this is quite right, because technically,
> `RTE_MAX_MEMSEG_PER_LIST` only applies to the smaller page size segment
> lists - larger page size segment lists will hit their limits earlier.
> So, while this will work for 2MB pages, it won't work for page sizes
> whose segment list length is smaller than the maximum (such as 1GB
> pages).
>
> I think this solution could be improved upon by trying to break up the
> contiguous area instead. I suspect the core of the issue is not even
> the fact that we're exceeding the limits of one memseg list, but that
> we're always attempting to map exactly N pages in `remap_hugepages`,
> which results in us leaving large contiguous zones inside memseg lists
> unused, because we couldn't satisfy the current allocation request and
> skipped to a new memseg list.

Correct, I didn't consider the 1GB pages case. I get your point. Thanks.

> For example, let's suppose we found a large contiguous area that
> would've exceeded the limits of the current memseg list. Sooner or
> later, this contiguous area will end, and we'll attempt to remap this
> virtual area into a memseg list. Whenever that happens, we call into
> the remap code, which will start with the first segment list, attempt
> to find exactly N free spots, fail to do so, and skip to the next
> segment list.
>
> Thus, sooner or later, if we get contiguous areas that are large
> enough, we will not populate our memseg lists but instead skip through
> them, and start with a new memseg list every time we need a large
> contiguous area. We prioritize having a large contiguous area over
> using up all of our memory map.
> If, instead, we could break up the allocation - that is, use
> `rte_fbarray_find_biggest_free()` instead of
> `rte_fbarray_find_next_n_free()` - and keep doing that until we run
> out of segment lists, we would achieve the same result your patch
> does, but have it work for all page sizes, because now we would be
> targeting the actual issue (under-utilization of memseg lists), not
> its symptoms (exceeding segment list limits for large allocations).
>
> This logic could either live inside `remap_hugepages`, or we could
> just return the number of pages mapped from `remap_hugepages` and have
> the calling code (`remap_needed_hugepages`) try again, this time with
> a different start segment, reflecting how many pages we actually
> mapped. IMO this would be easier to implement, as `remap_hugepages` is
> overly complex as it is!
>
> >           if (cur_page == 0)
> >                   new_memseg = 1;
> >           else if (cur->socket_id != prev->socket_id)
> > --
> Thanks,
> Anatoly
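
To make sure I follow the idea, here is a rough, untested sketch of the
retry approach (remap_chunk() is a hypothetical helper standing in for
the part of the remap code that fills one memseg list; the fbarray
calls are the existing rte_fbarray.h API):

/*
 * Sketch only: instead of demanding exactly 'need' free slots in one
 * list, take the biggest free run each list offers and let the caller
 * retry with the remainder.
 */
static int
remap_chunk(struct hugepage_file *hugepages, int seg_start, int need)
{
	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
	int msl_idx;

	for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
		struct rte_memseg_list *msl = &mcfg->memsegs[msl_idx];
		int start, avail, take;

		/* ... skip lists with the wrong page size/socket,
		 * same checks as today ... */

		start = rte_fbarray_find_biggest_free(&msl->memseg_arr, 0);
		if (start < 0)
			continue; /* this list is full, try the next one */
		avail = rte_fbarray_find_contig_free(&msl->memseg_arr, start);
		take = RTE_MIN(avail, need);

		/* ... mmap 'take' pages of 'hugepages' starting at
		 * 'seg_start' into 'msl' at slot 'start', and mark
		 * them used in the fbarray ... */
		return take; /* caller retries for the remainder */
	}
	return -1; /* genuinely out of space */
}

Then remap_needed_hugepages() would loop once a contiguous run ends,
instead of giving up on the first list that can't hold the whole run:

	while (need > 0) {
		int mapped = remap_chunk(hugepages, seg_start, need);
		if (mapped <= 0)
			return -1;
		seg_start += mapped;
		need -= mapped;
	}

If that matches what you had in mind, I'll rework the patch this way.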