From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from dpdk.org (dpdk.org [92.243.14.124])
	by inbox.dpdk.org (Postfix) with ESMTP id B7C8EA32A8
	for <public@inbox.dpdk.org>; Sat, 26 Oct 2019 14:25:44 +0200 (CEST)
Received: from [92.243.14.124] (localhost [127.0.0.1])
	by dpdk.org (Postfix) with ESMTP id C53FF1BF75;
	Sat, 26 Oct 2019 14:25:42 +0200 (CEST)
Received: from mail-wm1-f65.google.com (mail-wm1-f65.google.com
 [209.85.128.65]) by dpdk.org (Postfix) with ESMTP id 737301BF65
 for <dev@dpdk.org>; Sat, 26 Oct 2019 14:25:41 +0200 (CEST)
Received: by mail-wm1-f65.google.com with SMTP id n7so4861869wmc.3
 for <dev@dpdk.org>; Sat, 26 Oct 2019 05:25:41 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=6wind.com; s=google;
 h=date:from:to:cc:subject:message-id:references:mime-version
 :content-disposition:in-reply-to:user-agent;
 bh=7a3cCF+0/Zm6jnGnEYf1mMcyuxXrcpYDhxRuQZf9t8U=;
 b=fdoUxx3x/8cHTAlF0w06bSrpyFIb9I6tRU2yNrt7gm5NgnonOD53YYMNecNXdPOVYp
 n36xlwXGLLRWZUrkgSGRulkKhic/cvejsF+JTkw8Tq45C20mThPMkUhyvW7vs9BXvLKe
 R0aMiFfeIcUxKX/tgIeZoUlvVWF27dt6sWZGn2fLKeEZunKb7owVjgvai6sEO0DSgF8F
 F1Y6LblgUWjKZKWJw2lc33i3mXoZ8x6WzaLuQE1dltj3btlnG3Fz3Qg2roZ0a2t0aCIr
 Ba2bOljIJugGID8A+y4YWkqjuNVbS5kMsqKNtWMw4fG7WsZV04L1bXeNyLjw9IONt3jF
 WMAA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:date:from:to:cc:subject:message-id:references
 :mime-version:content-disposition:in-reply-to:user-agent;
 bh=7a3cCF+0/Zm6jnGnEYf1mMcyuxXrcpYDhxRuQZf9t8U=;
 b=fLAsYj5AZfatOGKnrsMfLIVT1PrYdCW9Npy0ayv5vU0sgobvXIj+T/BkM4wSNTF6VY
 /32FF7LTBlJD6iPAv4HzJKYJdbnnc7JR2zUA1V9emoYcUsVn3o4umVJMGdRq4nFHL4j+
 ebI9h8qX7uu9U/qbdbZ3F6sSKYU2M56PNJlOPgyCCmaw/kv/yABzsYOSsRNnxFmixhm9
 pjhItKIZtJAGDS5W6bYJNJjnphFBdXKzQVjJSFCtEHl2VcLnG+sapfc9Hhv6dnGLAPfR
 7h44jKV/kbhLVFKFS+gc3SNRwUnhumLM5M2hErYmdf1BCvbY1FJi6kEmHRQmR08xlY31
 zlqA==
X-Gm-Message-State: APjAAAV6VxdLajC+zMpCYBdYZp0Us7g+DG+nRvuuUzf9JvrfCkctOhR8
 bQSpqIbQmhYAihph6MZ4pSnwig==
X-Google-Smtp-Source: APXvYqzbgXQ+WS0jImzDuVNN9kpBoLRtO8ePd+/VKurvzFV1E1PHArrADzWOAED5tocc69DCaQZ7Gw==
X-Received: by 2002:a05:600c:142:: with SMTP id
 w2mr7452073wmm.121.1572092741001; 
 Sat, 26 Oct 2019 05:25:41 -0700 (PDT)
Received: from 6wind.com (2a01cb0c0005a6000226b0fffeed02fc.ipv6.abo.wanadoo.fr.
 [2a01:cb0c:5:a600:226:b0ff:feed:2fc])
 by smtp.gmail.com with ESMTPSA id p126sm5677375wme.0.2019.10.26.05.25.26
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Sat, 26 Oct 2019 05:25:27 -0700 (PDT)
Date: Sat, 26 Oct 2019 14:25:25 +0200
From: Olivier Matz <olivier.matz@6wind.com>
To: Vamsi Krishna Attunuru <vattunuru@marvell.com>
Cc: Jerin Jacob <jerinjacobk@gmail.com>,
 Andrew Rybchenko <arybchenko@solarflare.com>,
 Ferruh Yigit <ferruh.yigit@intel.com>,
 "thomas@monjalon.net" <thomas@monjalon.net>,
 Jerin Jacob Kollanukkaran <jerinj@marvell.com>,
 Kiran Kumar Kokkilagadda <kirankumark@marvell.com>,
 "anatoly.burakov@intel.com" <anatoly.burakov@intel.com>,
 "stephen@networkplumber.org" <stephen@networkplumber.org>,
 "dev@dpdk.org" <dev@dpdk.org>
Message-ID: <20191026122525.ny6wwtrnfw32367j@platinum>
References: <77f8eaf0-52ca-1295-973d-c8085f7b7736@intel.com>
 <MWHPR18MB1645FF9B1A7E429E23F1F881A6690@MWHPR18MB1645.namprd18.prod.outlook.com>
 <08c426d1-6fc9-1c3f-02d4-8632a8e3c337@solarflare.com>
 <MWHPR18MB1645B82FE6E75C4B8E6FE84BA6680@MWHPR18MB1645.namprd18.prod.outlook.com>
 <CALBAE1O4biRt5vcirv5febUz+7QDMynTaBAxOovgTXaiTjaVqQ@mail.gmail.com>
 <20191023144724.GO25286@glumotte.dev.6wind.com>
 <CALBAE1Og1saVEdZOvW=7PeLK6V9N_gmVnNUj2zLVby88JsfmNA@mail.gmail.com>
 <20191024173506.GU25286@glumotte.dev.6wind.com>
 <CALBAE1PN5_Qt=pP0cRZ9Bi4UKUCcvuhsMQSOQV7t2P8R35XWAQ@mail.gmail.com>
 <MWHPR18MB16450B279AC61FC5A59E7476A6650@MWHPR18MB1645.namprd18.prod.outlook.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <MWHPR18MB16450B279AC61FC5A59E7476A6650@MWHPR18MB1645.namprd18.prod.outlook.com>
User-Agent: NeoMutt/20180716
Subject: Re: [dpdk-dev] [EXT] Re: [PATCH v11 2/4] eal: add legacy kni option
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

Hi Jerin, Hi Vamsi,

On Fri, Oct 25, 2019 at 09:20:20AM +0000, Vamsi Krishna Attunuru wrote:
> 
> 
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Friday, October 25, 2019 1:01 AM
> > To: Olivier Matz <olivier.matz@6wind.com>
> > Cc: Vamsi Krishna Attunuru <vattunuru@marvell.com>; Andrew Rybchenko
> > <arybchenko@solarflare.com>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Kiran
> > Kumar Kokkilagadda <kirankumark@marvell.com>; anatoly.burakov@intel.com;
> > stephen@networkplumber.org; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [EXT] Re: [PATCH v11 2/4] eal: add legacy kni option
> > 
> > On Thu, Oct 24, 2019 at 11:05 PM Olivier Matz <olivier.matz@6wind.com>
> > wrote:
> > >
> > > Hi,
> > >
> > > On Wed, Oct 23, 2019 at 08:32:08PM +0530, Jerin Jacob wrote:
> > > > On Wed, Oct 23, 2019 at 8:17 PM Olivier Matz <olivier.matz@6wind.com>
> > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Wed, Oct 23, 2019 at 03:42:39PM +0530, Jerin Jacob wrote:
> > > > > > On Tue, Oct 22, 2019 at 7:01 PM Vamsi Krishna Attunuru
> > > > > > <vattunuru@marvell.com> wrote:
> > > > > > >
> > > > > > > Hi Ferruh,
> > > > > > >
> > > > > > > Can you please explain the problems in using kni dedicated mbuf alloc
> > routines while enabling kni iova=va mode. Please see the below discussion with
> > Andrew. He wanted to know the problems in having newer APIs.
> > > > > >
> > > > > >
> > > > > > While waiting  for the Ferruh reply, I would like to summarise
> > > > > > the current status
> > > > > >
> > > > > > # In order to make KNI work with IOVA as VA, We need to make
> > > > > > sure mempool pool _object_ should not span across two huge pages
> > > > > >
> > > > > > # This problem can be fixed by, either of:
> > > > > >
> > > > > > a) Introduce a flag in mempool to define this constraint, so
> > > > > > that, when only needed, this constraint enforced and this is in
> > > > > > line with existing semantics of addressing such problems in
> > > > > > mempool
> > > > > >
> > > > > > b) Instead of creating a flag, Make this behavior by default in
> > > > > > mempool for IOVA as VA case
> > > > > >
> > > > > > Upside:
> > > > > > b1) There is no need for specific mempool_create for KNI.
> > > > > >
> > > > > > Downside:
> > > > > > b2) Not align with existing mempool API semantics
> > > > > > b3) There will be a trivial amount of memory waste as we can not
> > > > > > allocate from the edge. Considering the normal huge page memory
> > > > > > size is 1G or 512MB this not a real issue.
> > > > > >
> > > > > > c) Make IOVA as PA when KNI kernel module is loaded
> > > > > >
> > > > > > Upside:
> > > > > > c1) Doing option (a) would call for new KNI specific mempool
> > > > > > create API i.e existing KNI applications need a one-line change
> > > > > > in application to make it work with release 19.11 or later.
> > > > > >
> > > > > > Downslide:
> > > > > > c2) Driver which needs RTE_PCI_DRV_NEED_IOVA_AS_VA can not work
> > > > > > with KNI
> > > > > > c3) Need root privilege to run KNI as IOVA as PA need root
> > > > > > privilege
> > > > > >
> > > > > > For the next year, we expect applications to work 19.11 without
> > > > > > any code change. My personal opinion to make go with option (a)
> > > > > > and update the release notes to document the change any it
> > > > > > simple one-line change.
> > > > > >
> > > > > > The selection of (a) vs (b) is between KNI and Mempool maintainers.
> > > > > > Could we please reach a consensus? Or can we discuss this TB meeting?
> > > > > >
> > > > > > We are going back and forth on this feature on for the last 3
> > > > > > releases. Now that, we solved all the technical problems, please
> > > > > > help us to decide (a) vs (b) to make forward progress.
> > > > >
> > > > > Thank you for the summary.
> > > > > What is not clear to me is if (a) or (b) may break an existing
> > > > > application, and if yes, in which case.
> > > >
> > > > Thanks for the reply.
> > > >
> > > > To be clear we are talking about out of tree KNI tree application.
> > > > Which they don't want to
> > > > change rte_pktmbuf_pool_create() to rte_kni_pktmbuf_pool_create()
> > > > and build for v19.11
> > > >
> > > > So in case (b) there is no issue as It will be using rte_pktmbuf_pool_create ().
> > > > But in case of (a) it will create an issue if out of tree KNI
> > > > application is using rte_pktmbuf_pool_create() which is not using
> > > > the NEW flag.
> > >
> > > Following yesterday's discussion at techboard, I looked at the mempool
> > > code and at my previous RFC patch. It took some time to remind me what
> > > was my worries.
> > 
> > Thanks for the review Olivier.
> > 
> > Just to make sure the correct one is reviewed.
> > 
> > 1) v7 had similar issue mentioned
> > https://urldefense.proofpoint.com/v2/url?u=http-
> > 3A__patches.dpdk.org_patch_56585_&d=DwIBaQ&c=nKjWec2b6R0mOyPaz7xtf
> > Q&r=WllrYaumVkxaWjgKto6E_rtDQshhIhik2jkvzFyRhW8&m=MMwAZe76YMVHe
> > 8UcHjL4IBnfX5YvtbocwICAZGBY97A&s=mfN_afnyFm65sQYzaAg_-
> > uM9o22A5j392TdBZY-bKK4&e=

The v7 has the problem I described below: the iova-contiguous allocation
may fail because the calculated size is too small, and remaining objects
will be added in another chunk. This can happen if a fully
iova-contiguous mempool is requested (it happens for octeontx).

Also, the way page size is retrieved in
rte_mempool_op_populate_default() assume that memzones are used, which
is not correct.

> > 2) v11 addressed the review comments and you have given the Acked-by for
> > mempool change https://urldefense.proofpoint.com/v2/url?u=http-
> > 3A__patches.dpdk.org_patch_61559_&d=DwIBaQ&c=nKjWec2b6R0mOyPaz7xtf
> > Q&r=WllrYaumVkxaWjgKto6E_rtDQshhIhik2jkvzFyRhW8&m=MMwAZe76YMVHe
> > 8UcHjL4IBnfX5YvtbocwICAZGBY97A&s=frFvKOHFDRhTam6jDZZc6omK2gb1RU62
> > xzAiiBMnf0I&e=

The v11 looked fine to me, because it does not impact any default
behavior, and the plan was to remove this new function when my mempool
patchset was in.


> > 
> > My thought process in the TB meeting was, since
> > rte_mempool_populate_from_pg_sz_chunks() reviwed replace
> > rte_pktmbuf_pool_create's  rte_mempool_populate_default() with
> > rte_mempool_populate_from_pg_sz_chunks()
> > in IOVA == VA case to avoid a new KNI mempool_create API.
> > 
> > >
> > > Currently, in rte_mempool_populate_default(), when the mempool is
> > > populated, we first try to allocate one iova-contiguous block of (n *
> > > elt_size). On success, we use this memory to fully populate the
> > > mempool without taking care of crossing page boundaries.
> > >
> > > If we change the behavior to prevent objects from crossing pages, the
> > > assumption that allocating (n * elt_size) is always enough becomes
> > > wrong.  By luck, there is no real impact, because if the mempool is
> > > not fully populated after this first iteration, it will allocate a new
> > > chunk.
> > >
> > > To be rigorous, we need to better calculate the amount of memory to
> > > allocate, according to page size.
> 
> Hi Olivier,
> 
> Thanks for the review, I think the below mentioned problems exist with 
> current mempool_populate_default() api and will there be high chances of hitting 
> those problems when we precalculate the memory size(after preventing objs from 
> pg boundary and fit complete mempool memory in single mem chunk) and if
> mempool size goes beyond page size as below example. ?,  

Yes, the problem described below (alloc a mempool of 1.1GB resulting in
2GB reserved) exists in the current version. It will be fixed in the new
version of my "mempool: avoid objects allocations across pages"
patchset.

FYI, a reworked patchset is alsmost ready, I'll send it monday.

> 
> Regards,
> Vamsi
> 
> > >
> > > Looking at the code, I found another problem in the same area: let's
> > > say we populate a mempool that requires 1.1GB (and we use 1G huge pages):
> > >	
> > > 1/ mempool code will first tries to allocate an iova-contiguous zone
> > >    of 1.1G -> fail
> > > 2/ it then tries to allocate a page-aligned non iova-contiguous
> > >    zone of 1.1G, which is 2G. On success, a lot of memory is wasted.
> > > 3/ on error, we try to allocate the biggest zone, it can still return
> > >    a zone between 1.1G and 2G, which can also waste memory.
> > >
> > > I will rework my mempool patchset to properly address these issues,
> > > hopefully tomorrow.
> > 
> > OK.
> > 
> > 
> > >
> > > Also, I thought about another idea to solve your issue, not sure it is
> > > better but it would not imply to change the mempool behavior. If I
> > > understood the problem, when a mbuf is accross 2 pages, the copy of
> > > the data can fail in kni because the mbuf is not virtually contiguous
> > > in the
> > 
> > For KNI use case, we would need _physically_ contiguous to make sure that
> > using, get_user_pages_remote() we get  physically contiguous memory map, so
> > that both KNI kernel thread and KNI kernel context and DPDK userspace can use
> > the same memory in different contexts.
> > 
> > 
> > 
> > > kernel. So why not in this case splitting the memcpy() into several,
> > > each of them being on a single page (and calling phys2virt() for each
> > > page)? The same would have to be done when accessing the fields of the
> > > mbuf structure if it crosses a page boundary. Would that work? This
> > 
> > If the above is the requirement, Does this logic need to be in slow path or fast
> > path?

In fast path. But I don't think the performance impact would be
significative.

Vamsi, do you confirm this approach could also solve the issue without
changing the mempool?

> > 
> > > could be a B plan.
> > 
> > OK.
> > 
> > >
> > > Olivier