Date: Thu, 24 Oct 2019 19:35:06 +0200
From: Olivier Matz
To: Jerin Jacob
Cc: Vamsi Krishna Attunuru, Andrew Rybchenko, Ferruh Yigit, thomas@monjalon.net,
 Jerin Jacob Kollanukkaran, Kiran Kumar Kokkilagadda, anatoly.burakov@intel.com,
 stephen@networkplumber.org, dev@dpdk.org
Subject: Re: [dpdk-dev] [EXT] Re: [PATCH v11 2/4] eal: add legacy kni option

Hi,

On Wed, Oct 23, 2019 at 08:32:08PM +0530, Jerin Jacob wrote:
> On Wed, Oct 23, 2019 at 8:17 PM Olivier Matz wrote:
> >
> > Hi,
> >
> > On Wed, Oct 23, 2019 at 03:42:39PM +0530, Jerin Jacob wrote:
> > > On Tue, Oct 22, 2019 at 7:01 PM Vamsi Krishna Attunuru wrote:
> > > >
> > > > Hi Ferruh,
> > > >
> > > > Can you please explain the problems in using the KNI-dedicated mbuf
> > > > alloc routines while enabling KNI iova=va mode? Please see the
> > > > discussion with Andrew below; he wanted to know the problems in
> > > > having newer APIs.
> > > >
> > >
> > > While waiting for Ferruh's reply, I would like to summarise the
> > > current status.
> > >
> > > # In order to make KNI work with IOVA as VA, we need to make sure a
> > > mempool _object_ does not span across two huge pages.
> > >
> > > # This problem can be fixed by either of:
> > >
> > > a) Introduce a flag in mempool to define this constraint, so that the
> > > constraint is enforced only when needed. This is in line with the
> > > existing semantics of addressing such problems in mempool.
> > >
> > > b) Instead of creating a flag, make this the default behavior in
> > > mempool for the IOVA as VA case.
> > >
> > > Upside:
> > > b1) There is no need for a KNI-specific mempool_create.
> > >
> > > Downside:
> > > b2) Not aligned with the existing mempool API semantics.
> > > b3) There will be a trivial amount of memory waste, as we cannot
> > > allocate from the edge of a page. Considering the normal huge page
> > > size is 1G or 512MB, this is not a real issue.
> > >
> > > c) Make IOVA as PA when the KNI kernel module is loaded.
> > >
> > > Upside:
> > > c1) Option (a) would call for a new KNI-specific mempool create API,
> > > i.e. existing KNI applications would need a one-line change to work
> > > with release 19.11 or later; option (c) avoids that change.
> > >
> > > Downside:
> > > c2) Drivers which need RTE_PCI_DRV_NEED_IOVA_AS_VA cannot work with KNI.
> > > c3) Root privilege is needed to run KNI, as IOVA as PA needs root
> > > privilege.
> > >
> > > For the next year, we expect applications to work with 19.11 without
> > > any code change. My personal opinion is to go with option (a) and
> > > update the release notes to document the change; it is a simple
> > > one-line change.
> > >
> > > The selection of (a) vs (b) is between the KNI and mempool
> > > maintainers. Could we please reach a consensus? Or can we discuss
> > > this at the TB meeting?
> > >
> > > We have been going back and forth on this feature for the last 3
> > > releases. Now that we have solved all the technical problems, please
> > > help us decide (a) vs (b) to make forward progress.
> >
> > Thank you for the summary.
> > What is not clear to me is if (a) or (b) may break an existing
> > application, and if yes, in which case.
>
> Thanks for the reply.
>
> To be clear, we are talking about out-of-tree KNI applications that do
> not want to change rte_pktmbuf_pool_create() to
> rte_kni_pktmbuf_pool_create() and rebuild for v19.11.
>
> So in case (b) there is no issue, as the application keeps using
> rte_pktmbuf_pool_create().
> But in case (a) there will be an issue if an out-of-tree KNI application
> uses rte_pktmbuf_pool_create(), which does not pass the new flag.

Following yesterday's discussion at the techboard, I looked at the mempool
code and at my previous RFC patch. It took some time to remember what my
worries were.

Currently, in rte_mempool_populate_default(), when the mempool is
populated, we first try to allocate one iova-contiguous block of
(n * elt_size). On success, we use this memory to fully populate the
mempool without taking care of crossing page boundaries.

If we change the behavior to prevent objects from crossing pages, the
assumption that allocating (n * elt_size) is always enough becomes wrong.
By luck, there is no real impact, because if the mempool is not fully
populated after this first iteration, it will allocate a new chunk. To be
rigorous, we need to better calculate the amount of memory to allocate,
according to the page size (see sketch 2 at the end of this mail).

Looking at the code, I found another problem in the same area. Let's say
we populate a mempool that requires 1.1G, and we use 1G huge pages:

1/ the mempool code first tries to allocate an iova-contiguous zone of
   1.1G -> this fails
2/ it then tries to allocate a page-aligned, non iova-contiguous zone of
   1.1G, which ends up being 2G. On success, a lot of memory is wasted.
3/ on error, we try to allocate the biggest available zone; it can still
   return a zone between 1.1G and 2G, which can also waste memory.

I will rework my mempool patchset to properly address these issues,
hopefully tomorrow.

Also, I thought about another idea to solve your issue. I am not sure it
is better, but it would not imply changing the mempool behavior. If I
understood the problem correctly, when an mbuf spans 2 pages, the copy of
the data can fail in KNI because the mbuf is not virtually contiguous in
the kernel. So why not, in this case, split the memcpy() into several
copies, each of them staying within a single page (and calling
phys2virt() for each page)? The same would have to be done when accessing
the fields of the mbuf structure if it crosses a page boundary (see
sketch 3 at the end of this mail). Would that work?

This could be a plan B.

Olivier
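---
Sketch 1: the "one-line change" discussed for option (a), seen from the
application side. rte_kni_pktmbuf_pool_create() is the helper proposed in
this patch series; its signature is assumed here to mirror
rte_pktmbuf_pool_create(), so treat this as an illustration rather than
the final API.

#include <rte_mbuf.h>
#include <rte_kni.h>

#define NB_MBUF  8192
#define CACHE_SZ 250
#define DATA_SZ  RTE_MBUF_DEFAULT_BUF_SIZE

static struct rte_mempool *
create_pktmbuf_pool(int socket_id)
{
        /* Before (fine everywhere except KNI with IOVA as VA): */
        /* return rte_pktmbuf_pool_create("mbuf_pool", NB_MBUF, CACHE_SZ,
         *                                0, DATA_SZ, socket_id); */

        /* After: the KNI-specific helper is expected to guarantee that
         * no mbuf object spans two pages, so the pool stays usable with
         * IOVA as VA. */
        return rte_kni_pktmbuf_pool_create("mbuf_pool", NB_MBUF, CACHE_SZ,
                                           0, DATA_SZ, socket_id);
}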
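Sketch 2 (page-aware sizing): a standalone calculation showing why
(n * elt_size) is no longer a sufficient estimate once objects may not
cross a page boundary. The element size and count are made-up example
values, not taken from the patch.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
        const uint64_t pg_sz = 1ULL << 30; /* 1G huge page */
        const uint64_t elt_size = 2368;    /* example object size */
        const uint64_t n = 500000;         /* example object count */

        /* estimate used today for the first iova-contiguous attempt */
        uint64_t naive = n * elt_size;

        /* page-aware minimum: each full page holds only
         * floor(pg_sz / elt_size) objects; the bytes left at the end of
         * the page cannot be used because an object may not straddle
         * two pages */
        uint64_t objs_per_page = pg_sz / elt_size;
        uint64_t full_pages = n / objs_per_page;
        uint64_t leftover = n % objs_per_page;
        uint64_t page_aware = full_pages * pg_sz + leftover * elt_size;

        printf("naive estimate:      %" PRIu64 " bytes\n", naive);
        printf("page-aware estimate: %" PRIu64 " bytes\n", page_aware);
        return 0;
}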
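Sketch 3 (page-split copy): the plan B idea above, i.e. splitting the
memcpy() at page boundaries. PG_SZ and pa2kva() are placeholders:
pa2kva() stands for whatever physical-to-kernel-virtual translation the
KNI kernel module already uses (the phys2virt() mentioned above), so this
only shows the loop structure, not actual kni code.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PG_SZ   4096ULL       /* example page size */
#define PG_MASK (PG_SZ - 1)

/* hypothetical physical-to-kernel-virtual translation */
typedef void *(*pa2kva_t)(uint64_t pa);

static void
copy_from_phys(void *dst, uint64_t src_pa, size_t len, pa2kva_t pa2kva)
{
        uint8_t *d = dst;

        while (len > 0) {
                /* bytes remaining before src_pa hits a page boundary */
                size_t in_page = (size_t)(PG_SZ - (src_pa & PG_MASK));
                size_t chunk = len < in_page ? len : in_page;

                /* translate this page only, then copy the part of the
                 * buffer that lies within it */
                memcpy(d, pa2kva(src_pa), chunk);

                d += chunk;
                src_pa += chunk;
                len -= chunk;
        }
}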