From: "Tan, Jianfeng"
To: Anatoly Burakov, dev@dpdk.org
Cc: Bruce Richardson, keith.wiles@intel.com, andras.kovacs@ericsson.com,
 laszlo.vadkeri@ericsson.com, benjamin.walker@intel.com, thomas@monjalon.net,
 konstantin.ananyev@intel.com, kuralamudhan.ramakrishnan@intel.com,
 louise.m.daly@intel.com, nelio.laranjeiro@6wind.com, yskoh@mellanox.com,
 pepperjo@japf.ch, jerin.jacob@caviumnetworks.com, hemant.agrawal@nxp.com,
 olivier.matz@6wind.com
Date: Fri, 23 Mar 2018 23:44:43 +0800
In-Reply-To: <98bda79f2d4552ca524a37d7309590c766a77871.1520428025.git.anatoly.burakov@intel.com>
Subject: Re: [dpdk-dev] [PATCH v2 28/41] eal: add support for multiprocess memory hotplug

On 3/8/2018 12:56 AM, Anatoly Burakov wrote:
> This enables multiprocess synchronization for memory hotplug
> requests at runtime (as opposed to initialization).
>
> The basic workflow is as follows. The primary process always does the
> initial mapping and unmapping, and secondary processes always follow
> the primary's page map. Only one allocation request can be active at
> any one time.
>
> When the primary allocates memory, it ensures that all other
> processes have allocated the same set of hugepages successfully;
> otherwise, any allocations made are rolled back and the memory is
> freed back to the heap. The heap is locked throughout the process, so
> no race conditions can occur.
>
> When the primary frees memory, it frees the heap, deallocates the
> affected pages, and notifies other processes of the deallocation.
> Since that memory chunk is freed from the heap, the area becomes
> invisible to other processes even if they happen to fail to unmap
> that specific set of pages, so it is completely safe to ignore the
> results of the sync requests.
>
> When a secondary allocates memory, it does not do so by itself.
> Instead, it sends a request to the primary process to try to allocate
> pages of the specified size on the specified socket, such that the
> heap allocation request in question can complete. The primary process
> then sends all secondaries (including the requestor) a separate
> notification of the allocated pages, and expects all secondary
> processes to report success before considering the pages "allocated".
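
For readers following the thread: the secondary-side flow described
above boils down to a single blocking IPC call. A minimal sketch,
assuming the synchronous flavour of the DPDK IPC API
(rte_mp_request_sync(), named rte_mp_request() in earlier releases);
the message name "malloc_request" and struct malloc_req below are
illustrative placeholders, not identifiers from this patch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <time.h>

#include <rte_eal.h>

struct malloc_req {
	uint64_t alloc_sz; /* how much memory the heap is short of */
	int socket;        /* NUMA node to allocate on */
	int result;        /* filled in by the primary in its reply */
};

static int
request_alloc_from_primary(uint64_t sz, int socket)
{
	struct rte_mp_msg msg;
	struct rte_mp_reply reply;
	struct timespec ts = { .tv_sec = 5, .tv_nsec = 0 };
	struct malloc_req *req = (struct malloc_req *)msg.param;
	int ret = -1;

	memset(&msg, 0, sizeof(msg));
	snprintf(msg.name, sizeof(msg.name), "%s", "malloc_request");
	msg.len_param = sizeof(*req);
	req->alloc_sz = sz;
	req->socket = socket;

	/* Blocks until the primary replies or the timeout expires. A
	 * positive reply means every secondary has already mapped the
	 * new pages, so the caller may retry the heap allocation.
	 */
	if (rte_mp_request_sync(&msg, &reply, &ts) < 0)
		return -1;
	if (reply.nb_received == 1) {
		const struct malloc_req *resp =
			(const struct malloc_req *)reply.msgs[0].param;
		ret = resp->result;
	}
	free(reply.msgs); /* reply array is allocated by the IPC layer */
	return ret;
}

The important property is that by the time a positive reply arrives,
the locked heap already contains the new memory, so the retried heap
allocation cannot race with anything else.
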
> Only after the primary process has ensured that all memory has been
> successfully allocated in all secondary processes will it respond
> positively to the initial request and let the secondary proceed with
> the allocation. Since the heap now has memory that can satisfy the
> allocation request, and it was locked all this time (so no other
> allocations could take place), the secondary process will be able to
> allocate memory from the heap.
>
> When a secondary frees memory, it hides the pages to be deallocated
> from the heap. It then sends a deallocation request to the primary
> process, so that the primary deallocates the pages itself, and then
> sends a separate sync request to all other processes (including the
> requestor) to unmap the same pages. This way, even if the secondary
> fails to notify other processes of the deallocation, that memory
> becomes invisible to other processes and will not be allocated from
> again.
>
> So, to summarize: address space only becomes part of the heap if the
> primary process can ensure that all other processes have allocated
> that memory successfully. If anything goes wrong, the worst that can
> happen is that a page "leaks" and is available to neither DPDK nor
> the system, as some process still holds onto it. It is not an actual
> leak, as we can account for the page - it is just that none of the
> processes can use the page for anything useful until it gets
> allocated from by the primary.
>
> Due to the underlying DPDK IPC implementation being single-threaded,
> some asynchronous magic had to be done, as we need to complete
> several requests before we can definitively allow a secondary
> process to use the allocated memory (namely, it has to be present in
> all other secondary processes before it can be used). Additionally,
> only one allocation request is allowed to be submitted at a time.
>
> Memory allocation requests are only allowed when there are no
> secondary processes currently initializing. To enforce that, a
> shared rwlock is used, which is read-locked on init (so that several
> secondaries can initialize concurrently) and write-locked when making
> allocation requests (so that either secondary init has to wait, or
> the allocation request has to wait until all processes have
> initialized).
>
> Signed-off-by: Anatoly Burakov
> ---
>
> Notes:
>     v2: - fixed deadlocking on init problem
>         - reverted rte_panic changes (fixed by changes in IPC instead)
>
>     This problem is evidently complex to solve without a
>     multithreaded IPC implementation. An alternative approach would
>     be to process each individual message in its own thread (or at
>     least spawn a thread per incoming request) - that way, we can
>     send requests while responding to another request, and this
>     problem becomes trivial to solve (and in fact it was solved that
>     way initially, before my aversion to certain other programming
>     languages kicked in).
>
>     Is the added complexity worth saving a couple of thread spin-ups
>     here and there?
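
The rwlock scheme in the last paragraph is easy to picture. A sketch,
assuming the lock lives in shared memory (e.g. the shared mem config);
the names below are made up for illustration, not taken from the patch:

#include <rte_rwlock.h>

/* assumed to point into shared memory, e.g. struct rte_mem_config */
static rte_rwlock_t *hotplug_lock;

/* secondary init: many secondaries may hold the read lock at once */
static void
secondary_init(void)
{
	rte_rwlock_read_lock(hotplug_lock);
	/* ... attach to the primary's page map ... */
	rte_rwlock_read_unlock(hotplug_lock);
}

/* allocation request: exclusive, so it waits for all initializing
 * secondaries to finish and blocks new ones from starting
 */
static void
alloc_request(void)
{
	rte_rwlock_write_lock(hotplug_lock);
	/* ... run the multiprocess alloc/free protocol ... */
	rte_rwlock_write_unlock(hotplug_lock);
}
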
>  lib/librte_eal/bsdapp/eal/Makefile                |   1 +
>  lib/librte_eal/common/eal_common_memory.c         |  16 +-
>  lib/librte_eal/common/include/rte_eal_memconfig.h |   3 +
>  lib/librte_eal/common/malloc_heap.c               | 255 ++++++--
>  lib/librte_eal/common/malloc_mp.c                 | 723 ++++++++++++++++++++++
>  lib/librte_eal/common/malloc_mp.h                 |  86 +++
>  lib/librte_eal/common/meson.build                 |   1 +
>  lib/librte_eal/linuxapp/eal/Makefile              |   1 +
>  8 files changed, 1040 insertions(+), 46 deletions(-)
>  create mode 100644 lib/librte_eal/common/malloc_mp.c
>  create mode 100644 lib/librte_eal/common/malloc_mp.h

...

> +/* callback for asynchronous sync requests for primary. this will either do a
> + * sendmsg with results, or trigger rollback request.
> + */
> +static int
> +handle_sync_response(const struct rte_mp_msg *request,

Rename to handle_async_response()?
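
If this is the async-reply callback, the renamed handler would look
roughly like the sketch below; the second parameter is my assumption,
since the quoted signature is truncated:

#include <rte_eal.h>

static int
handle_async_response(const struct rte_mp_msg *request,
		const struct rte_mp_reply *reply)
{
	/* on success, sendmsg the results back;
	 * on failure, trigger a rollback request
	 */
	return 0;
}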