Date: Thu, 21 Oct 2021 15:33:05 +0300
From: Dmitry Kozlyuk
To: Harman Kalra
Cc: Stephen Hemminger, Thomas Monjalon, david.marchand@redhat.com, dev@dpdk.org, Ray Kinsella
Subject: Re: [dpdk-dev] [EXT] Re: [PATCH v3 2/7] eal/interrupts: implement get set APIs
Message-ID: <20211021153305.664c7216@sovereign>

2021-10-21 09:16 (UTC+0000), Harman Kalra:
> > -----Original Message-----
> > From: Dmitry Kozlyuk
> > Sent: Wednesday, October 20, 2021 9:01 PM
> > To: Harman Kalra
> > Cc: Stephen Hemminger; Thomas Monjalon; david.marchand@redhat.com;
> > dev@dpdk.org; Ray Kinsella
> > Subject: Re: [EXT] Re: [dpdk-dev] [PATCH v3 2/7] eal/interrupts: implement
> > get set APIs
> >
> > > >
> > > > > +	/* Detect if DPDK malloc APIs are ready to be used. */
> > > > > +	mem_allocator = rte_malloc_is_ready();
> > > > > +	if (mem_allocator)
> > > > > +		intr_handle = rte_zmalloc(NULL, sizeof(struct rte_intr_handle),
> > > > > +					  0);
> > > > > +	else
> > > > > +		intr_handle = calloc(1, sizeof(struct rte_intr_handle));
> > > >
> > > > This is a problematic way to do this.
> > > > The reason to use rte_malloc vs malloc should be determined by usage.
> > > >
> > > > If the pointer will be shared between primary/secondary process then
> > > > it has to be in hugepages (i.e. rte_malloc). If it is not shared
> > > > then use regular malloc.
> > > >
> > > > But what you have done is created a method which will be a latent
> > > > bug for anyone using primary/secondary process.
> > > >
> > > > Either:
> > > >     intr_handle is not allowed to be used in secondary.
> > > >     Then always use malloc().
> > > > Or:
> > > >     intr_handle can be used by both primary and secondary.
> > > >     Then always use rte_malloc().
> > > >     Any code path that allocates intr_handle before pool is
> > > >     ready is broken.
> > >
> > > Hi Stephen,
> > >
> > > Till V2, I implemented this API in a way where the user of the API can
> > > choose whether the intr handle is allocated using malloc or
> > > rte_malloc by passing a flag arg to the rte_intr_instance_alloc API.
> > > The user of the API will best know if the intr handle is to be shared
> > > with secondary or not.
> > >
> > > But after some discussions and suggestions from the community we
> > > decided to drop that flag argument and auto detect whether
> > > rte_malloc APIs are ready to be used and thereafter make all further
> > > allocations via rte_malloc.
> > > Currently the alarm subsystem (or any driver doing allocation in a
> > > constructor) gets its interrupt instance allocated using glibc malloc,
> > > and only because rte_malloc* is not ready by rte_eal_alarm_init(), while
> > > all further consumers get their instance allocated via rte_malloc.
> >
> > Just as a comment, bus scanning is the real issue, not the alarms.
> > Alarms could be initialized after the memory management (but it's irrelevant
> > because their handle is not accessed from the outside).
> > However, MM needs to know bus IOVA requirements to initialize, which is
> > usually determined by at least bus device requirements.
> >
> > > I think this should not cause any issue in primary/secondary model as
> > > all interrupt instance pointers will be shared.
> >
> > What do you mean? Aren't we discussing the issue that those allocated early
> > are not shared?
> >
> > > In fact, to avoid any surprises of primary/secondary not working we
> > > thought of making all allocations via rte_malloc.
> >
> > I don't see why anyone would not make them shared.
> > In order to only use rte_malloc(), we need:
> > 1. In bus drivers, move handle allocation from scan to probe stage.
> > 2. In EAL, move alarm initialization to after the MM.
> > It all can be done later with v3 design, but there are out-of-tree drivers.
> > We need to force them to make step 1 at some point.
> > I see two options:
> > a) Right now have an external API that only works with rte_malloc()
> >    and internal API with autodetection. Fix DPDK and drop internal API.
> > b) Have external API with autodetection. Fix DPDK.
> >    At the next ABI breakage drop autodetection and libc-malloc.
> >
> > > David, Thomas, Dmitry, please add if I missed anything.
> > >
> > > Can we please conclude on this series' APIs, as the API freeze deadline
> > > (rc1) is very near.
> >
> > I support v3 design with no options and autodetection, because that's the
> > interface we want in the end.
> > Implementation can be improved later.
>
> Hi All,
>
> I came across 2 issues introduced with the auto detection mechanism.
> 1. In the primary/secondary model, the primary application is started and makes lots of allocations via
> rte_malloc*.
>
> Secondary side:
> a. Secondary starts; in its "rte_eal_init()" it makes some allocations via rte_*, and in one of the allocations a
> request for heap expand is made because the current memseg got exhausted. (malloc_heap_alloc_on_heap_id()->
> alloc_more_mem_on_socket()->try_expand_heap())
> b. A request for heap expand is sent to the primary. Please note that the secondary holds the spinlock while making
> the request. (malloc_heap_alloc_on_heap_id()->rte_spinlock_lock(&(heap->lock));)
>
> Primary side:
> a. Primary receives the request, installs a new hugepage and sets up the heap (handle_alloc_request()).
> b. To inform all the secondaries about the new memseg, primary sends a sync notice, for which it sets up an
> alarm (rte_mp_request_async()->mp_request_async()).
> c. Inside the alarm setup API, we register an interrupt callback.
> d. Inside rte_intr_callback_register(), a new interrupt instance allocation is requested for "src->intr_handle".
> e. Since memory management is detected as up, "rte_intr_instance_alloc()" calls "rte_zmalloc" to
> allocate memory, and further inside "malloc_heap_alloc_on_heap_id()" the primary deadlocks
> while taking the spinlock, because this spinlock is already held by the secondary.
>
>
> 2. "eal_flags_file_prefix_autotest" is failing because the processes spawned by this test are expected to clean up
> their hugepage traces from the respective directories (e.g. /dev/hugepage).
> a. Inside eal_cleanup, rte_free()->malloc_heap_free(), the element to be freed is added to the free list and
> checked to see if nearby elements can be joined together to form a big free chunk (malloc_elem_free()).
> b. If this free chunk is big enough to cover a hugepage, the respective hugepage can be uninstalled after making
> sure no allocation from this hugepage exists. (malloc_heap_free()->malloc_heap_free_pages()->eal_memalloc_free_seg())
>
> But because the interrupt allocations made for PCI intr handles (used for VFIO) and other driver specific interrupt
> handles are not cleaned up in "rte_eal_cleanup()", these hugepage files are not removed and the test fails.

Sad to hear. But it's a great and thorough analysis.

> There could be more such issues, I think we should firstly fix the DPDK.
> 1. Memory management should be made independent and should be the first thing to come up in rte_eal_init()

As I have explained, buses must be able to report their IOVA requirements
at this point (`get_iommu_class()` bus method).
Either `scan()` must complete before that
or `get_iommu_class()` must be able to work before `scan()` is called.

> 2. rte_eal_cleanup() should be exactly opposite to rte_eal_init(); just like bus_probe, we should have bus_remove
> to clean up all the memory allocations.

Yes. For most buses it will be just "unplug each device".
In fact, EAL could do it with `unplug()`, but it is not mandatory.

>
> Regarding this IRQ series, I would like to fall back to our original design, i.e. rte_intr_instance_alloc() should take
> an argument whether its memory should be allocated using glibc malloc or rte_malloc*.

Seems there's no other option to make it on time.

> Decision for allocation
> (malloc or rte_malloc) can be made based on whether the interrupt handle is shared in the existing code?
> E.g.:
> a. In case of alarm, intr_handle was a global entry and not confined to any structure, so this can be allocated from
> normal malloc.
> b. PCI device had a static entry for intr_handle inside "struct rte_pci_device", and memory for struct rte_pci_device is
> via normal malloc, so its intr_handle can also be malloc'ed.
> c. Some driver with intr_handle inside its priv structure, and this priv structure gets allocated via rte_malloc, so
> intr_handle can also be rte_malloc'ed.
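
For illustration only, a minimal sketch of what the flag-based allocation
could look like, following the rule in (a)-(c) above. All names here are
hypothetical and are not taken from the series: `intr_instance_alloc()`,
`intr_instance_free()` and the `INTR_INSTANCE_F_*` flags are made up, and
`shared_zmalloc()`/`shared_free()` are plain libc stand-ins for
rte_zmalloc()/rte_free() so that the sketch builds without DPDK.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag values; the real series may name them differently. */
#define INTR_INSTANCE_F_PRIVATE 0u         /* libc heap, this process only */
#define INTR_INSTANCE_F_SHARED  (1u << 0)  /* shared heap, visible to secondaries */

/* Simplified stand-in for struct rte_intr_handle. */
struct intr_handle {
	int fd;
	uint32_t alloc_flags; /* remembered so that free uses the matching allocator */
};

/* Stand-ins for rte_zmalloc()/rte_free() (assumption: no DPDK available). */
static void *shared_zmalloc(size_t size) { return calloc(1, size); }
static void shared_free(void *ptr) { free(ptr); }

/* The caller, not the allocator, decides where the handle lives. */
struct intr_handle *intr_instance_alloc(uint32_t flags)
{
	struct intr_handle *handle;

	if (flags & ~INTR_INSTANCE_F_SHARED)
		return NULL; /* reject unknown flags */

	if (flags & INTR_INSTANCE_F_SHARED)
		handle = shared_zmalloc(sizeof(*handle));
	else
		handle = calloc(1, sizeof(*handle));
	if (handle == NULL)
		return NULL;

	handle->fd = -1;
	handle->alloc_flags = flags;
	return handle;
}

void intr_instance_free(struct intr_handle *handle)
{
	if (handle == NULL)
		return;
	if (handle->alloc_flags & INTR_INSTANCE_F_SHARED)
		shared_free(handle);
	else
		free(handle);
}
```

With something like this, the alarm code and the PCI bus would pass the
private flag, and a driver keeping the handle in an rte_malloc'ed priv
structure would pass the shared flag. Storing the flags in the handle is
one possible way to keep allocation and free paired.
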
>
> Later, once DPDK is fixed up, this argument can be removed and all allocations can go through the rte_malloc family
> without any auto detection.
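
To make the deadlock from issue 1 easier to reason about, here is a
single-process model of it. It is only a sketch under simplifying
assumptions: `heap_lock` stands in for the heap spinlock (heap->lock),
the function names are hypothetical, the IPC between secondary and
primary is reduced to a direct function call, and a try-lock is used so
that the model reports the problem instead of spinning forever as a real
spinlock would.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Models heap->lock: a non-recursive lock shared by both processes. */
static atomic_bool heap_lock;

static bool heap_trylock(void)
{
	/* Succeeds only if the lock was free before the exchange. */
	return !atomic_exchange(&heap_lock, true);
}

static void heap_unlock(void)
{
	assert(atomic_load(&heap_lock));
	atomic_store(&heap_lock, false);
}

/* Models an allocation from the shared heap (the rte_zmalloc() path). */
static void *shared_heap_alloc(size_t size)
{
	void *ptr;

	if (!heap_trylock())
		return NULL; /* a real spinlock would spin here forever */
	ptr = calloc(1, size);
	heap_unlock();
	return ptr;
}

/*
 * Models the primary serving the expand request: setting up the alarm
 * registers an interrupt callback, which allocates an interrupt instance.
 */
static bool primary_handle_expand_request(bool instance_on_shared_heap)
{
	void *instance = instance_on_shared_heap ?
		shared_heap_alloc(64) : calloc(1, 64);
	bool ok = (instance != NULL);

	free(instance);
	return ok;
}

/* Models the secondary: it holds the heap lock while the request is served. */
bool secondary_expand_heap(bool instance_on_shared_heap)
{
	bool ok;

	if (!heap_trylock())
		return false;
	ok = primary_handle_expand_request(instance_on_shared_heap);
	heap_unlock();
	return ok;
}
```

In this model, `secondary_expand_heap(true)` fails because the nested
shared-heap allocation cannot take the lock that the requester still
holds, while `secondary_expand_heap(false)` succeeds because the instance
comes from the libc heap and never touches that lock.
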