From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
From: fengchengwen
To: Bruce Richardson
CC: "Pai G, Sunil", Ilya Maximets, Radha Mohan Chintakuntla, Veerasenareddy Burru, Gagandeep Singh, Nipun Gupta, "Stokes, Ian", "Hu, Jiayu", "Ferriter, Cian", "Van Haaren, Harry", "Maxime Coquelin (maxime.coquelin@redhat.com)", "ovs-dev@openvswitch.org", "dev@dpdk.org", "Mcnamara, John", "O'Driscoll, Tim", "Finn, Emma"
Date: Fri, 13 May 2022 17:48:35 +0800
References: <22e3ff73-f3d9-abae-1866-90d133af5528@ovn.org> <0633e31c-68fc-618c-e4f8-78a74662078c@ovn.org> <67043e2a-c420-7e7e-0c55-7303c6e506bc@huawei.com>
List-Id: DPDK patches and discussions (dev@dpdk.org)
On 2022/5/13 17:10, Bruce Richardson wrote:
> On Fri, May 13, 2022 at 04:52:10PM +0800, fengchengwen wrote:
>> On 2022/4/8 14:29, Pai G, Sunil wrote:
>>>> -----Original Message-----
>>>> From: Richardson, Bruce
>>>> Sent: Tuesday, April 5, 2022 5:38 PM
>>>> To: Ilya Maximets ; Chengwen Feng ; Radha Mohan Chintakuntla ; Veerasenareddy Burru ; Gagandeep Singh ; Nipun Gupta
>>>> Cc: Pai G, Sunil ; Stokes, Ian ; Hu, Jiayu ; Ferriter, Cian ; Van Haaren, Harry ; Maxime Coquelin (maxime.coquelin@redhat.com) ; ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John ; O'Driscoll, Tim ; Finn, Emma
>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>>
>>>> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
>>>>> On 3/30/22 16:09, Bruce Richardson wrote:
>>>>>> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
>>>>>>> On 3/30/22 13:12, Bruce Richardson wrote:
>>>>>>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
>>>>>>>>> On 3/30/22 12:41, Ilya Maximets wrote:
>>>>>>>>>> Forking the thread to discuss a memory consistency/ordering model.
>>>>>>>>>>
>>>>>>>>>> AFAICT, dmadev can be anything from part of a CPU to a completely separate PCI device. However, I don't see any memory ordering being enforced or even described in the dmadev API or documentation. Please, point me to the correct documentation, if I somehow missed it.
>>>>>>>>>>
>>>>>>>>>> We have a DMA device (A) and a CPU core (B) writing respectively the data and the descriptor info. CPU core (C) is reading the descriptor and the data it points to.
>>>>>>>>>>
>>>>>>>>>> A few things about that process:
>>>>>>>>>>
>>>>>>>>>> 1.
>>>>>>>>>> There is no memory barrier between writes A and B (did I miss them?), meaning that those operations can be seen by C in a different order regardless of barriers issued by C and regardless of the nature of devices A and B.
>>>>>>>>>>
>>>>>>>>>> 2. Even if there is a write barrier between A and B, there is no guarantee that C will see these writes in the same order, as C doesn't use real memory barriers because vhost advertises
>>>>>>>>>
>>>>>>>>> s/advertises/does not advertise/
>>>>>>>>>
>>>>>>>>>> VIRTIO_F_ORDER_PLATFORM.
>>>>>>>>>>
>>>>>>>>>> So, I'm getting to the conclusion that there is a missing write barrier on the vhost side and vhost itself must not advertise the
>>>>>>>>>
>>>>>>>>> s/must not/must/
>>>>>>>>>
>>>>>>>>> Sorry, I wrote things backwards. :)
>>>>>>>>>
>>>>>>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual memory barriers.
>>>>>>>>>>
>>>>>>>>>> Would like to hear some thoughts on that topic. Is it a real issue? Is it an issue considering all possible CPU architectures and DMA HW variants?
>>>>>>>>>>
>>>>>>>>
>>>>>>>> In terms of ordering of operations using dmadev:
>>>>>>>>
>>>>>>>> * Some DMA HW will perform all operations strictly in order, e.g. Intel IOAT, while other hardware may not guarantee order of operations/may do things in parallel, e.g. Intel DSA. Therefore the dmadev API provides the fence operation, which allows the order to be enforced. The fence can be thought of as a full memory barrier, meaning no jobs after the barrier can be started until all those before it have completed. Obviously, for HW where order is always enforced, this will be a no-op, but for hardware that parallelizes, we want to reduce the fences to get the best performance.
>>>>>>>>
>>>>>>>> * For synchronization between DMA devices and CPUs, where a CPU can only write after a DMA copy has been done, the CPU must wait for the DMA completion to guarantee ordering. Once the completion has been returned, the completed operation is globally visible to all cores.
>>>>>>>
>>>>>>> Thanks for the explanation! Some questions though:
>>>>>>>
>>>>>>> In our case one CPU waits for completion and another CPU is actually using the data. IOW, "CPU must wait" is a bit ambiguous. Which CPU must wait?
>>>>>>>
>>>>>>> Or should it be "Once the completion is visible on any core, the completed operation is globally visible to all cores."?
>>>>>>>
>>>>>>
>>>>>> The latter. Once the change to memory/cache is visible to any core, it is visible to all of them. This applies to regular CPU memory writes too - at least on IA, and I expect on many other architectures - once the write is visible outside the current core, it is visible to every other core. Once the data hits the L1 or L2 cache of any core, any subsequent request for that data from any other core will "snoop" the latest data from that core's cache, even if it has not made its way down to a shared cache, e.g. L3 on most IA systems.
>>>>>
>>>>> It sounds like you're referring to the "multicopy atomicity" of the architecture. However, that is not a universally supported thing. AFAICT, POWER and older ARM systems don't support it, so writes performed by one core are not necessarily available to all other cores at the same time. That means that if CPU0 writes the data and the completion flag, CPU1 reads the completion flag and writes the ring, CPU2 may see the ring write but may still not see the write of the data, even though there was a control dependency on CPU1.
>>>>> There should be a full memory barrier on CPU1 in order to fulfill the memory ordering requirements for CPU2, IIUC.
>>>>>
>>>>> In our scenario CPU0 is a DMA device, which may or may not be part of a CPU and may have different memory consistency/ordering requirements. So, the question is: does the DPDK DMA API guarantee multicopy atomicity between the DMA device and all CPU cores regardless of CPU architecture and the nature of the DMA device?
>>>>>
>>>>
>>>> Right now, it doesn't, because this never came up in discussion. In order to be useful, it sounds like it explicitly should do so. At least for the Intel ioat and idxd driver cases this will be supported, so we just need to ensure all other drivers currently upstreamed can offer this too. If they cannot, we cannot offer it as a global guarantee, and we should see about adding a capability flag to indicate when the guarantee is there or not.
>>>>
>>>> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to document for dmadev that once a DMA operation is completed, the op is guaranteed visible to all cores/threads? If not, any thoughts on what guarantees we can provide in this regard, or what capabilities should be exposed?
>>>
>>> Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh, @Nipun Gupta,
>>> Requesting your valuable opinions on the queries in this thread.
>>
>> Sorry for the late reply; I didn't follow this thread.
>>
>> I don't think the DMA API should provide such a guarantee, because:
>> 1. DMA is an acceleration device, just like an encryption/decryption device or a network device.
>> 2. On the Hisilicon Kunpeng platform, the DMA device supports:
>>    a) IO coherency: it can read the latest data, which may still sit in a CPU cache, and on writes it will invalidate the cached data and write the data to DDR.
>>    b) Ordering within one request: the completion descriptor is only written after the copy is done.
>>       Note: ordering between multiple requests can be enforced through the fence mechanism.
>>    The DMA driver then only needs to:
>>    a) Add one write memory barrier (a lightweight one) when ringing the doorbell.
>>    So once the DMA is completed, the operation is guaranteed visible to all cores, and a third core will observe the right order: core B prepares the data and issues the request to the DMA, the DMA does its work, core B gets the completion status.
>> 3. I worked on a TI multi-core SoC many years ago; that SoC supported neither cache coherence nor consistency between cores. It also had a DMA device with many channels. As a hypothetical design of a DMA driver for it within the DPDK DMA framework, the driver should:
>>    a) write back the DMA's src buffer, so that no stale data sits in the cache while the DMA is running.
>>    b) invalidate the DMA's dst buffer.
>>    c) execute a full memory barrier.
>>    d) update the DMA's registers.
>>    The DMA will then execute the copy task - it copies from DDR and writes to DDR - and after the copy it sets its status register to completed. In this case, the third core will also observe the right order.
>>    A particular point here: if one buffer is shared among multiple cores, the application must maintain the cache explicitly.
>>
>> Based on the above, I don't think the DMA API should explicitly document such a guarantee; it is the driver's, and sometimes even the application's (e.g. on the TI SoC above), duty to ensure it.
>>
> Hi,
>
> thanks for that. So if I understand correctly, your current HW does provide this guarantee, but you don't think it should always be the case for dmadev, correct?

Yes, our HW provides the guarantee. If some HW cannot provide it, it is the driver's, and maybe the application's, duty to provide it.

> Based on that, what do you think should be the guarantee on completion? Once a job is completed, the completion is visible to the submitting core, or the core reading the completion?
> Do you think it's acceptable to add a capability flag for drivers to indicate that they do support a "globally visible" guarantee?

Both cores will see it (the completion is visible to the submitting core and to the reading core).

I think the driver (together with the HW) should support the "globally visible" guarantee. And for some HW, even the application (or middleware) should care about it.

>
> Thanks,
> /Bruce