From mboxrd@z Thu Jan 1 00:00:00 1970
From: Morten Brørup
To: "Bruce Richardson", "fengchengwen"
Cc: "Pai G, Sunil", "Ilya Maximets", "Radha Mohan Chintakuntla",
 "Veerasenareddy Burru", "Gagandeep Singh", "Nipun Gupta", "Stokes, Ian",
 "Hu, Jiayu", "Ferriter, Cian", "Van Haaren, Harry", "Mcnamara, John",
 "O'Driscoll, Tim", "Finn, Emma"
Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
Date: Mon, 16 May 2022 11:04:31 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D87075@smartserver.smartshare.dk>
References: <0633e31c-68fc-618c-e4f8-78a74662078c@ovn.org>
 <67043e2a-c420-7e7e-0c55-7303c6e506bc@huawei.com>

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 13 May 2022 12.34
>
> On Fri, May 13, 2022 at 05:48:35PM +0800, fengchengwen wrote:
> > On 2022/5/13 17:10, Bruce Richardson wrote:
> > > On Fri, May 13, 2022 at 04:52:10PM +0800, fengchengwen wrote:
> > >> On 2022/4/8 14:29, Pai G, Sunil wrote:
> > >>>> -----Original Message-----
> > >>>> From: Richardson, Bruce
> > >>>> Sent: Tuesday, April 5, 2022 5:38 PM
> > >>>> To: Ilya Maximets; Chengwen Feng; Radha Mohan Chintakuntla;
> > >>>> Veerasenareddy Burru; Gagandeep Singh; Nipun Gupta
> > >>>> Cc: Pai G, Sunil; Stokes, Ian; Hu, Jiayu; Ferriter, Cian;
> > >>>> Van Haaren, Harry; Maxime Coquelin (maxime.coquelin@redhat.com);
> > >>>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John;
> > >>>> O'Driscoll, Tim; Finn, Emma
> > >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > >>>>
> > >>>> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
> > >>>>> On 3/30/22 16:09, Bruce Richardson wrote:
> > >>>>>> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
> > >>>>>>> On 3/30/22 13:12, Bruce Richardson wrote:
> > >>>>>>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
> > >>>>>>>>> On 3/30/22 12:41, Ilya Maximets wrote:
> > >>>>>>>>>> Forking the thread to discuss a memory consistency/ordering model.
> > >>>>>>>>>>
> > >>>>>>>>>> AFAICT, dmadev can be anything from part of a CPU to a
> > >>>>>>>>>> completely separate PCI device. However, I don't see any memory
> > >>>>>>>>>> ordering being enforced or even described in the dmadev API or
> > >>>>>>>>>> documentation. Please, point me to the correct documentation,
> > >>>>>>>>>> if I somehow missed it.
> > >>>>>>>>>>
> > >>>>>>>>>> We have a DMA device (A) and a CPU core (B) writing respectively
> > >>>>>>>>>> the data and the descriptor info. CPU core (C) is reading the
> > >>>>>>>>>> descriptor and the data it points to.
> > >>>>>>>>>>
> > >>>>>>>>>> A few things about that process:
> > >>>>>>>>>>
> > >>>>>>>>>> 1. There is no memory barrier between writes A and B (Did I miss
> > >>>>>>>>>>    them?). Meaning that those operations can be seen by C in a
> > >>>>>>>>>>    different order regardless of barriers issued by C and
> > >>>>>>>>>>    regardless of the nature of devices A and B.
> > >>>>>>>>>>
> > >>>>>>>>>> 2. Even if there is a write barrier between A and B, there is
> > >>>>>>>>>>    no guarantee that C will see these writes in the same order,
> > >>>>>>>>>>    as C doesn't use real memory barriers because vhost advertises
> > >>>>>>>>>
> > >>>>>>>>> s/advertises/does not advertise/
> > >>>>>>>>>
> > >>>>>>>>>> VIRTIO_F_ORDER_PLATFORM.
> > >>>>>>>>>>
> > >>>>>>>>>> So, I'm getting to the conclusion that there is a missing write
> > >>>>>>>>>> barrier on the vhost side and vhost itself must not advertise the
> > >>>>>>>>>
> > >>>>>>>>> s/must not/must/
> > >>>>>>>>>
> > >>>>>>>>> Sorry, I wrote things backwards. :)
> > >>>>>>>>>
> > >>>>>>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual
> > >>>>>>>>>> memory barriers.
> > >>>>>>>>>>
> > >>>>>>>>>> Would like to hear some thoughts on that topic. Is it a real
> > >>>>>>>>>> issue? Is it an issue considering all possible CPU architectures
> > >>>>>>>>>> and DMA HW variants?
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> In terms of ordering of operations using dmadev:
> > >>>>>>>>
> > >>>>>>>> * Some DMA HW will perform all operations strictly in order, e.g.
> > >>>>>>>>   Intel IOAT, while other hardware may not guarantee the order of
> > >>>>>>>>   operations / may do things in parallel, e.g. Intel DSA. Therefore
> > >>>>>>>>   the dmadev API provides the fence operation, which allows the
> > >>>>>>>>   order to be enforced. The fence can be thought of as a full memory
> > >>>>>>>>   barrier, meaning no jobs after the barrier can be started until
> > >>>>>>>>   all those before it have completed. Obviously, for HW where order
> > >>>>>>>>   is always enforced, this will be a no-op, but for hardware that
> > >>>>>>>>   parallelizes, we want to reduce the fences to get best performance.
> > >>>>>>>>
> > >>>>>>>> * For synchronization between DMA devices and CPUs, where a CPU can
> > >>>>>>>>   only write after a DMA copy has been done, the CPU must wait for
> > >>>>>>>>   the DMA completion to guarantee ordering. Once the completion has
> > >>>>>>>>   been returned, the completed operation is globally visible to all
> > >>>>>>>>   cores.
> > >>>>>>>
> > >>>>>>> Thanks for the explanation! Some questions though:
> > >>>>>>>
> > >>>>>>> In our case one CPU waits for completion and another CPU is
> > >>>>>>> actually using the data. IOW, "CPU must wait" is a bit ambiguous.
> > >>>>>>> Which CPU must wait?
> > >>>>>>>
> > >>>>>>> Or should it be "Once the completion is visible on any core, the
> > >>>>>>> completed operation is globally visible to all cores." ?
> > >>>>>>>
> > >>>>>>
> > >>>>>> The latter.
> > >>>>>> Once the change to memory/cache is visible to any core, it is
> > >>>>>> visible to all of them.
> > >>>>>> This applies to regular CPU memory writes too - at least on IA, and
> > >>>>>> I expect on many other architectures - once the write is visible
> > >>>>>> outside the current core it is visible to every other core. Once the
> > >>>>>> data hits the L1 or L2 cache of any core, any subsequent requests for
> > >>>>>> that data from any other core will "snoop" the latest data from that
> > >>>>>> core's cache, even if it has not made its way down to a shared cache,
> > >>>>>> e.g. L3 on most IA systems.
> > >>>>>
> > >>>>> It sounds like you're referring to the "multicopy atomicity" of the
> > >>>>> architecture. However, that is not a universally supported thing.
> > >>>>> AFAICT, POWER and older ARM systems don't support it, so writes
> > >>>>> performed by one core are not necessarily available to all other cores
> > >>>>> at the same time. That means that if CPU0 writes the data and the
> > >>>>> completion flag, CPU1 reads the completion flag and writes the ring,
> > >>>>> CPU2 may see the ring write, but may still not see the write of the
> > >>>>> data, even though there was a control dependency on CPU1.
> > >>>>> There should be a full memory barrier on CPU1 in order to fulfill the
> > >>>>> memory ordering requirements for CPU2, IIUC.
> > >>>>>
> > >>>>> In our scenario CPU0 is a DMA device, which may or may not be part
> > >>>>> of a CPU and may have different memory consistency/ordering
> > >>>>> requirements. So, the question is: does the DPDK DMA API guarantee
> > >>>>> multicopy atomicity between the DMA device and all CPU cores,
> > >>>>> regardless of CPU architecture and the nature of the DMA device?
> > >>>>>
> > >>>>
> > >>>> Right now, it doesn't, because this never came up in discussion. In
> > >>>> order to be useful, it sounds like it explicitly should do so. At least
> > >>>> for the Intel ioat and idxd driver cases, this will be supported, so we
> > >>>> just need to ensure all other drivers currently upstreamed can offer
> > >>>> this too. If they cannot, we cannot offer it as a global guarantee, and
> > >>>> we should see about adding a capability flag for this to indicate when
> > >>>> the guarantee is there or not.
> > >>>>
> > >>>> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to
> > >>>> document for dmadev that once a DMA operation is completed, the op is
> > >>>> guaranteed visible to all cores/threads? If not, any thoughts on what
> > >>>> guarantees we can provide in this regard, or what capabilities should
> > >>>> be exposed?
> > >>>
> > >>>
> > >>> Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru,
> > >>> @Gagandeep Singh, @Nipun Gupta,
> > >>> Requesting your valuable opinions on the queries in this thread.
> > >>
> > >> Sorry for the late reply; I didn't follow this thread.
> > >>
> > >> I don't think the DMA API should provide such a guarantee, because:
> > >> 1. DMA is an acceleration device, the same as an encryption/decryption
> > >>    device or a network device.
> > >> 2. For the Hisilicon Kunpeng platform:
> > >>    The DMA device supports:
> > >>    a) IO coherency: which means it can read the latest data, which may
> > >>       still be in a cache, and will invalidate the cache's data and
> > >>       write the data to DDR on writes.
> > >>    b) Order within one request: which means it only writes the completion
> > >>       descriptor after the copy is done.
> > >>    Note: order between multiple requests can be enforced through the
> > >>    fence mechanism.
> > >>    The DMA driver only needs to:
> > >>    a) Add one write memory barrier (a lightweight mb) when ringing the
> > >>       doorbell.
> > >>    So once the DMA is completed, the operation is guaranteed visible to
> > >>    all cores, and a 3rd core will observe the right order: core B
> > >>    prepares the data and issues the request to the DMA, the DMA starts
> > >>    work, core B gets the completion status.
> > >> 3. I worked on a TI multi-core SoC many years ago; the SoC didn't support
> > >>    cache coherence and consistency between cores. The SoC also had a DMA
> > >>    device with many channels. Here is a hypothetical design of a DMA
> > >>    driver for it within the DPDK DMA framework:
> > >>    The DMA driver should:
> > >>    a) write back the DMA's src buffer, so that no dirty cache data
> > >>       remains while the DMA is running.
> > >>    b) invalidate the DMA's dst buffer
> > >>    c) do a full mb
> > >>    d) update the DMA's registers.
> > >>    The DMA will then execute the copy task; it copies from DDR and writes
> > >>    to DDR, and after the copy it sets its status register to completed.
> > >>    In this case, the 3rd core will also observe the right order.
> > >>    A particular point of this is: if one buffer is shared between
> > >>    multiple cores, the application should explicitly maintain the cache.
> > >>
> > >> Based on the above, I don't think the DMA API should explicitly add this
> > >> to its description; it's the driver's, and even the application's (e.g.
> > >> on the above TI SoC), duty to make sure of it.
> > >>
> > > Hi,
> > >
> > > thanks for that. So if I understand correctly, your current HW does
> > > provide this guarantee, but you don't think it should always be the case
> > > for dmadev, correct?
> >
> > Yes, our HW will provide the guarantee.
> > If some HW cannot provide it, it's the driver's and maybe the
> > application's duty to provide it.
> >
> > >
> > > Based on that, what do you think should be the guarantee on completion?
> > > Once a job is completed, is the completion visible to the submitting
> > > core, or to the core reading the completion? Do you think it's acceptable
> > > to add a
> >
> > It will be visible to both cores.
> >
> > > capability flag for drivers to indicate that they do support a "globally
> > > visible" guarantee?
> >
> > I think the driver (together with the HW) should support the "globally
> > visible" guarantee.
> > And for some HW, even the application (or middleware) should care about it.
> >
>
> From a dmadev API viewpoint, whether the driver handles it or the HW
> itself, does not matter. However, if the application needs to take special
> actions to guarantee visibility, then that needs to be flagged as part of
> the dmadev API.
>
> I see three possibilities:
> 1. Wait until we have a driver that does not have global visibility on
>    return from rte_dma_completed, and at that point add a flag indicating
>    the lack of that support. Until then, document that results of ops will
>    be globally visible.
> 2. Add a flag now to allow drivers to indicate *lack* of global visibility,
>    and document that results are visible unless the flag is set.
> 3. Add a flag now to allow drivers to call out that all results are g.v.,
>    and update drivers to use this flag.
>
> I would be very much in favour of #1, because:
> * YAGNI principle - (subject to confirmation by other maintainers) if we
>   don't have a driver right now that needs non-g.v. behaviour, we may never
>   need one.
> * In the absence of a concrete case where g.v. is not guaranteed, we may
>   struggle to document correctly what the actual guarantees are, especially
>   if the submitter core and the completer core are different.

A big +1 to that!
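To make the guarantee under discussion concrete, here is a rough sketch of the
enqueue/fence/complete pattern, using the dmadev calls mentioned above
(rte_dma_copy, rte_dma_submit, rte_dma_completed and the fence flag). The
shared done_flag and all other names are only illustrative, and error handling
is omitted; this is not meant as a definitive implementation.

#include <stdbool.h>
#include <rte_dmadev.h>

/* Core B: enqueue two copies, where the descriptor copy must not start
 * before the payload copy has finished, then hand the result to core C.
 * Return values of the enqueue calls are ignored for brevity. */
static void
copy_and_publish(int16_t dev_id, uint16_t vchan,
		rte_iova_t data_src, rte_iova_t data_dst, uint32_t data_len,
		rte_iova_t desc_src, rte_iova_t desc_dst, uint32_t desc_len,
		volatile bool *done_flag)
{
	uint16_t last_idx, n = 0;
	bool has_error = false;

	/* Payload first... */
	rte_dma_copy(dev_id, vchan, data_src, data_dst, data_len, 0);
	/* ...then the descriptor, fenced so that HW which parallelizes
	 * (e.g. DSA) does not write the descriptor before the payload. */
	rte_dma_copy(dev_id, vchan, desc_src, desc_dst, desc_len,
			RTE_DMA_OP_FLAG_FENCE);
	rte_dma_submit(dev_id, vchan);

	/* Poll until both jobs are reported complete. */
	while (n < 2)
		n += rte_dma_completed(dev_id, vchan, 2 - n, &last_idx,
				&has_error);

	/* Hand off to core C. The open question in this thread is whether
	 * this plain store is enough, or whether an extra barrier is needed
	 * here on some devices/architectures before core C may safely read
	 * the copied data. */
	*done_flag = true;
}

Whether the final store to done_flag needs a preceding barrier on some
hardware is exactly what option #1/#2/#3 would pin down in the documentation.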
Perhaps the documentation can reflect that global visibility is provided by
current DMA hardware, and that if some future DMA hardware does not provide
it, the API will be changed (in some unspecified manner) to reflect this. My
point is: we should avoid a situation where the API stability policy makes it
impossible to add some future DMA hardware without global visibility.
Requiring applications to handle such future DMA hardware differently is
perfectly fine; the API (and its documentation) should just be open to it.

>
> @Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh,
> @Nipun Gupta,
> As driver maintainers, can you please confirm whether, on receipt of a
> completion from the HW/driver, the operation results are visible on all
> application cores, i.e. the app does not need additional barriers to
> propagate visibility to other cores? Your opinions on this discussion
> would also be useful.
>
> Regards,
> /Bruce
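PS: If we ever do end up with option #2 or #3, I imagine the application-side
check could look roughly like the sketch below. To be clear,
RTE_DMA_CAPA_GLOBAL_VISIBILITY is a purely hypothetical capability bit
invented here for illustration, and the barrier is only a placeholder; nothing
like this exists in the dmadev API today.

#include <stdbool.h>
#include <rte_dmadev.h>
#include <rte_atomic.h>

/* Hypothetical capability bit - NOT part of today's dmadev API. */
#define RTE_DMA_CAPA_GLOBAL_VISIBILITY (1ULL << 63)

/* Returns true if results reported by rte_dma_completed() may be assumed
 * visible to all cores without extra application-level barriers. */
static bool
dma_results_globally_visible(int16_t dev_id)
{
	struct rte_dma_info info;

	if (rte_dma_info_get(dev_id, &info) != 0)
		return false; /* be conservative on error */
	return (info.dev_capa & RTE_DMA_CAPA_GLOBAL_VISIBILITY) != 0;
}

/* At the hand-off point, e.g. before publishing a completion to another
 * core: */
static void
publish_result(int16_t dev_id, volatile bool *done_flag)
{
	if (!dma_results_globally_visible(dev_id))
		rte_smp_wmb(); /* placeholder; the exact barrier needed would
				* depend on what the driver documents */
	*done_flag = true;
}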