From mboxrd@z Thu Jan 1 00:00:00 1970
From: Morten Brørup
To: "Bruce Richardson", "fengchengwen"
Cc: "Pai G, Sunil", "Ilya Maximets", "Radha Mohan Chintakuntla",
 "Veerasenareddy Burru", "Gagandeep Singh", "Nipun Gupta", "Stokes, Ian",
 "Hu, Jiayu", "Ferriter, Cian", "Van Haaren, Harry", "Mcnamara, John",
 "O'Driscoll, Tim", "Finn, Emma"
Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
Date: Mon, 16 May 2022 11:04:31 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D87075@smartserver.smartshare.dk>
References: <0633e31c-68fc-618c-e4f8-78a74662078c@ovn.org>
 <67043e2a-c420-7e7e-0c55-7303c6e506bc@huawei.com>

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 13 May 2022 12.34
>
> On Fri, May 13, 2022 at 05:48:35PM +0800, fengchengwen wrote:
> > On 2022/5/13 17:10, Bruce Richardson wrote:
> > > On Fri, May 13, 2022 at 04:52:10PM +0800, fengchengwen wrote:
> > >> On 2022/4/8 14:29, Pai G, Sunil wrote:
> > >>>> -----Original Message-----
> > >>>> From: Richardson, Bruce
> > >>>> Sent: Tuesday, April 5, 2022 5:38 PM
> > >>>> To: Ilya Maximets; Chengwen Feng; Radha Mohan Chintakuntla;
> > >>>> Veerasenareddy Burru; Gagandeep Singh; Nipun Gupta
> > >>>> Cc: Pai G, Sunil; Stokes, Ian; Hu, Jiayu; Ferriter, Cian;
> > >>>> Van Haaren, Harry; Maxime Coquelin (maxime.coquelin@redhat.com);
> > >>>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John;
> > >>>> O'Driscoll, Tim; Finn, Emma
> > >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > >>>>
> > >>>> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
> > >>>>> On 3/30/22 16:09, Bruce Richardson wrote:
> > >>>>>> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
> > >>>>>>> On 3/30/22 13:12, Bruce Richardson wrote:
> > >>>>>>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
> > >>>>>>>>> On 3/30/22 12:41, Ilya Maximets wrote:
> > >>>>>>>>>> Forking the thread to discuss a memory consistency/ordering model.
> > >>>>>>>>>>
> > >>>>>>>>>> AFAICT, dmadev can be anything from part of a CPU to a
> > >>>>>>>>>> completely separate PCI device. However, I don't see any memory
> > >>>>>>>>>> ordering being enforced or even described in the dmadev API or
> > >>>>>>>>>> documentation. Please, point me to the correct documentation,
> > >>>>>>>>>> if I somehow missed it.
> > >>>>>>>>>>
> > >>>>>>>>>> We have a DMA device (A) and a CPU core (B) writing respectively
> > >>>>>>>>>> the data and the descriptor info. CPU core (C) is reading the
> > >>>>>>>>>> descriptor and the data it points to.
> > >>>>>>>>>>
> > >>>>>>>>>> A few things about that process:
> > >>>>>>>>>>
> > >>>>>>>>>> 1. There is no memory barrier between writes A and B (Did I miss
> > >>>>>>>>>>    them?). Meaning that those operations can be seen by C in a
> > >>>>>>>>>>    different order regardless of barriers issued by C and
> > >>>>>>>>>>    regardless of the nature of devices A and B.
> > >>>>>>>>>>
> > >>>>>>>>>> 2. Even if there is a write barrier between A and B, there is
> > >>>>>>>>>>    no guarantee that C will see these writes in the same order,
> > >>>>>>>>>>    as C doesn't use real memory barriers because vhost advertises
> > >>>>>>>>>
> > >>>>>>>>> s/advertises/does not advertise/
> > >>>>>>>>>
> > >>>>>>>>>> VIRTIO_F_ORDER_PLATFORM.
> > >>>>>>>>>>
> > >>>>>>>>>> So, I'm getting to the conclusion that there is a missing write
> > >>>>>>>>>> barrier on the vhost side and vhost itself must not advertise the
> > >>>>>>>>>
> > >>>>>>>>> s/must not/must/
> > >>>>>>>>>
> > >>>>>>>>> Sorry, I wrote things backwards. :)
> > >>>>>>>>>
> > >>>>>>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual
> > >>>>>>>>>> memory barriers.
> > >>>>>>>>>>
> > >>>>>>>>>> Would like to hear some thoughts on that topic. Is it a real
> > >>>>>>>>>> issue? Is it an issue considering all possible CPU architectures
> > >>>>>>>>>> and DMA HW variants?
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> In terms of ordering of operations using dmadev:
> > >>>>>>>>
> > >>>>>>>> * Some DMA HW will perform all operations strictly in order, e.g.
> > >>>>>>>>   Intel IOAT, while other hardware may not guarantee the order of
> > >>>>>>>>   operations / may do things in parallel, e.g. Intel DSA. Therefore
> > >>>>>>>>   the dmadev API provides the fence operation, which allows the
> > >>>>>>>>   order to be enforced. The fence can be thought of as a full memory
> > >>>>>>>>   barrier, meaning no jobs after the barrier can be started until
> > >>>>>>>>   all those before it have completed. Obviously, for HW where order
> > >>>>>>>>   is always enforced, this will be a no-op, but for hardware that
> > >>>>>>>>   parallelizes, we want to reduce the fences to get best performance.
> > >>>>>>>>
> > >>>>>>>> * For synchronization between DMA devices and CPUs, where a CPU can
> > >>>>>>>>   only write after a DMA copy has been done, the CPU must wait for
> > >>>>>>>>   the DMA completion to guarantee ordering. Once the completion has
> > >>>>>>>>   been returned, the completed operation is globally visible to all
> > >>>>>>>>   cores.
> > >>>>>>>
> > >>>>>>> Thanks for the explanation! Some questions though:
> > >>>>>>>
> > >>>>>>> In our case one CPU waits for completion and another CPU is
> > >>>>>>> actually using the data. IOW, "CPU must wait" is a bit ambiguous.
> > >>>>>>> Which CPU must wait?
> > >>>>>>>
> > >>>>>>> Or should it be "Once the completion is visible on any core, the
> > >>>>>>> completed operation is globally visible to all cores." ?
> > >>>>>>>
> > >>>>>>
> > >>>>>> The latter.
> > >>>>>> Once the change to memory/cache is visible to any core, it is
> > >>>>>> visible to all of them.
> > >>>>>> This applies to regular CPU memory writes too - at least on IA, and
> > >>>>>> I expect on many other architectures - once the write is visible
> > >>>>>> outside the current core it is visible to every other core. Once the
> > >>>>>> data hits the L1 or L2 cache of any core, any subsequent requests for
> > >>>>>> that data from any other core will "snoop" the latest data from that
> > >>>>>> core's cache, even if it has not made its way down to a shared cache,
> > >>>>>> e.g. L3 on most IA systems.
> > >>>>>
> > >>>>> It sounds like you're referring to the "multicopy atomicity" of the
> > >>>>> architecture. However, that is not a universally supported thing.
> > >>>>> AFAICT, POWER and older ARM systems don't support it, so writes
> > >>>>> performed by one core are not necessarily available to all other cores
> > >>>>> at the same time. That means that if CPU0 writes the data and the
> > >>>>> completion flag, CPU1 reads the completion flag and writes the ring,
> > >>>>> CPU2 may see the ring write, but may still not see the write of the
> > >>>>> data, even though there was a control dependency on CPU1.
> > >>>>> There should be a full memory barrier on CPU1 in order to fulfill the
> > >>>>> memory ordering requirements for CPU2, IIUC.
> > >>>>>
> > >>>>> In our scenario CPU0 is a DMA device, which may or may not be part
> > >>>>> of a CPU and may have different memory consistency/ordering
> > >>>>> requirements. So, the question is: does the DPDK DMA API guarantee
> > >>>>> multicopy atomicity between the DMA device and all CPU cores,
> > >>>>> regardless of CPU architecture and the nature of the DMA device?
> > >>>>>
> > >>>>
> > >>>> Right now, it doesn't, because this never came up in discussion. In
> > >>>> order to be useful, it sounds like it explicitly should do so. At least
> > >>>> for the Intel ioat and idxd driver cases, this will be supported, so we
> > >>>> just need to ensure all other drivers currently upstreamed can offer
> > >>>> this too. If they cannot, we cannot offer it as a global guarantee, and
> > >>>> we should see about adding a capability flag for this to indicate when
> > >>>> the guarantee is there or not.
> > >>>>
> > >>>> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to
> > >>>> document for dmadev that once a DMA operation is completed, the op is
> > >>>> guaranteed visible to all cores/threads? If not, any thoughts on what
> > >>>> guarantees we can provide in this regard, or what capabilities should
> > >>>> be exposed?
> > >>>
> > >>>
> > >>> Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru,
> > >>> @Gagandeep Singh, @Nipun Gupta,
> > >>> Requesting your valuable opinions on the queries in this thread.
> > >>
> > >> Sorry for the late reply; I didn't follow this thread.
> > >>
> > >> I don't think the DMA API should provide such a guarantee, because:
> > >> 1. DMA is an acceleration device, the same as an encryption/decryption
> > >>    device or a network device.
> > >> 2. For the Hisilicon Kunpeng platform:
> > >>    The DMA device supports:
> > >>    a) IO coherency: which means it can read the latest data, which may
> > >>       still be in a cache, and will invalidate the cache's data and
> > >>       write the data to DDR on writes.
> > >>    b) Order within one request: which means it only writes the completion
> > >>       descriptor after the copy is done.
> > >>    Note: order between multiple requests can be enforced through the
> > >>    fence mechanism.
> > >>    The DMA driver only needs to:
> > >>    a) Add one write memory barrier (a lightweight mb) when ringing the
> > >>       doorbell.
> > >>    So once the DMA is completed, the operation is guaranteed visible to
> > >>    all cores, and a 3rd core will observe the right order: core B
> > >>    prepares the data and issues the request to the DMA, the DMA starts
> > >>    work, core B gets the completion status.
> > >> 3. I worked on a TI multi-core SoC many years ago; the SoC didn't support
> > >>    cache coherence and consistency between cores. The SoC also had a DMA
> > >>    device with many channels. Here is a hypothetical design of a DMA
> > >>    driver for it within the DPDK DMA framework:
> > >>    The DMA driver should:
> > >>    a) write back the DMA's src buffer, so that no dirty cache data
> > >>       remains while the DMA is running.
> > >>    b) invalidate the DMA's dst buffer
> > >>    c) do a full mb
> > >>    d) update the DMA's registers.
> > >>    The DMA will then execute the copy task; it copies from DDR and writes
> > >>    to DDR, and after the copy it sets its status register to completed.
> > >>    In this case, the 3rd core will also observe the right order.
> > >>    A particular point of this is: if one buffer is shared between
> > >>    multiple cores, the application should explicitly maintain the cache.
> > >>
> > >> Based on the above, I don't think the DMA API should explicitly add this
> > >> to its description; it's the driver's, and even the application's (e.g.
> > >> on the above TI SoC), duty to make sure of it.
> > >>
> > > Hi,
> > >
> > > thanks for that. So if I understand correctly, your current HW does
> > > provide this guarantee, but you don't think it should always be the case
> > > for dmadev, correct?
> >
> > Yes, our HW will provide the guarantee.
> > If some HW cannot provide it, it's the driver's and maybe the
> > application's duty to provide it.
> >
> > >
> > > Based on that, what do you think should be the guarantee on completion?
> > > Once a job is completed, is the completion visible to the submitting
> > > core, or to the core reading the completion? Do you think it's acceptable
> > > to add a
> >
> > It will be visible to both cores.
> >
> > > capability flag for drivers to indicate that they do support a "globally
> > > visible" guarantee?
> >
> > I think the driver (together with the HW) should support the "globally
> > visible" guarantee.
> > And for some HW, even the application (or middleware) should care about it.
> >
>
> From a dmadev API viewpoint, whether the driver handles it or the HW
> itself, does not matter. However, if the application needs to take special
> actions to guarantee visibility, then that needs to be flagged as part of
> the dmadev API.
>
> I see three possibilities:
> 1. Wait until we have a driver that does not have global visibility on
>    return from rte_dma_completed, and at that point add a flag indicating
>    the lack of that support. Until then, document that results of ops will
>    be globally visible.
> 2. Add a flag now to allow drivers to indicate *lack* of global visibility,
>    and document that results are visible unless the flag is set.
> 3. Add a flag now to allow drivers to call out that all results are g.v.,
>    and update drivers to use this flag.
>
> I would be very much in favour of #1, because:
> * YAGNI principle - (subject to confirmation by other maintainers) if we
>   don't have a driver right now that needs non-g.v. behaviour, we may never
>   need one.
> * In the absence of a concrete case where g.v. is not guaranteed, we may
>   struggle to document correctly what the actual guarantees are, especially
>   if the submitter core and the completer core are different.

A big +1 to that!
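To make the guarantee under discussion concrete, here is a rough sketch of the
enqueue/fence/complete pattern, using the dmadev calls mentioned above
(rte_dma_copy, rte_dma_submit, rte_dma_completed and the fence flag). The
shared done_flag and all other names are only illustrative, and error handling
is omitted; this is not meant as a definitive implementation.

#include <stdbool.h>
#include <rte_dmadev.h>

/* Core B: enqueue two copies, where the descriptor copy must not start
 * before the payload copy has finished, then hand the result to core C.
 * Return values of the enqueue calls are ignored for brevity. */
static void
copy_and_publish(int16_t dev_id, uint16_t vchan,
		rte_iova_t data_src, rte_iova_t data_dst, uint32_t data_len,
		rte_iova_t desc_src, rte_iova_t desc_dst, uint32_t desc_len,
		volatile bool *done_flag)
{
	uint16_t last_idx, n = 0;
	bool has_error = false;

	/* Payload first... */
	rte_dma_copy(dev_id, vchan, data_src, data_dst, data_len, 0);
	/* ...then the descriptor, fenced so that HW which parallelizes
	 * (e.g. DSA) does not write the descriptor before the payload. */
	rte_dma_copy(dev_id, vchan, desc_src, desc_dst, desc_len,
			RTE_DMA_OP_FLAG_FENCE);
	rte_dma_submit(dev_id, vchan);

	/* Poll until both jobs are reported complete. */
	while (n < 2)
		n += rte_dma_completed(dev_id, vchan, 2 - n, &last_idx,
				&has_error);

	/* Hand off to core C. The open question in this thread is whether
	 * this plain store is enough, or whether an extra barrier is needed
	 * here on some devices/architectures before core C may safely read
	 * the copied data. */
	*done_flag = true;
}

Whether the final store to done_flag needs a preceding barrier on some
hardware is exactly what option #1/#2/#3 would pin down in the documentation.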
Perhaps the documentation can reflect that global visibility is provided by
current DMA hardware, and that if some future DMA hardware does not provide
it, the API will be changed (in some unspecified manner) to reflect this. My
point is: we should avoid a situation where the API stability policy makes it
impossible to add some future DMA hardware without global visibility.
Requiring applications to handle such future DMA hardware differently is
perfectly fine; the API (and its documentation) should just be open to it.

>
> @Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh,
> @Nipun Gupta,
> As driver maintainers, can you please confirm whether, on receipt of a
> completion from the HW/driver, the operation results are visible on all
> application cores, i.e. the app does not need additional barriers to
> propagate visibility to other cores? Your opinions on this discussion
> would also be useful.
>
> Regards,
> /Bruce
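PS: If we ever do end up with option #2 or #3, I imagine the application-side
check could look roughly like the sketch below. To be clear,
RTE_DMA_CAPA_GLOBAL_VISIBILITY is a purely hypothetical capability bit
invented here for illustration, and the barrier is only a placeholder; nothing
like this exists in the dmadev API today.

#include <stdbool.h>
#include <rte_dmadev.h>
#include <rte_atomic.h>

/* Hypothetical capability bit - NOT part of today's dmadev API. */
#define RTE_DMA_CAPA_GLOBAL_VISIBILITY (1ULL << 63)

/* Returns true if results reported by rte_dma_completed() may be assumed
 * visible to all cores without extra application-level barriers. */
static bool
dma_results_globally_visible(int16_t dev_id)
{
	struct rte_dma_info info;

	if (rte_dma_info_get(dev_id, &info) != 0)
		return false; /* be conservative on error */
	return (info.dev_capa & RTE_DMA_CAPA_GLOBAL_VISIBILITY) != 0;
}

/* At the hand-off point, e.g. before publishing a completion to another
 * core: */
static void
publish_result(int16_t dev_id, volatile bool *done_flag)
{
	if (!dma_results_globally_visible(dev_id))
		rte_smp_wmb(); /* placeholder; the exact barrier needed would
				* depend on what the driver documents */
	*done_flag = true;
}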