DPDK patches and discussions
From: Bruce Richardson <bruce.richardson@intel.com>
To: fengchengwen <fengchengwen@huawei.com>
Cc: "Pai G, Sunil" <sunil.pai.g@intel.com>,
	Ilya Maximets <i.maximets@ovn.org>,
	Radha Mohan Chintakuntla <radhac@marvell.com>,
	Veerasenareddy Burru <vburru@marvell.com>,
	Gagandeep Singh <g.singh@nxp.com>,
	Nipun Gupta <nipun.gupta@nxp.com>,
	"Stokes, Ian" <ian.stokes@intel.com>,
	"Hu, Jiayu" <jiayu.hu@intel.com>,
	"Ferriter, Cian" <cian.ferriter@intel.com>,
	"Van Haaren, Harry" <harry.van.haaren@intel.com>,
	"Maxime Coquelin (maxime.coquelin@redhat.com)"
	<maxime.coquelin@redhat.com>,
	"ovs-dev@openvswitch.org" <ovs-dev@openvswitch.org>,
	"dev@dpdk.org" <dev@dpdk.org>,
	"Mcnamara, John" <john.mcnamara@intel.com>,
	"O'Driscoll, Tim" <tim.odriscoll@intel.com>,
	"Finn, Emma" <emma.finn@intel.com>
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
Date: Fri, 13 May 2022 11:34:14 +0100	[thread overview]
Message-ID: <Yn40JmjvB6KD12lP@bricha3-MOBL.ger.corp.intel.com> (raw)
In-Reply-To: <a993ec37-323e-2e00-a423-3ecfbc3e7b35@huawei.com>

On Fri, May 13, 2022 at 05:48:35PM +0800, fengchengwen wrote:
> On 2022/5/13 17:10, Bruce Richardson wrote:
> > On Fri, May 13, 2022 at 04:52:10PM +0800, fengchengwen wrote:
> >> On 2022/4/8 14:29, Pai G, Sunil wrote:
> >>>> -----Original Message-----
> >>>> From: Richardson, Bruce <bruce.richardson@intel.com>
> >>>> Sent: Tuesday, April 5, 2022 5:38 PM
> >>>> To: Ilya Maximets <i.maximets@ovn.org>; Chengwen Feng
> >>>> <fengchengwen@huawei.com>; Radha Mohan Chintakuntla <radhac@marvell.com>;
> >>>> Veerasenareddy Burru <vburru@marvell.com>; Gagandeep Singh
> >>>> <g.singh@nxp.com>; Nipun Gupta <nipun.gupta@nxp.com>
> >>>> Cc: Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> >>>> <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter, Cian
> >>>> <cian.ferriter@intel.com>; Van Haaren, Harry <harry.van.haaren@intel.com>;
> >>>> Maxime Coquelin (maxime.coquelin@redhat.com) <maxime.coquelin@redhat.com>;
> >>>> ovs-dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> >>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> >>>> Finn, Emma <emma.finn@intel.com>
> >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
> >>>>> On 3/30/22 16:09, Bruce Richardson wrote:
> >>>>>> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
> >>>>>>> On 3/30/22 13:12, Bruce Richardson wrote:
> >>>>>>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
> >>>>>>>>> On 3/30/22 12:41, Ilya Maximets wrote:
> >>>>>>>>>> Forking the thread to discuss a memory consistency/ordering model.
> >>>>>>>>>>
> >>>>>>>>>> AFAICT, dmadev can be anything from part of a CPU to a completely
> >>>>>>>>>> separate PCI device.  However, I don't see any memory ordering being
> >>>>>>>>>> enforced or even described in the dmadev API or documentation.
> >>>>>>>>>> Please, point me to the correct documentation, if I somehow missed it.
> >>>>>>>>>>
> >>>>>>>>>> We have a DMA device (A) and a CPU core (B) writing respectively
> >>>>>>>>>> the data and the descriptor info.  CPU core (C) is reading the
> >>>>>>>>>> descriptor and the data it points to.
> >>>>>>>>>>
> >>>>>>>>>> A few things about that process:
> >>>>>>>>>>
> >>>>>>>>>> 1. There is no memory barrier between writes A and B (Did I miss
> >>>>>>>>>>    them?).  Meaning that those operations can be seen by C in a
> >>>>>>>>>>    different order regardless of barriers issued by C and regardless
> >>>>>>>>>>    of the nature of devices A and B.
> >>>>>>>>>>
> >>>>>>>>>> 2. Even if there is a write barrier between A and B, there is
> >>>>>>>>>>    no guarantee that C will see these writes in the same order
> >>>>>>>>>>    as C doesn't use real memory barriers because vhost advertises
> >>>>>>>>>
> >>>>>>>>> s/advertises/does not advertise/
> >>>>>>>>>
> >>>>>>>>>>    VIRTIO_F_ORDER_PLATFORM.
> >>>>>>>>>>
> >>>>>>>>>> So, I'm getting to conclusion that there is a missing write
> >>>>>>>>>> barrier on the vhost side and vhost itself must not advertise
> >>>>>>>>>> the
> >>>>>>>>>
> >>>>>>>>> s/must not/must/
> >>>>>>>>>
> >>>>>>>>> Sorry, I wrote things backwards. :)
> >>>>>>>>>
> >>>>>>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual
> >>>>>>>>>> memory barriers.
> >>>>>>>>>>
> >>>>>>>>>> Would like to hear some thoughts on that topic.  Is it a real issue?
> >>>>>>>>>> Is it an issue considering all possible CPU architectures and
> >>>>>>>>>> DMA HW variants?
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>> In terms of ordering of operations using dmadev:
> >>>>>>>>
> >>>>>>>> * Some DMA HW will perform all operations strictly in order, e.g. Intel
> >>>>>>>>   IOAT, while other hardware may not guarantee order of operations/do
> >>>>>>>>   things in parallel, e.g. Intel DSA. Therefore the dmadev API provides
> >>>>>>>>   the fence operation which allows the order to be enforced. The fence
> >>>>>>>>   can be thought of as a full memory barrier, meaning no jobs after the
> >>>>>>>>   barrier can be started until all those before it have completed.
> >>>>>>>>   Obviously, for HW where order is always enforced, this will be a
> >>>>>>>>   no-op, but for hardware that parallelizes, we want to reduce the
> >>>>>>>>   fences to get the best performance.
> >>>>>>>>
> >>>>>>>> * For synchronization between DMA devices and CPUs, where a CPU can
> >>>>>>>>   only write after a DMA copy has been done, the CPU must wait for the
> >>>>>>>>   DMA completion to guarantee ordering. Once the completion has been
> >>>>>>>>   returned, the completed operation is globally visible to all cores.
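
As a rough sketch of how the above maps onto the dmadev API (the device id,
vchan and IOVA parameters below are placeholders, and error handling is
simplified), the fence flag keeps a second job from overtaking a first one,
and the CPU then synchronizes on the reported completion:

#include <stdbool.h>
#include <stdint.h>
#include <rte_dmadev.h>

/* Enqueue a data copy followed by a descriptor update; the fence flag on the
 * second job stops hardware that parallelizes from starting it before the
 * first copy has completed. */
static int
copy_then_update(int16_t dev_id, uint16_t vchan,
		 rte_iova_t data_src, rte_iova_t data_dst, uint32_t data_len,
		 rte_iova_t desc_src, rte_iova_t desc_dst, uint32_t desc_len)
{
	uint16_t done = 0, last_idx;

	if (rte_dma_copy(dev_id, vchan, data_src, data_dst, data_len, 0) < 0)
		return -1;
	if (rte_dma_copy(dev_id, vchan, desc_src, desc_dst, desc_len,
			 RTE_DMA_OP_FLAG_FENCE) < 0)
		return -1;
	rte_dma_submit(dev_id, vchan);

	/* Wait for both jobs; per the discussion, the completion is what the
	 * CPU synchronizes on before using the copied data. */
	while (done < 2) {
		bool error = false;

		done += rte_dma_completed(dev_id, vchan, 2 - done,
					  &last_idx, &error);
		if (error)
			return -1;
	}
	return 0;
}
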
> >>>>>>>
> >>>>>>> Thanks for explanation!  Some questions though:
> >>>>>>>
> >>>>>>> In our case one CPU waits for completion and another CPU is actually
> >>>>>>> using the data.  IOW, "CPU must wait" is a bit ambiguous.  Which CPU
> >>>>>>> must wait?
> >>>>>>>
> >>>>>>> Or should it be "Once the completion is visible on any core, the
> >>>>>>> completed operation is globally visible to all cores." ?
> >>>>>>>
> >>>>>>
> >>>>>> The latter.
> >>>>>> Once the change to memory/cache is visible to any core, it is
> >>>>>> visible to all of them. This applies to regular CPU memory writes too -
> >>>>>> at least on IA, and I expect on many other architectures - once the
> >>>>>> write is visible outside the current core it is visible to every
> >>>>>> other core. Once the data hits the L1 or L2 cache of any core, any
> >>>>>> subsequent requests for that data from any other core will "snoop"
> >>>>>> the latest data from that core's cache, even if it has not made its
> >>>>>> way down to a shared cache, e.g. L3 on most IA systems.
> >>>>>
> >>>>> It sounds like you're referring to the "multicopy atomicity" of the
> >>>>> architecture.  However, that is not a universally supported thing.
> >>>>> AFAICT, POWER and older ARM systems don't support it, so writes
> >>>>> performed by one core are not necessarily available to all other cores
> >>>>> at the same time.  That means that if the CPU0 writes the data and the
> >>>>> completion flag, CPU1 reads the completion flag and writes the ring,
> >>>>> CPU2 may see the ring write, but may still not see the write of the
> >>>>> data, even though there was a control dependency on CPU1.
> >>>>> There should be a full memory barrier on CPU1 in order to fulfill the
> >>>>> memory ordering requirements for CPU2, IIUC.
> >>>>>
> >>>>> In our scenario the CPU0 is a DMA device, which may or may not be part
> >>>>> of a CPU and may have different memory consistency/ordering
> >>>>> requirements.  So, the question is: does DPDK DMA API guarantee
> >>>>> multicopy atomicity between DMA device and all CPU cores regardless of
> >>>>> CPU architecture and a nature of the DMA device?
> >>>>>
> >>>>
> >>>> Right now, it doesn't because this never came up in discussion. In order
> >>>> to be useful, it sounds like it explicitly should do so. At least for the
> >>>> Intel ioat and idxd driver cases, this will be supported, so we just need
> >>>> to ensure all other drivers currently upstreamed can offer this too. If
> >>>> they cannot, we cannot offer it as a global guarantee, and we should see
> >>>> about adding a capability flag for this to indicate when the guarantee is
> >>>> there or not.
> >>>>
> >>>> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to
> >>>> document for dmadev that once a DMA operation is completed, the op is
> >>>> guaranteed visible to all cores/threads? If not, any thoughts on what
> >>>> guarantees we can provide in this regard, or what capabilities should be
> >>>> exposed?
> >>>
> >>>
> >>>
> >>> Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh, @Nipun Gupta,
> >>> Requesting your valuable opinions for the queries on this thread.
> >>
> >> Sorry for the late reply; I wasn't following this thread.
> >>
> >> I don't think the DMA API should provide such a guarantee, because:
> >> 1. DMA is an acceleration device, just like an encryption/decryption device or a network device.
> >> 2. For the Hisilicon Kunpeng platform:
> >>    The DMA device supports:
> >>      a) IO coherency: it can read the latest data, which may still be in a cache, and on writes it will
> >>         invalidate the cached data and write the data to DDR.
> >>      b) Ordering within one request: it only writes the completion descriptor after the copy is done.
> >>         Note: ordering between multiple requests can be implemented through the fence mechanism.
> >>    The DMA driver only needs to:
> >>      a) Add one write memory barrier (a lightweight mb) when ringing the doorbell.
> >>    So once the DMA is completed, the operation is guaranteed to be visible to all cores,
> >>    and a third core will observe the right order: core B prepares the data and issues the request to the
> >>    DMA, the DMA starts work, core B gets the completion status.
> >> 3. I worked on a TI multi-core SoC many years ago; the SoC didn't support cache coherence and consistency
> >>    between cores. The SoC also had a DMA device with many channels. Here is a hypothetical design of the
> >>    DMA driver within the DPDK DMA framework:
> >>    The DMA driver should:
> >>      a) write back the DMA's src buffer, so that no data is left in the cache while the DMA is running.
> >>      b) invalidate the DMA's dst buffer
> >>      c) do a full mb
> >>      d) update the DMA's registers.
> >>    Then the DMA will execute the copy task: it copies from DDR and writes to DDR, and after the copy it
> >>    sets its status register to completed.
> >>    In this case, the third core will also observe the right order.
> >>    A particular point here: if a buffer is shared across multiple cores, the application should explicitly
> >>    maintain the cache.
> >>
> >> Based on the above, I don't think the DMA API should document this explicitly; it's the driver's, and
> >> sometimes even the application's (e.g. on the above TI SoC), duty to make sure of it.
> >>
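
To make the hypothetical non-coherent SoC case in point 3 above concrete, the
driver-side submit path might look roughly like the sketch below; the
cache-maintenance helpers and the doorbell register are placeholders for
SoC-specific details, and only rte_mb() and rte_write32() are actual DPDK calls:

#include <stddef.h>
#include <stdint.h>
#include <rte_atomic.h>	/* rte_mb() */
#include <rte_io.h>	/* rte_write32() */

/* Placeholder SoC-specific cache-maintenance hooks. */
extern void soc_cache_writeback(const void *addr, size_t len);
extern void soc_cache_invalidate(void *addr, size_t len);

/* Sketch of submitting one copy on a SoC without cache coherency. */
static void
dma_submit_copy(volatile void *doorbell_reg,
		const void *src, void *dst, size_t len)
{
	soc_cache_writeback(src, len);  /* a) flush src so the DMA reads current data */
	soc_cache_invalidate(dst, len); /* b) drop stale dst lines before the DMA writes */
	rte_mb();                       /* c) full barrier before touching the device */
	rte_write32(1, doorbell_reg);   /* d) ring the (placeholder) doorbell register */
}
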
> > Hi,
> > 
> > thanks for that. So if I understand correctly, your current HW does provide
> > this guarantee, but you don't think it should always be the case for
> > dmadev, correct?
> 
> Yes, our HW will provide the guarantee.
> If some HW cannot provide it, it's the driver's, and maybe the application's, duty to provide it.
> 
> > 
> > Based on that, what do you think should be the guarantee on completion?
> > Once a job is completed, the completion is visible to the submitting core,
> > or the core reading the completion? Do you think it's acceptable to add a
> 
> It will be visible to both cores.
> 
> > capability flag for drivers to indicate that they do support a "globally
> > visible" guarantee?
> 
> I think the driver (together with the HW) should support the "globally visible" guarantee.
> And for some HW, even the application (or middleware) should take care of it.
> 

From a dmadev API viewpoint, it does not matter whether the guarantee comes
from the driver or from the HW itself. However, if the application needs to
take special actions to guarantee visibility, then that needs to be flagged
as part of the dmadev API.

I see three possibilities:
1 Wait until we have a driver that does not have global visibility on
  return from rte_dma_completed, and at that point add a flag indicating
  the lack of that support. Until then, document that results of ops will
  be globally visible.
2 Add a flag now to allow drivers to indicate *lack* of global visibility,
  and document that results are visible unless the flag is set.
3 Add a flag now to allow drivers to call out that all results are g.v., and
  update drivers to use this flag.

I would be very much in favour of #1, because:
* YAGNI principle - (subject to confirmation by other maintainers) if we
  don't have a driver right now that needs non-g.v. behaviour we may never
  need one.
* In the absence of a concrete case where g.v. is not guaranteed, we may
  struggle to document correctly what the actual guarantees are, especially if
  submitter core and completer core are different.
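
Purely to illustrate what option 2 or 3 would look like from the application
side, a check against such a capability bit might be along these lines; the
flag name and bit position are hypothetical, as no such capability exists in
dmadev today:

#include <stdbool.h>
#include <stdint.h>
#include <rte_dmadev.h>

/* Hypothetical capability bit, for illustration only. */
#define DMA_CAPA_GLOBALLY_VISIBLE_COMPLETIONS (1ULL << 33)

static bool
dma_completions_globally_visible(int16_t dev_id)
{
	struct rte_dma_info info;

	if (rte_dma_info_get(dev_id, &info) != 0)
		return false;
	return (info.dev_capa & DMA_CAPA_GLOBALLY_VISIBLE_COMPLETIONS) != 0;
}
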

@Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh, @Nipun Gupta,
As driver maintainers, can you please confirm whether, on receipt of a completion
from the HW/driver, the operation results are visible to all application cores,
i.e. the app does not need additional barriers to propagate visibility to
other cores. Your opinions on this discussion would also be useful.

Regards,
/Bruce


Thread overview: 59+ messages
2022-03-24 15:36 Stokes, Ian
2022-03-28 18:19 ` Pai G, Sunil
2022-03-29 12:51   ` Morten Brørup
2022-03-29 13:01     ` Van Haaren, Harry
2022-03-29 14:44       ` Morten Brørup
2022-03-29 16:24         ` Maxime Coquelin
2022-03-29 16:45           ` Morten Brørup
2022-03-29 17:03             ` Bruce Richardson
2022-03-29 17:13               ` Morten Brørup
2022-03-29 17:45                 ` Ilya Maximets
2022-03-29 18:46                   ` Morten Brørup
2022-03-30  2:02                   ` Hu, Jiayu
2022-03-30  9:25                     ` Maxime Coquelin
2022-03-30 10:20                       ` Bruce Richardson
2022-03-30 14:27                       ` Hu, Jiayu
2022-03-29 17:46                 ` Van Haaren, Harry
2022-03-29 19:59                   ` Morten Brørup
2022-03-30  9:01                     ` Van Haaren, Harry
2022-04-07 14:04                       ` Van Haaren, Harry
2022-04-07 14:25                         ` Maxime Coquelin
2022-04-07 14:39                           ` Ilya Maximets
2022-04-07 14:42                             ` Van Haaren, Harry
2022-04-07 15:01                               ` Ilya Maximets
2022-04-07 15:46                                 ` Maxime Coquelin
2022-04-07 16:04                                   ` Bruce Richardson
2022-04-08  7:13                             ` Hu, Jiayu
2022-04-08  8:21                               ` Morten Brørup
2022-04-08  9:57                               ` Ilya Maximets
2022-04-20 15:39                                 ` Mcnamara, John
2022-04-20 16:41                                 ` Mcnamara, John
2022-04-25 21:46                                   ` Ilya Maximets
2022-04-27 14:55                                     ` Mcnamara, John
2022-04-27 20:34                                     ` Bruce Richardson
2022-04-28 12:59                                       ` Ilya Maximets
2022-04-28 13:55                                         ` Bruce Richardson
2022-05-03 19:38                                         ` Van Haaren, Harry
2022-05-10 14:39                                           ` Van Haaren, Harry
2022-05-24 12:12                                           ` Ilya Maximets
2022-03-30 10:41   ` Ilya Maximets
2022-03-30 10:52     ` Ilya Maximets
2022-03-30 11:12       ` Bruce Richardson
2022-03-30 11:41         ` Ilya Maximets
2022-03-30 14:09           ` Bruce Richardson
2022-04-05 11:29             ` Ilya Maximets
2022-04-05 12:07               ` Bruce Richardson
2022-04-08  6:29                 ` Pai G, Sunil
2022-05-13  8:52                   ` fengchengwen
2022-05-13  9:10                     ` Bruce Richardson
2022-05-13  9:48                       ` fengchengwen
2022-05-13 10:34                         ` Bruce Richardson [this message]
2022-05-16  9:04                           ` Morten Brørup
2022-05-16 22:31                           ` [EXT] " Radha Chintakuntla
  -- strict thread matches above, loose matches on Subject: below --
2022-04-25 15:19 Mcnamara, John
2022-04-21 14:57 Mcnamara, John
     [not found] <DM6PR11MB3227AC0014F321EB901BE385FC199@DM6PR11MB3227.namprd11.prod.outlook.com>
2022-04-21 11:51 ` Mcnamara, John
     [not found] <DM8PR11MB5605B4A5DBD79FFDB4B1C3B2BD0A9@DM8PR11MB5605.namprd11.prod.outlook.com>
2022-03-21 18:23 ` Pai G, Sunil
2022-03-15 15:48 Stokes, Ian
2022-03-15 13:17 Stokes, Ian
2022-03-15 11:15 Stokes, Ian
