DPDK patches and discussions
From: Bruce Richardson <bruce.richardson@intel.com>
To: Ilya Maximets <i.maximets@ovn.org>
Cc: "Mcnamara, John" <john.mcnamara@intel.com>,
	"Hu, Jiayu" <jiayu.hu@intel.com>,
	"Maxime Coquelin" <maxime.coquelin@redhat.com>,
	"Van Haaren, Harry" <harry.van.haaren@intel.com>,
	"Morten Brørup" <mb@smartsharesystems.com>,
	"Pai G, Sunil" <sunil.pai.g@intel.com>,
	"Stokes, Ian" <ian.stokes@intel.com>,
	"Ferriter, Cian" <cian.ferriter@intel.com>,
	"ovs-dev@openvswitch.org" <ovs-dev@openvswitch.org>,
	"dev@dpdk.org" <dev@dpdk.org>,
	"O'Driscoll, Tim" <tim.odriscoll@intel.com>,
	"Finn, Emma" <emma.finn@intel.com>
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
Date: Thu, 28 Apr 2022 14:55:15 +0100
Message-ID: <Ymqcw82xGuNHy4Dy@bricha3-MOBL.ger.corp.intel.com>
In-Reply-To: <d747600a-6182-5104-ec16-0d16cdba3acf@ovn.org>

On Thu, Apr 28, 2022 at 02:59:37PM +0200, Ilya Maximets wrote:
> On 4/27/22 22:34, Bruce Richardson wrote:
> > On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
> >> On 4/20/22 18:41, Mcnamara, John wrote:
> >>>> -----Original Message-----
> >>>> From: Ilya Maximets <i.maximets@ovn.org>
> >>>> Sent: Friday, April 8, 2022 10:58 AM
> >>>> To: Hu, Jiayu <jiayu.hu@intel.com>; Maxime Coquelin
> >>>> <maxime.coquelin@redhat.com>; Van Haaren, Harry
> >>>> <harry.van.haaren@intel.com>; Morten Brørup <mb@smartsharesystems.com>;
> >>>> Richardson, Bruce <bruce.richardson@intel.com>
> >>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes, Ian
> >>>> <ian.stokes@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; ovs-
> >>>> dev@openvswitch.org; dev@dpdk.org; Mcnamara, John
> >>>> <john.mcnamara@intel.com>; O'Driscoll, Tim <tim.odriscoll@intel.com>;
> >>>> Finn, Emma <emma.finn@intel.com>
> >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>> On 4/8/22 09:13, Hu, Jiayu wrote:
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Ilya Maximets <i.maximets@ovn.org>
> >>>>>> Sent: Thursday, April 7, 2022 10:40 PM
> >>>>>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Van Haaren, Harry
> >>>>>> <harry.van.haaren@intel.com>; Morten Brørup
> >>>>>> <mb@smartsharesystems.com>; Richardson, Bruce
> >>>>>> <bruce.richardson@intel.com>
> >>>>>> Cc: i.maximets@ovn.org; Pai G, Sunil <sunil.pai.g@intel.com>; Stokes,
> >>>>>> Ian <ian.stokes@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Ferriter,
> >>>>>> Cian <cian.ferriter@intel.com>; ovs-dev@openvswitch.org;
> >>>>>> dev@dpdk.org; Mcnamara, John <john.mcnamara@intel.com>; O'Driscoll,
> >>>>>> Tim <tim.odriscoll@intel.com>; Finn, Emma <emma.finn@intel.com>
> >>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>>>
> >>>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
> >>>>>>> Hi Harry,
> >>>>>>>
> >>>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
> >>>>>>>> Hi OVS & DPDK, Maintainers & Community,
> >>>>>>>>
> >>>>>>>> Top posting overview of discussion as replies to thread become
> >>>>>>>> slower: perhaps it is a good time to review and plan for next steps?
> >>>>>>>>
> >>>>>>>> From my perspective, those most vocal in the thread seem to be
> >>>>>>>> in favour of the clean rx/tx split ("defer work"), with the
> >>>>>>>> tradeoff that the application must be aware of handling the async
> >>>>>>>> DMA completions. If there are any concerns opposing upstreaming of
> >>>>>>>> this method, please indicate this promptly, and we can continue
> >>>>>>>> technical discussions here now.
> >>>>>>>
> >>>>>>> Weren't there some discussions about handling the Virtio completions
> >>>>>>> with the DMA engine? With that, we wouldn't need the deferral of work.
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> With the virtio completions handled by DMA itself, the vhost port
> >>>>>> turns almost into a real HW NIC.  With that we will not need any
> >>>>>> extra manipulations from the OVS side, i.e. no need to defer any work
> >>>>>> while maintaining clear split between rx and tx operations.
> >>>>>
> >>>>> First, making DMA do 2B copy would sacrifice performance, and I think
> >>>>> we all agree on that.
> >>>>
> >>>> I do not agree with that.  Yes, 2B copy by DMA will likely be slower than
> >>>> done by CPU, however CPU is going away for dozens or even hundreds of
> >>>> thousands of cycles to process a new packet batch or service other ports,
> >>>> hence DMA will likely complete the transmission faster than waiting for
> >>>> the CPU thread to come back to that task.  In any case, this has to be
> >>>> tested.
> >>>>
> >>>>> Second, this method comes with an issue of ordering.
> >>>>> For example, PMD thread0 enqueues 10 packets to vring0 first, then PMD
> >>>>> thread1 enqueues 20 packets to vring0. If PMD thread0 and thread1 have
> >>>>> their own dedicated DMA devices dma0 and dma1, the flag/index update for
> >>>>> the first 10 packets is done by dma0, and the flag/index update for the
> >>>>> remaining 20 packets is done by dma1. But there is no ordering guarantee
> >>>>> among different DMA devices, so the flag/index update may go wrong. If
> >>>>> PMD threads don't have dedicated DMA devices, which means DMA devices are
> >>>>> shared among threads, we need a lock and pay for lock contention in the
> >>>>> data-path. Or we can allocate DMA devices for vrings dynamically to avoid
> >>>>> DMA sharing among threads. But what's the overhead of the allocation
> >>>>> mechanism? Who does it? Any thoughts?
> >>>>
> >>>> 1. DMA completion was discussed in context of per-queue allocation, so
> >>>>    there is no re-ordering in this case.
> >>>>
> >>>> 2. Overhead can be minimal if the allocated device can stick to the queue
> >>>>    for a reasonable amount of time without re-allocation on every send.
> >>>>    You may look at the XPS implementation in lib/dpif-netdev.c in OVS for
> >>>>    an example of such a mechanism.  For sure it can not be the same, but
> >>>>    ideas can be re-used.
> >>>>
> >>>> 3. Locking doesn't mean contention if resources are allocated/distributed
> >>>>    thoughtfully.
> >>>>
> >>>> 4. Allocation can be done by either OVS or the vhost library itself.  I'd
> >>>>    vote for doing that inside the vhost library, so any DPDK application
> >>>>    and vhost ethdev can use it without re-inventing from scratch.  It
> >>>>    also should be simpler from the API point of view if allocation and
> >>>>    usage are in the same place.  But I don't have a strong opinion here
> >>>>    as for now, since no real code examples exist, so it's hard to
> >>>>    evaluate what they could look like.
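
For illustration only, points 1 and 2 above could be sketched roughly as
follows. This is not the XPS code from lib/dpif-netdev.c and not a vhost
library API: dma_map_sketch, dma_map_init() and dma_for_queue() are
hypothetical names, and the "pick the least-used device" policy is an
assumption. The sketch only shows a DMA device being bound to a queue on
first use and then sticking to it, so all updates for one queue go through
one device.

```c
#include <assert.h>

#define N_DMA    2      /* assumed number of DMA devices */
#define N_QUEUES 4      /* assumed number of vhost queues */
#define DMA_NONE (-1)

/* Hypothetical bookkeeping: which DMA device each queue is bound to. */
struct dma_map_sketch {
    int queue_to_dma[N_QUEUES];  /* DMA_NONE until first use */
    int dma_users[N_DMA];        /* number of queues bound to each device */
};

static void dma_map_init(struct dma_map_sketch *m)
{
    for (int q = 0; q < N_QUEUES; q++)
        m->queue_to_dma[q] = DMA_NONE;
    for (int d = 0; d < N_DMA; d++)
        m->dma_users[d] = 0;
}

/* Return the DMA device bound to 'queue'.  On first use, bind the
 * least-used device; afterwards the binding is sticky, so there is no
 * re-allocation on every send. */
static int dma_for_queue(struct dma_map_sketch *m, int queue)
{
    assert(queue >= 0 && queue < N_QUEUES);

    if (m->queue_to_dma[queue] == DMA_NONE) {
        int best = 0;
        for (int d = 1; d < N_DMA; d++) {
            if (m->dma_users[d] < m->dma_users[best])
                best = d;
        }
        m->dma_users[best]++;
        m->queue_to_dma[queue] = best;
    }
    return m->queue_to_dma[queue];
}
```

A real mechanism would also need to release or rebalance bindings after
some idle time and to handle concurrent callers; both are left out here.
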
> >>>>
> >>>> But I feel like we're starting to run in circles here as I did already say
> >>>> most of that before.
> >>>
> >>>
> >>
> >> Hi, John.
> >>
> >> Just reading this email as I was on PTO for the last 1.5 weeks
> >> and didn't get through all the emails yet.
> >>
> >>> This does seem to be going in circles, especially since there seemed to be technical alignment on the last public call on March 29th.
> >>
> >> I guess, there is a typo in the date here.
> >> It seems to be 26th, not 29th.
> >>
> >>> It is not feasible to do a real world implementation/POC of every design proposal.
> >>
> >> FWIW, I think it makes sense to PoC and test options that are
> >> going to be simply unavailable going forward if not explored now.
> >> Especially because we don't have any good solutions anyway
> >> ("Deferral of Work" is an architecturally wrong solution for OVS).
> >>
> > 
> > Hi Ilya,
> > 
> > for those of us who haven't spent a long time working on OVS, can you
> > perhaps explain a bit more as to why it is architecturally wrong? From my
> > experience with DPDK, use of any lookaside accelerator, not just DMA but
> > any crypto, compression or otherwise, requires asynchronous operation, and
> > therefore some form of setting work aside temporarily to do other tasks.
> 
> OVS doesn't use any lookaside accelerators and doesn't have any
> infrastructure for them.
> 
> 
> Let me create a DPDK analogy of what is proposed for OVS.
> 
> DPDK has an ethdev API that abstracts different device drivers for
> the application.  This API has a rte_eth_tx_burst() function that
> is supposed to send packets through the particular network interface.
> 
> Imagine now that there is a network card that is not capable of
> sending packets right away and requires the application to come
> back later to finish the operation.  That is an obvious problem,
> because rte_eth_tx_burst() doesn't require any extra actions and
> doesn't take ownership of packets that weren't consumed.
> 
> The proposed solution for this problem is to change the ethdev API:
> 
> 1. Allow rte_eth_tx_burst() to return -EINPROGRESS that effectively
>    means that packets were acknowledged, but not actually sent yet.
> 
> 2. Require the application to call the new rte_eth_process_async()
>    function sometime later until it doesn't return -EINPROGRESS
>    anymore, in case the original rte_eth_tx_burst() call returned
>    -EINPROGRESS.
> 
> The main reason why this proposal is questionable:
> 
> It's only one specific device that requires this special handling,
> all other devices are capable of sending packets right away.
> However, every DPDK application now has to implement some kind
> of "Deferral of Work" mechanism in order to be compliant with
> the updated DPDK ethdev API.
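
A minimal sketch of the analogy above, to make the application-side burden
concrete. None of these names exist in DPDK: mock_port, mock_tx_burst(),
mock_process_async() and app_send() are hypothetical stand-ins for the
imagined API change, and the per-call completion budget is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical port model; not a DPDK structure. */
struct mock_port {
    int      is_async;   /* 1 only for the single "special" device */
    uint16_t in_flight;  /* packets accepted but not yet sent */
    uint64_t sent;       /* packets fully transmitted */
};

/* Imagined tx_burst semantics: ordinary ports send right away and return
 * the packet count; the async port takes the packets and returns
 * -EINPROGRESS. */
static int mock_tx_burst(struct mock_port *p, uint16_t nb_pkts)
{
    if (!p->is_async) {
        p->sent += nb_pkts;
        return nb_pkts;
    }
    p->in_flight += nb_pkts;
    return -EINPROGRESS;
}

/* Imagined completion call: retires up to 'budget' packets per call and
 * returns -EINPROGRESS while work remains, 0 once drained. */
static int mock_process_async(struct mock_port *p, uint16_t budget)
{
    uint16_t done = p->in_flight < budget ? p->in_flight : budget;

    p->in_flight -= done;
    p->sent += done;
    return p->in_flight ? -EINPROGRESS : 0;
}

/* What every application would now have to carry: remember that a send
 * is pending and keep coming back to it.  Returns the number of extra
 * polls that were needed. */
static int app_send(struct mock_port *p, uint16_t nb_pkts, uint16_t budget)
{
    int polls = 0;

    assert(budget > 0);  /* a zero budget would never drain the port */
    if (mock_tx_burst(p, nb_pkts) == -EINPROGRESS) {
        do {
            polls++;     /* a real application would do other work here */
        } while (mock_process_async(p, budget) == -EINPROGRESS);
    }
    return polls;
}
```

In this sketch an ordinary port needs no extra polls, while the async port
forces the deferral loop onto the caller even though only one device type
requires it.
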
> 
> Will DPDK make this API change?
> I have no voice in DPDK API design decisions, but I'd argue against.
> 
> Interestingly, that's not really an imaginary proposal.  That is
> an exact change required for DPDK ethdev API in order to add
> vhost async support to the vhost ethdev driver.
> 
> Going back to OVS:
> 
> An oversimplified architecture of OVS has 3 layers (top to bottom):
> 
> 1. OFproto - the layer that handles OpenFlow.
> 2. Datapath Interface - packet processing.
> 3. Netdev - abstraction on top of all the different port types.
> 
> Each layer has its own API that allows different implementations
> of the same layer to be used interchangeably without any modifications
> to higher layers.  That's what APIs and encapsulation are for.
> 
> So, the Netdev layer has its own API and this API is actually very
> similar to DPDK's ethdev API.  Simply because they are serving
> the same purpose - abstraction on top of different network interfaces.
> Besides different types of DPDK ports, there are also several types
> of native Linux, BSD and Windows ports, and a variety of different
> tunnel ports.
> 
> Datapath interface layer is an "application" from the ethdev analogy
> above.
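
The layering described above can be sketched as a table of function
pointers per port type, with the datapath calling through it. This is a
simplified illustration, not OVS code: port_ops_sketch, port_sketch,
dpdk_send_sketch(), tap_send_sketch() and datapath_output() are
hypothetical names, and the real netdev provider interface in OVS is
considerably larger.

```c
#include <stddef.h>

struct port_sketch;

/* Hypothetical per-port-type operations table (the "netdev API"). */
struct port_ops_sketch {
    const char *type;
    int (*send)(struct port_sketch *port, int n_pkts);  /* 0 on success */
};

struct port_sketch {
    const struct port_ops_sketch *ops;
    int tx_count;  /* packets handed to this port so far */
};

/* Two stand-in implementations of the same API. */
static int dpdk_send_sketch(struct port_sketch *port, int n_pkts)
{
    port->tx_count += n_pkts;
    return 0;
}

static int tap_send_sketch(struct port_sketch *port, int n_pkts)
{
    port->tx_count += n_pkts;
    return 0;
}

static const struct port_ops_sketch dpdk_ops = { "dpdk", dpdk_send_sketch };
static const struct port_ops_sketch tap_ops  = { "tap",  tap_send_sketch };

/* The "application" layer: it outputs to every port through the common
 * API and never needs to know which port type it is talking to. */
static int datapath_output(struct port_sketch *ports, size_t n_ports,
                           int n_pkts)
{
    int err = 0;

    for (size_t i = 0; i < n_ports; i++) {
        int rc = ports[i].ops->send(&ports[i], n_pkts);
        if (rc)
            err = rc;  /* report the last failure, keep sending */
    }
    return err;
}
```

Changing the meaning of the send() return value for one port type would
ripple into datapath_output() and every other caller of the common API,
which is the concern raised here.
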
> 
> What is proposed by the "Deferral of Work" solution is to make pretty
> much the same API change that I described, but to the netdev layer API
> inside OVS, and introduce a fairly complex (and questionable,
> but I'm not going into that right now) machinery to handle that API
> change in the datapath interface layer.
> 
> So, exactly the same problem is here:
> 
> If the API change is needed only for a single port type in a very
> specific hardware environment, why do we need to change the common
> API and rework a lot of the code in upper layers in order to accommodate
> that API change, while it makes no practical sense for any other
> port types or more generic hardware setups?
> And similar changes will have to be done in any other DPDK application
> that is not bound to a specific hardware, but wants to support vhost
> async.
> 
> The right solution, IMO, is to make vhost async behave as any other
> physical NIC, since it is essentially a physical NIC now (we're not
> using DMA directly, it's a combined vhost+DMA solution), instead of
> propagating quirks of the single device to a common API.
> 
> And going back to DPDK, this implementation doesn't allow use of
> vhost async in the DPDK's own vhost ethdev driver.
> 
> My initial reply to the "Deferral of Work" RFC with pretty much
> the same concerns:
>   https://patchwork.ozlabs.org/project/openvswitch/patch/20210907111725.43672-2-cian.ferriter@intel.com/#2751799
> 
> Best regards, Ilya Maximets.

Thanks for the clear explanation. Gives me a much better idea of the view
from your side of things.

/Bruce
